CN109767757A - Meeting minutes generation method and device - Google Patents

Meeting minutes generation method and device

Info

Publication number
CN109767757A
Authority
CN
China
Prior art keywords
speech segment
category
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910038460.6A
Other languages
Chinese (zh)
Inventor
吴欢
田甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910038460.6A
Publication of CN109767757A
Priority to PCT/CN2019/118256 (published as WO2020147407A1)
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification

Abstract

Embodiments of the present invention provide a meeting minutes generation method and device. The invention relates to the field of artificial intelligence. The method comprises: obtaining conference audio; splitting the conference audio to obtain N speech segments, N being a natural number greater than or equal to 2; clustering the N speech segments to obtain speech segments of M categories, M being a natural number greater than or equal to 2 and M ≤ N, the speech segments of the M categories being in one-to-one correspondence with M speakers; determining the speaker corresponding to the speech segments of each of the M categories; determining the speech content of each of the M speakers according to the speech segments of the M categories; and generating meeting minutes according to the speech content of each of the M speakers. The technical solution provided by the embodiments of the present invention therefore solves the prior-art problem that manually compiling meeting minutes is time-consuming, laborious, and inefficient.

Description

Meeting minutes generation method and device
[Technical field]
The present invention relates to the field of artificial intelligence, and in particular to a meeting minutes generation method and device.
[Background art]
During a meeting, a designated note-taker records and organizes the speech content of each speaker to form the meeting minutes. When the meeting is long and a large amount of content needs to be recorded, manually compiling the minutes is time-consuming, laborious, and inefficient.
[Summary of the invention]
In view of this, embodiments of the present invention provide a meeting minutes generation method and device to solve the prior-art problem that manually compiling meeting minutes is time-consuming, laborious, and inefficient.
In one aspect, an embodiment of the present invention provides a meeting minutes generation method, the method comprising: obtaining conference audio; splitting the conference audio to obtain N speech segments, N being a natural number greater than or equal to 2; clustering the N speech segments to obtain speech segments of M categories, M being a natural number greater than or equal to 2 and M ≤ N, the speech segments of the M categories being in one-to-one correspondence with M speakers; determining the speaker corresponding to the speech segments of each of the M categories; determining the speech content of each of the M speakers according to the speech segments of the M categories; and generating meeting minutes according to the speech content of each of the M speakers.
Further, determining the speaker corresponding to the speech segments of each of the M categories comprises: selecting at least one speech segment from the speech segments of each of the M categories and converting it into a text segment, obtaining L text segments, L being a natural number with L ≥ M; displaying the L text segments and a speaker list to a user, the speaker list including information on each of the M speakers; receiving a matching instruction issued by the user, the matching instruction instructing that each of the L text segments be matched with a speaker; and determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Further, determining the speaker corresponding to the speech segments of each of the M categories comprises: selecting at least one speech segment from the speech segments of each of the M categories, obtaining Z speech segments, Z being a natural number with Z ≥ M; playing the Z selected speech segments to a user and displaying a speaker list, the speaker list including information on each of the M speakers; receiving a matching instruction issued by the user, the matching instruction instructing that each of the Z speech segments be matched with a speaker; and determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Further, clustering the N speech segments comprises: S1: randomly selecting M speech segments from the N speech segments and using the M selected speech segments as the cluster centers of M categories; S2: for the i-th speech segment among the remaining N-M speech segments, computing the distance between the i-th speech segment and each of the M cluster centers, and assigning the i-th speech segment to the category whose cluster center is nearest to it, i taking each natural number from 1 to N-M in turn; S3: after the segments have all been assigned, recomputing the cluster center of each of the M categories from the speech segments that category contains and updating the M cluster centers; and repeating S2 and S3 until, for every one of the M categories, the distance between the cluster centers of two successive iterations is within a preset distance.
Further, splitting the conference audio to obtain the N speech segments comprises: determining the silent segments in the conference audio; removing the silent segments from the conference audio; splitting the conference audio with the silent segments removed according to the silent segments, obtaining W long speech segments, W being a natural number greater than or equal to 2 with W < N; extracting the acoustic features of each of the W long speech segments; performing relative entropy analysis on the acoustic features of each of the W long speech segments; and cutting the W long speech segments according to the result of the relative entropy analysis, obtaining the N speech segments.
In one aspect, an embodiment of the present invention provides a meeting minutes generation device, the device comprising: an obtaining unit for obtaining conference audio; a splitting unit for splitting the conference audio to obtain N speech segments, N being a natural number greater than or equal to 2; a clustering unit for clustering the N speech segments to obtain speech segments of M categories, M being a natural number greater than or equal to 2 and M ≤ N, the speech segments of the M categories being in one-to-one correspondence with M speakers; a first determining unit for determining the speaker corresponding to the speech segments of each of the M categories; a second determining unit for determining the speech content of each of the M speakers according to the speech segments of the M categories; and a generating unit for generating meeting minutes according to the speech content of each of the M speakers.
Further, the first determining unit comprises: a first selection subunit for selecting at least one speech segment from the speech segments of each of the M categories and converting it into a text segment, obtaining L text segments, L being a natural number with L ≥ M; a first display subunit for displaying the L text segments and a speaker list to a user, the speaker list including information on each of the M speakers; a first receiving subunit for receiving a matching instruction issued by the user, the matching instruction instructing that each of the L text segments be matched with a speaker; and a first determining subunit for determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Further, the first determining unit comprises: a second selection subunit for selecting at least one speech segment from the speech segments of each of the M categories, obtaining Z speech segments, Z being a natural number with Z ≥ M; a second display subunit for playing the Z selected speech segments to a user and displaying a speaker list, the speaker list including information on each of the M speakers; a second receiving subunit for receiving a matching instruction issued by the user, the matching instruction instructing that each of the Z speech segments be matched with a speaker; and a second determining subunit for determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Further, the clustering unit is configured to perform the following steps. S1: randomly select M speech segments from the N speech segments and use the M selected speech segments as the cluster centers of M categories. S2: for the i-th speech segment among the remaining N-M speech segments, compute the distance between the i-th speech segment and each of the M cluster centers, and assign the i-th speech segment to the category whose cluster center is nearest to it; i takes each natural number from 1 to N-M in turn. S3: after the segments have all been assigned, recompute the cluster center of each of the M categories from the speech segments that category contains and update the M cluster centers. S2 and S3 are repeated until, for every one of the M categories, the distance between the cluster centers of two successive iterations is within a preset distance.
Further, the splitting unit comprises: a third determining subunit for determining the silent segments in the conference audio; a removal subunit for removing the silent segments from the conference audio; a segmentation subunit for splitting the conference audio with the silent segments removed according to the silent segments, obtaining W long speech segments, W being a natural number greater than or equal to 2 with W < N; an extraction subunit for extracting the acoustic features of each of the W long speech segments; a relative entropy analysis subunit for performing relative entropy analysis on the acoustic features of each of the W long speech segments; and a cutting subunit for cutting the W long speech segments according to the result of the relative entropy analysis, obtaining the N speech segments.
In one aspect, an embodiment of the present invention provides a storage medium, the storage medium including a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the meeting minutes generation method described above.
In one aspect, an embodiment of the present invention provides a computer device including a memory and a processor, the memory being used to store information including program instructions and the processor being used to control the execution of the program instructions; when loaded and executed by the processor, the program instructions implement the steps of the meeting minutes generation method described above.
In embodiments of the present invention, the conference audio is split to obtain N speech segments; the N speech segments are clustered to obtain speech segments of M categories; the speaker corresponding to each category's speech segments is determined; the speech content of the M speakers is determined according to the speech segments of the M categories; and meeting minutes are generated according to the speech content of each speaker. This solves the prior-art problem that manually compiling meeting minutes is time-consuming, laborious, and inefficient, and achieves the effect of intelligently analyzing the speech content of a meeting and efficiently compiling meeting minutes.
[Brief description of the drawings]
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without any creative effort.
Fig. 1 is a flowchart of an optional meeting minutes generation method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an optional meeting minutes generation device according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an optional computer device provided by an embodiment of the present invention.
[Detailed description of the embodiments]
For a better understanding of the technical solutions of the present invention, the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the invention. The singular forms "a", "an", "the", and "said" used in the embodiments of the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following objects.
Fig. 1 is a flowchart of an optional meeting minutes generation method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step S102: obtain conference audio.
Step S104: split the conference audio to obtain N speech segments, N being a natural number greater than or equal to 2.
Step S106: cluster the N speech segments to obtain speech segments of M categories, M being a natural number greater than or equal to 2 and M ≤ N, the speech segments of the M categories being in one-to-one correspondence with M speakers.
Step S108: determine the speaker corresponding to the speech segments of each of the M categories.
Step S110: determine the speech content of each of the M speakers according to the speech segments of the M categories.
Step S112: generate meeting minutes according to the speech content of each of the M speakers.
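Steps S102 through S112 compose into a single pipeline. The Python sketch below shows one way the flow fits together; the four callables passed in (split_audio, cluster_segments, ask_user_to_match, transcribe) and the segment dictionary layout are illustrative assumptions standing in for the segmentation, clustering, speaker-matching, and speech-recognition steps detailed later in this description, not part of the patent text.

```python
from collections import defaultdict

def generate_minutes(audio, num_speakers, split_audio, cluster_segments,
                     ask_user_to_match, transcribe):
    """Compose steps S104-S112. The four callables are hypothetical stand-ins
    for the segmentation, clustering, speaker-matching, and speech-recognition
    components described later in this description."""
    segments = split_audio(audio)                           # S104: N speech segments
    categories = cluster_segments(segments, num_speakers)   # S106: {category_id: [segment, ...]}
    speaker_of = ask_user_to_match(categories)              # S108: {category_id: speaker name}
    content = defaultdict(list)                             # S110: per-speaker speech content
    for cat_id, segs in categories.items():
        for seg in sorted(segs, key=lambda s: s["start"]):  # assumes segments carry a start time
            content[speaker_of[cat_id]].append(transcribe(seg))
    # S112: format the per-speaker content as meeting minutes
    return "\n".join(f"{speaker}: {' '.join(texts)}" for speaker, texts in content.items())
```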
In embodiments of the present invention, a meeting may take place in either of two situations:
The first situation: all participants are present in person. For example, a department holds a meeting and everyone gathers in the same meeting room.
The second situation: the meeting is held by means of application software. For example, a department holds a meeting; some participants gather in the same meeting room, while others are travelling elsewhere and join the meeting through WeChat, QQ, or other application software. As another example, a company holds a meeting with three participants: a department manager in Beijing, a department manager in Shanghai, and a department manager in Shenzhen; located in three different cities, these three people hold the meeting through WeChat, QQ, or other application software.
In embodiments of the present invention, the conference audio may be the audio produced during a meeting held in either of the above ways. The conference audio may be obtained by recording the meeting on site: for example, several people gather for a meeting, and one of them records the audio produced during the meeting with a mobile phone, a recorder, a recording pen, or other recording equipment. The conference audio may also be obtained by recording the audio produced during a meeting held through an instant messaging application: for example, several people hold a meeting through WeChat or QQ, and one of them records the WeChat/QQ audio produced during the meeting with a mobile phone, a recorder, a recording pen, or other recording equipment.
In embodiments of the present invention, a speaker is a person who speaks during the meeting. The number of speakers is less than or equal to the number of participants: if every participant speaks, the number of speakers equals the number of participants; if only some participants speak, the number of speakers is less than the number of participants.
The meeting minutes generation method provided by the embodiments of the present invention is illustrated below with a specific example.
For example, several people hold a meeting and the audio produced during the meeting is recorded, yielding conference audio; assume the conference audio is 20 minutes long. The conference audio is split, obtaining, say, 6000 (N = 6000) speech segments. These 6000 speech segments are clustered, obtaining speech segments of 3 (M = 3) categories. Category 1 contains 3000 speech segments, namely P(1,1), P(1,2), ..., P(1,3000); these 3000 segments correspond to the same speaker. Category 2 contains 1000 speech segments, namely P(2,1), P(2,2), ..., P(2,1000); these 1000 segments correspond to the same speaker. Category 3 contains 2000 speech segments, namely P(3,1), P(3,2), ..., P(3,2000); these 2000 segments correspond to the same speaker. Then the speaker corresponding to each category's speech segments is determined. Suppose it is determined that the 3000 segments of category 1 correspond to Speaker A, the 1000 segments of category 2 correspond to Speaker B, and the 2000 segments of category 3 correspond to Speaker C, as shown in Table 1. Speaker A's speech content is determined from segments P(1,1), P(1,2), ..., P(1,3000); Speaker B's from P(2,1), P(2,2), ..., P(2,1000); and Speaker C's from P(3,1), P(3,2), ..., P(3,2000). Meeting minutes are then generated from the speech content of Speakers A, B, and C.
Table 1

Category     Speech segments                      Corresponding speaker
Category 1   P(1,1), P(1,2), ..., P(1,3000)       Speaker A
Category 2   P(2,1), P(2,2), ..., P(2,1000)       Speaker B
Category 3   P(3,1), P(3,2), ..., P(3,2000)       Speaker C
In embodiments of the present invention, the conference audio is split to obtain N speech segments; the N speech segments are clustered to obtain speech segments of M categories; the speaker corresponding to each category's speech segments is determined; the speech content of the M speakers is determined according to the speech segments of the M categories; and meeting minutes are generated according to the speech content of each speaker. This solves the prior-art problem that manually compiling meeting minutes is time-consuming, laborious, and inefficient, and achieves the effect of intelligently analyzing the speech content of a meeting and efficiently compiling meeting minutes.
There are several specific methods for determining the speaker corresponding to each category's speech segments; a few are listed below.
Method one:
At least one speech segment is selected from the speech segments of each of the M categories and converted into a text segment, obtaining L text segments, L being a natural number with L ≥ M. The L text segments and a speaker list are displayed to the user, the speaker list including information on each of the M speakers. A matching instruction is received; the matching instruction is an instruction issued by the user that instructs that each of the L text segments be matched with a speaker. The speaker corresponding to each category's speech segments is then determined according to the matching instruction.
When selecting at least one speech segment from the speech segments of each of the M categories and converting it into a text segment to obtain the L text segments, specifically, at least one speech segment may be randomly selected from the speech segments of each of the M categories and converted into a text segment.
For example, one speech segment is selected from each of the 3 categories shown in Table 1: the segment selected from category 1 is P(1,1); the segment selected from category 2 is P(2,1); the segment selected from category 3 is P(3,1). These 3 speech segments are converted separately, obtaining text segments F(1,1), F(2,1), and F(3,1). The correspondence between these three text segments and the three speech segments is shown in Table 2.
Table 2

Speech segment    Converted text segment
P(1,1)            F(1,1)
P(2,1)            F(2,1)
P(3,1)            F(3,1)
These 3 (L = 3) text segments and a speaker list are displayed to the user, the speaker list including information on each of the 3 speakers. The speaker information may include the speaker's name, position, and so on.
The user may be the host of the meeting or any other participant.
After seeing the 3 text segments, the user can tell which participant's speech each text segment corresponds to. For example, suppose one text segment reads: "Hello everyone, I am the host of today's meeting." After seeing this text segment, the user knows that it corresponds to the speech of the meeting host. The user can then issue a matching instruction, which instructs that each of the 3 text segments be matched with a speaker; for example, the matching instruction matches text segments with speakers according to Table 3.
Table 3

Text segment    Corresponding speaker
F(1,1)          Speaker A
F(2,1)          Speaker B
F(3,1)          Speaker C
Since text segment F(1,1) was converted from a speech segment of category 1, and all speech segments of category 1 correspond to the same speaker, Speaker A, who corresponds to text segment F(1,1), is the speaker corresponding to all speech segments of category 1; that is, all speech segments of category 1 were uttered by Speaker A. Similarly, since text segment F(2,1) was converted from a speech segment of category 2, all speech segments of category 2 were uttered by Speaker B; and since text segment F(3,1) was converted from a speech segment of category 3, all speech segments of category 3 were uttered by Speaker C. The correspondence between speech segments and speakers is as shown in Table 4.
Table 4

Speech segments                        Corresponding speaker
All speech segments of category 1      Speaker A
All speech segments of category 2      Speaker B
All speech segments of category 3      Speaker C
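The propagation step behind Table 4 is mechanical: one user-confirmed match per category labels every segment in that category. Below is a minimal Python sketch of it, under the assumption that the matching instruction arrives as a mapping from the sampled segment (or its transcript) to a speaker name; the function and variable names are illustrative, not from the patent.

```python
def propagate_matches(categories, sample_of, match_instruction):
    """categories: {category_id: [segment, ...]}
    sample_of: {category_id: sample_id}, the segment (or transcript) shown to the user
    match_instruction: {sample_id: speaker_name}, the user's matching instruction
    Returns {segment: speaker_name} covering every segment of every category."""
    segment_speaker = {}
    for cat_id, segments in categories.items():
        speaker = match_instruction[sample_of[cat_id]]  # one confirmed match per category
        for seg in segments:                            # label the whole category with it
            segment_speaker[seg] = speaker
    return segment_speaker
```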
Method two:
At least one speech segment is selected from the speech segments of each of the M categories, obtaining Z speech segments, Z being a natural number with Z ≥ M. The Z selected speech segments are played to the user and a speaker list is displayed, the speaker list including information on each of the M speakers. A matching instruction is received; the matching instruction is an instruction issued by the user that instructs that each of the Z speech segments be matched with a speaker. The speaker corresponding to each category's speech segments is then determined according to the matching instruction.
When selecting at least one speech segment from the speech segments of each of the M categories, specifically, at least one speech segment may be randomly selected from the speech segments of each of the M categories.
Suppose 2 speech segments are randomly selected from category 1, namely P(1,32) and P(1,450); 2 from category 2, namely P(2,100) and P(2,400); and 2 from category 3, namely P(3,900) and P(3,600).
These 6 (Z = 6) speech segments are played to the user. After hearing the 6 speech segments, the user can easily identify from the timbre of each voice which participant each segment belongs to. The user can then issue a matching instruction, which instructs that each of the 6 speech segments be matched with a speaker; the matching is shown in Table 5.
Table 5

Speech segments         Corresponding speaker
P(1,32), P(1,450)       Speaker A
P(2,100), P(2,400)      Speaker B
P(3,900), P(3,600)      Speaker C
Since speech segments P(1,32) and P(1,450) belong to category 1, and all speech segments of category 1 correspond to the same speaker, Speaker A, who corresponds to P(1,32) and P(1,450), is the speaker corresponding to all speech segments of category 1; that is, all speech segments of category 1 were uttered by Speaker A. Similarly, since P(2,100) and P(2,400) belong to category 2, all speech segments of category 2 were uttered by Speaker B; and since P(3,900) and P(3,600) belong to category 3, all speech segments of category 3 were uttered by Speaker C. The correspondence between speech segments and speakers is as shown in Table 4.
In embodiments of the present invention, the speech segments corresponding to the same speaker are gathered together by the clustering algorithm; one or more speech segments are then randomly selected from each category and either played to the user, who is asked to match the speech segments with speakers, or converted into text segments that are displayed to the user, who is asked to match the text segments with speakers. This is simple and convenient and requires no prior knowledge of the speakers' voiceprint features or other voice-related features.
The N speech segments are clustered as follows:
S1: randomly select M speech segments from the N speech segments and use the M selected speech segments as the cluster centers of M categories. S2: for the i-th speech segment among the remaining N-M speech segments, compute the distance between the i-th speech segment and each of the M cluster centers, and assign the i-th speech segment to the category whose cluster center is nearest to it; i takes each natural number from 1 to N-M in turn. S3: after the segments have all been assigned, recompute the cluster center of each of the M categories from the speech segments that category contains and update the M cluster centers. Repeat S2 and S3 until, for every one of the M categories, the distance between the cluster centers of two successive iterations is within a preset distance.
In embodiments of the present invention, the K-means algorithm may be used to cluster the speech segments. M is the number of speakers, and this number may be provided by the meeting host or another participant.
K-means is a typical distance-based clustering algorithm: it uses distance as the similarity measure and considers two objects more similar the closer they are. The algorithm regards a cluster as composed of objects close to one another and takes obtaining compact, well-separated clusters as its final goal. The choice of initial cluster centers has a considerable influence on the clustering result, because the first step of the algorithm randomly selects k objects (in embodiments of the present invention, k = M) as the initial cluster centers, each initially representing one cluster. In each iteration the algorithm examines every remaining object in the data set and assigns it to the nearest cluster center. Once all data objects have been examined, one iteration is complete and new cluster centers are computed. If, between two successive iterations, the new centers equal the old ones or move less than a specified threshold, the algorithm terminates. In embodiments of the present invention, the loop ends and the clustering result is obtained when, for every one of the M categories, the distance between the cluster centers of two successive iterations is within the preset distance.
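As a concrete reading of steps S1 through S3, the following Python sketch runs K-means over per-segment feature vectors, using Euclidean distance for simplicity; the voiceprint-similarity distance described next can be substituted. The feature representation and stopping threshold are assumptions for illustration.

```python
import random
import numpy as np

def kmeans_segments(features, m, preset_distance=1e-4, seed=0):
    """features: (n, d) array, one feature vector per speech segment.
    Returns (labels, centers): a category index per segment and the M cluster centers."""
    rng = random.Random(seed)
    # S1: randomly select M segments as the initial cluster centers
    centers = features[rng.sample(range(len(features)), m)].copy()
    while True:
        # S2: assign every segment to the category with the nearest cluster center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # S3: recompute each category's center from the segments it now contains
        new_centers = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(m)
        ])
        # stop once every center moved less than the preset distance
        if np.all(np.linalg.norm(new_centers - centers, axis=1) < preset_distance):
            return labels, new_centers
        centers = new_centers
```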
In step S2 above, the distance between the i-th speech segment (among the remaining N-M speech segments) and each of the M cluster centers may be computed from voiceprint features. The specific process may be: extract the voiceprint features of the i-th speech segment (the segment to be clustered); extract the voiceprint features of each of the M cluster centers; compute the similarity between the voiceprint features of the i-th speech segment and those of each of the M cluster centers; and use the computed similarity as the distance between the i-th speech segment and that cluster center.
Because each person's vocal anatomy differs, and is further shaped by socioeconomic status, level of education, place of birth, and so on, different people's voiceprint features are never exactly the same. The voiceprint features extracted in embodiments of the present invention may be prosodic features. Timbre, intensity, pitch, and the like are collectively called the prosodic features of speech, also known as suprasegmental features. Intensity reflects variations in strength such as stressed and unstressed syllables; pitch reflects the tone and intonation of speech.
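As one assumed realization of the similarity computation, a segment can be summarized by a small prosodic feature vector and compared to a cluster center by cosine similarity. The concrete features below (frame energy statistics for intensity, zero-crossing rate as a rough pitch proxy) are illustrative choices, not the patent's prescribed feature set.

```python
import numpy as np

def prosodic_features(samples, frame_len=400):
    """Summarize a speech segment (1-D waveform array, assumed longer than one frame)
    as a prosodic feature vector: mean/std of frame energy (intensity) and of
    zero-crossing rate (a rough pitch proxy)."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energy = np.array([float(np.mean(f.astype(float) ** 2)) for f in frames])
    zcr = np.array([float(np.mean(np.abs(np.diff(np.sign(f))) > 0)) for f in frames])
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

def voiceprint_similarity(feat_a, feat_b):
    """Cosine similarity of two voiceprint feature vectors; in step S2 the computed
    similarity serves as the segment-to-center 'distance' (higher = nearer)."""
    return float(np.dot(feat_a, feat_b) / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b)))
```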
In embodiments of the present invention, the voiceprint features of the speech segments are extracted and the speech segments are clustered by voiceprint feature, so that segments with highly similar voiceprint features, taken as utterances of the same speaker, are gathered together. In this process the speakers' voiceprint features do not need to be known in advance, much less stored in advance, which protects the speakers' privacy, is highly secure, and gives a good user experience.
Optionally, splitting the conference audio to obtain N speech segments comprises: determining the silent segments in the conference audio; removing the silent segments from the conference audio; splitting the conference audio with the silent segments removed according to the silent segments, obtaining W long speech segments, W being a natural number greater than or equal to 2 with W < N; extracting the acoustic features of each of the W long speech segments; performing relative entropy analysis on the acoustic features of each of the W long speech segments; and cutting the W long speech segments according to the result of the relative entropy analysis, obtaining the N speech segments.
Optionally, performing relative entropy analysis on the acoustic features of each long speech segment and cutting the long speech segments according to the result comprises: dividing a long speech segment into frames to obtain its speech frames; extracting the acoustic features of the speech frames; performing relative entropy analysis on the acoustic features and locating the maximum of the relative entropy; judging whether the duration of the long speech segment exceeds a preset duration; and, if the duration of the long speech segment exceeds the preset duration, cutting the long speech segment at the maximum of the relative entropy.
In probability theory and information theory, relative entropy, also known as the KL divergence (Kullback-Leibler divergence), is a measure of the difference between two probability distributions P and Q. It is asymmetric, meaning that D(P||Q) ≠ D(Q||P). In particular, in information theory, D(P||Q) represents the information lost when the true distribution P is approximated by the distribution Q, where P denotes the true distribution and Q denotes the approximating distribution fitted to P.
For two probability distributions P and Q of a discrete random variable, their KL divergence is defined as D(P||Q) = Σ_i P(i) · ln(P(i)/Q(i)); the definition for a continuous random variable is analogous.
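To make the splitting criterion concrete, the sketch below computes the discrete KL divergence and uses it to locate a cut point in an over-long segment by comparing the acoustic-feature distributions on either side of each candidate frame boundary. The sliding-window scheme and histogram-style features are assumptions for illustration; the patent itself only specifies cutting at the relative entropy maximum when the segment exceeds a preset duration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """D(P||Q) = sum_i P(i) * ln(P(i) / Q(i)) for discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def find_cut_point(frame_features, window=50):
    """frame_features: (n_frames, n_bins) nonnegative per-frame feature histograms.
    Returns the frame index where the relative entropy between the feature
    distributions of the left and right windows is maximal, i.e. the cut point
    for a long segment whose duration exceeds the preset duration."""
    n = len(frame_features)
    best_i, best_d = None, -1.0
    for i in range(window, n - window):
        left = frame_features[i - window:i].mean(axis=0)    # distribution before frame i
        right = frame_features[i:i + window].mean(axis=0)   # distribution after frame i
        d = kl_divergence(left, right)
        if d > best_d:
            best_i, best_d = i, d
    return best_i
```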
Fig. 2 is a schematic diagram of an optional meeting minutes generation device according to an embodiment of the present invention; the device is used to execute the meeting minutes generation method above. As shown in Fig. 2, the device includes: an obtaining unit 10, a splitting unit 20, a clustering unit 30, a first determining unit 40, a second determining unit 50, and a generating unit 60.
The obtaining unit 10 is used to obtain conference audio.
The splitting unit 20 is used to split the conference audio to obtain N speech segments, N being a natural number greater than or equal to 2.
The clustering unit 30 is used to cluster the N speech segments to obtain speech segments of M categories, M being a natural number greater than or equal to 2 and M ≤ N, the speech segments of the M categories being in one-to-one correspondence with M speakers.
The first determining unit 40 is used to determine the speaker corresponding to the speech segments of each of the M categories.
The second determining unit 50 is used to determine the speech content of each of the M speakers according to the speech segments of the M categories.
The generating unit 60 is used to generate meeting minutes according to the speech content of each of the M speakers.
In embodiments of the present invention, the conference audio is split to obtain N speech segments; the N speech segments are clustered to obtain speech segments of M categories; the speaker corresponding to each category's speech segments is determined; the speech content of the M speakers is determined according to the speech segments of the M categories; and meeting minutes are generated according to the speech content of each speaker. This solves the prior-art problem that manually compiling meeting minutes is time-consuming, laborious, and inefficient, and achieves the effect of intelligently analyzing the speech content of a meeting and efficiently compiling meeting minutes.
Optionally, the first determining unit 40 comprises a first selection subunit, a first display subunit, a first receiving subunit, and a first determining subunit. The first selection subunit is used to select at least one speech segment from the speech segments of each of the M categories and convert it into a text segment, obtaining L text segments, L being a natural number with L ≥ M. The first display subunit is used to display the L text segments and a speaker list to the user, the speaker list including information on each of the M speakers. The first receiving subunit is used to receive a matching instruction issued by the user, the matching instruction instructing that each of the L text segments be matched with a speaker. The first determining subunit is used to determine, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Optionally, the first determining unit 40 comprises a second selection subunit, a second display subunit, a second receiving subunit, and a second determining subunit. The second selection subunit is used to select at least one speech segment from the speech segments of each of the M categories, obtaining Z speech segments, Z being a natural number with Z ≥ M. The second display subunit is used to play the Z selected speech segments to the user and display a speaker list, the speaker list including information on each of the M speakers. The second receiving subunit is used to receive a matching instruction issued by the user, the matching instruction instructing that each of the Z speech segments be matched with a speaker. The second determining subunit is used to determine, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Optionally, the clustering unit is used to execute the following steps. S1: randomly select M speech segments from the N speech segments and use the M selected speech segments as the cluster centers of M categories. S2: for the i-th speech segment among the remaining N-M speech segments, compute the distance between the i-th speech segment and each of the M cluster centers, and assign the i-th speech segment to the category whose cluster center is nearest to it; i takes each natural number from 1 to N-M in turn. S3: after the segments have all been assigned, recompute the cluster center of each of the M categories from the speech segments that category contains and update the M cluster centers. S2 and S3 are repeated until, for every one of the M categories, the distance between the cluster centers of two successive iterations is within a preset distance.
Optionally, the splitting unit 20 comprises a third determining subunit, a removal subunit, a segmentation subunit, an extraction subunit, a relative entropy analysis subunit, and a cutting subunit. The third determining subunit is used to determine the silent segments in the conference audio. The removal subunit is used to remove the silent segments from the conference audio. The segmentation subunit is used to split the conference audio with the silent segments removed according to the silent segments, obtaining W long speech segments, W being a natural number greater than or equal to 2 with W < N. The extraction subunit is used to extract the acoustic features of each of the W long speech segments. The relative entropy analysis subunit is used to perform relative entropy analysis on the acoustic features of each of the W long speech segments. The cutting subunit is used to cut the W long speech segments according to the result of the relative entropy analysis, obtaining the N speech segments.
In one aspect, an embodiment of the present invention provides a storage medium including a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the following steps: obtaining conference audio; splitting the conference audio to obtain N speech segments, N being a natural number greater than or equal to 2; clustering the N speech segments to obtain speech segments of M categories, M being a natural number greater than or equal to 2 and M ≤ N, the speech segments of the M categories being in one-to-one correspondence with M speakers; determining the speaker corresponding to the speech segments of each of the M categories; determining the speech content of each of the M speakers according to the speech segments of the M categories; and generating meeting minutes according to the speech content of each of the M speakers.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to execute the following steps: selecting at least one speech segment from the speech segments of each of the M categories and converting it into a text segment, obtaining L text segments, L being a natural number with L ≥ M; displaying the L text segments and a speaker list to the user, the speaker list including information on each of the M speakers; receiving a matching instruction issued by the user, the matching instruction instructing that each of the L text segments be matched with a speaker; and determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to execute the following steps: selecting at least one speech segment from the speech segments of each of the M categories, obtaining Z speech segments, Z being a natural number with Z ≥ M; playing the Z selected speech segments to the user and displaying a speaker list, the speaker list including information on each of the M speakers; receiving a matching instruction issued by the user, the matching instruction instructing that each of the Z speech segments be matched with a speaker; and determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to execute the following steps: S1: randomly selecting M speech segments from the N speech segments and using the M selected speech segments as the cluster centers of M categories; S2: for the i-th speech segment among the remaining N-M speech segments, computing the distance between the i-th speech segment and each of the M cluster centers, and assigning the i-th speech segment to the category whose cluster center is nearest to it, i taking each natural number from 1 to N-M in turn; S3: after the segments have all been assigned, recomputing the cluster center of each of the M categories from the speech segments that category contains and updating the M cluster centers; and repeating S2 and S3 until, for every one of the M categories, the distance between the cluster centers of two successive iterations is within a preset distance.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to execute the following steps: determining the silent segments in the conference audio; removing the silent segments from the conference audio; splitting the conference audio with the silent segments removed according to the silent segments, obtaining W long speech segments, W being a natural number greater than or equal to 2 with W < N; extracting the acoustic features of each of the W long speech segments; performing relative entropy analysis on the acoustic features of each of the W long speech segments; and cutting the W long speech segments according to the result of the relative entropy analysis, obtaining the N speech segments.
In one aspect, an embodiment of the present invention provides a computer device including a memory and a processor, the memory being used to store information including program instructions and the processor being used to control the execution of the program instructions. When loaded and executed by the processor, the program instructions implement the following steps: obtaining conference audio; splitting the conference audio to obtain N speech segments, N being a natural number greater than or equal to 2; clustering the N speech segments to obtain speech segments of M categories, M being a natural number greater than or equal to 2 and M ≤ N, the speech segments of the M categories being in one-to-one correspondence with M speakers; determining the speaker corresponding to the speech segments of each of the M categories; determining the speech content of each of the M speakers according to the speech segments of the M categories; and generating meeting minutes according to the speech content of each of the M speakers.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: selecting at least one speech segment from the speech segments of each of the M categories and converting it into a text segment, obtaining L text segments, L being a natural number with L ≥ M; displaying the L text segments and a speaker list to the user, the speaker list including information on each of the M speakers; receiving a matching instruction issued by the user, the matching instruction instructing that each of the L text segments be matched with a speaker; and determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: selecting at least one speech segment from the speech segments of each of the M categories, obtaining Z speech segments, Z being a natural number with Z ≥ M; playing the Z selected speech segments to the user and displaying a speaker list, the speaker list including information on each of the M speakers; receiving a matching instruction issued by the user, the matching instruction instructing that each of the Z speech segments be matched with a speaker; and determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: S1: randomly selecting M speech segments from the N speech segments and using the M selected speech segments as the cluster centers of M categories; S2: for the i-th speech segment among the remaining N-M speech segments, computing the distance between the i-th speech segment and each of the M cluster centers, and assigning the i-th speech segment to the category whose cluster center is nearest to it, i taking each natural number from 1 to N-M in turn; S3: after the segments have all been assigned, recomputing the cluster center of each of the M categories from the speech segments that category contains and updating the M cluster centers; and repeating S2 and S3 until, for every one of the M categories, the distance between the cluster centers of two successive iterations is within a preset distance.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: determining the silent segments in the conference audio; removing the silent segments from the conference audio; splitting the conference audio with the silent segments removed according to the silent segments, obtaining W long speech segments, W being a natural number greater than or equal to 2 with W < N; extracting the acoustic features of each of the W long speech segments; performing relative entropy analysis on the acoustic features of each of the W long speech segments; and cutting the W long speech segments according to the result of the relative entropy analysis, obtaining the N speech segments.
Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present invention. As shown in Fig. 3, the computer device 50 of this embodiment includes a processor 51, a memory 52, and a computer program 53 stored in the memory 52 and runnable on the processor 51. When executed by the processor 51, the computer program 53 implements the meeting minutes generation method of the embodiment; to avoid repetition, it is not described again here. Alternatively, when executed by the processor 51, the computer program implements the functions of the models/units of the meeting minutes generation device of the embodiment; to avoid repetition, these are likewise not described again here.
The computer device 50 may be a desktop computer, a notebook, a palmtop computer, a cloud server, or other computing equipment. The computer device may include, but is not limited to, the processor 51 and the memory 52. Those skilled in the art will understand that Fig. 3 is only an example of the computer device 50 and does not limit it; the computer device may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input/output devices, network access devices, buses, and the like.
The processor 51 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 52 may be an internal storage unit of the computer device 50, such as its hard disk or main memory. The memory 52 may also be an external storage device of the computer device 50, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 50. Further, the memory 52 may include both an internal storage unit and an external storage device of the computer device 50. The memory 52 is used to store the computer program and other programs and data needed by the computer device, and may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division; in actual implementation there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment's solution.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A minutes generation method, characterized in that the method comprises:
obtaining a conference voice;
segmenting the conference voice to obtain N speech segments, where N is a natural number greater than or equal to 2;
clustering the N speech segments to obtain speech segments of M categories, where M is a natural number greater than or equal to 2 and M ≤ N, the speech segments of the M categories respectively having a one-to-one correspondence with M speakers;
determining the speaker corresponding to the speech segments of each of the M categories;
determining the speech content of each of the M speakers according to the speech segments of the M categories;
generating minutes according to the speech content of each of the M speakers.
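For illustration only and not as part of the claim language, the Python sketch below shows how the five claimed steps could be composed; generate_minutes and the four callables it receives (segment, cluster, identify, transcribe) are hypothetical placeholders for the operations recited above, not an implementation disclosed by this application.

```python
def generate_minutes(conference_audio, speakers, segment, cluster, identify, transcribe):
    """Hypothetical composition of the five claimed steps.

    conference_audio: the recorded conference voice
    speakers:         names of the M speakers attending the meeting
    segment/cluster/identify/transcribe: placeholder callables for the
    segmentation, clustering, speaker-matching and speech-recognition steps
    """
    segments = segment(conference_audio)             # N speech segments, N >= 2
    categories = cluster(segments, m=len(speakers))  # {category_id: segments}, M <= N
    speaker_of = identify(categories, speakers)      # {category_id: speaker}
    # Speech content of each speaker, from the segments of their category.
    content = {speaker_of[k]: transcribe(segs) for k, segs in categories.items()}
    # Generate the minutes from the per-speaker speech content.
    return "\n".join(f"{name}: {text}" for name, text in content.items())
```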
2. The method according to claim 1, characterized in that the determining the speaker corresponding to the speech segments of each of the M categories comprises:
selecting at least one speech segment from the speech segments of each of the M categories and converting it into a text fragment, to obtain L text fragments, where L is a natural number and L ≥ M;
displaying the L text fragments and a speaker list to a user, the speaker list comprising information of each of the M speakers;
receiving a matching instruction, the matching instruction being an instruction issued by the user for matching each of the L text fragments with a speaker;
determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
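As an illustrative sketch of this interactive flow (the claim itself does not prescribe any interface), the loop below converts one sample segment per category to text, shows the speaker list, and records the user's matching instruction; the speech_to_text callable and the console prompts are assumptions.

```python
def match_by_text(categories, speakers, speech_to_text):
    """Illustrative claim-2 flow: match converted text fragments to speakers.

    categories:     {category_id: list of speech segments}, M categories
    speakers:       names of the M speakers (the displayed speaker list)
    speech_to_text: any speech-recognition callable (an assumption here)
    """
    matches = {}
    for cat_id, segments in categories.items():
        fragment = speech_to_text(segments[0])  # at least one fragment per category
        print(f"Fragment for category {cat_id}: {fragment!r}")
        for j, name in enumerate(speakers):     # display the speaker list
            print(f"  [{j}] {name}")
        choice = int(input("Number of the speaker who said this: "))
        matches[cat_id] = speakers[choice]      # the user's matching instruction
    return matches
```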
3. The method according to claim 1, characterized in that the determining the speaker corresponding to the speech segments of each of the M categories comprises:
selecting at least one speech segment from the speech segments of each of the M categories, to obtain Z speech segments, where Z is a natural number and Z ≥ M;
playing the selected Z speech segments to a user and displaying a speaker list, the speaker list comprising information of each of the M speakers;
receiving a matching instruction, the matching instruction being an instruction issued by the user for matching each of the Z speech segments with a speaker;
determining, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
4. The method according to claim 1, characterized in that the clustering the N speech segments comprises:
S1: randomly selecting M speech segments from the N speech segments, and taking the M selected speech segments as the cluster centers of M categories;
S2: for the i-th speech segment among the remaining N-M speech segments, calculating the distance between the i-th speech segment and each of the M cluster centers, and assigning the i-th speech segment to the category corresponding to the cluster center nearest to it, where i takes each natural number from 1 to N-M in turn;
S3: after the N-M speech segments have all been classified, recalculating the cluster centers of the M categories according to the speech segments contained in each of the M categories, and updating the cluster centers of the M categories;
looping through S2 and S3 until, for each of the M categories, the distance between two successive cluster centers falls within a preset distance.
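Steps S1 to S3 describe a K-means-style loop. Below is a minimal sketch, assuming each speech segment has already been reduced to a fixed-length acoustic feature vector (e.g. averaged MFCCs, which the claim does not specify) and re-assigning all N segments on every pass, as is usual for this loop.

```python
import numpy as np

def cluster_segments(features, m, tol=1e-4, seed=None):
    """K-means-style clustering of N segment feature vectors into M categories.

    features: (N, D) array, one acoustic feature vector per speech segment
    m:        number of categories (speakers), M <= N
    tol:      preset distance within which successive cluster centers must fall
    """
    rng = np.random.default_rng(seed)
    # S1: randomly select M segments as the initial cluster centers.
    centers = features[rng.choice(len(features), size=m, replace=False)].copy()
    while True:
        # S2: assign every segment to the category of its nearest center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # S3: recompute each center from the segments now in its category.
        new_centers = np.array([features[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(m)])
        # Loop until every center has moved less than the preset distance.
        if np.all(np.linalg.norm(new_centers - centers, axis=1) < tol):
            return labels, new_centers
        centers = new_centers
```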
5. The method according to any one of claims 1 to 4, characterized in that the segmenting the conference voice to obtain N speech segments comprises:
determining the silence segments in the conference voice;
removing the silence segments from the conference voice;
segmenting, according to the silence segments, the conference voice from which the silence segments have been removed, to obtain W long speech segments, where W is a natural number greater than or equal to 2 and W < N;
extracting an acoustic feature of each of the W long speech segments;
performing relative entropy analysis on the acoustic feature of each of the W long speech segments;
splitting the W long speech segments according to the result of the relative entropy analysis, to obtain the N speech segments.
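The claim does not specify the form of the relative entropy analysis. One common realization, shown as an assumption below, slides a window over the feature frames of a long segment and declares a speaker-change boundary wherever the symmetric Kullback-Leibler divergence (relative entropy) between diagonal-Gaussian models of the two half-windows exceeds a threshold; the window length and threshold values are illustrative.

```python
import numpy as np

def relative_entropy_boundaries(frames, win=100, threshold=2.0):
    """Illustrative speaker-change detection inside one long speech segment.

    frames:    (T, D) acoustic feature frames (e.g. MFCCs) of a long segment
    win:       frames in each half of the sliding comparison window (assumed)
    threshold: divergence above which a cut point is declared (assumed)
    """
    boundaries = []
    for t in range(win, len(frames) - win):
        left, right = frames[t - win:t], frames[t:t + win]
        # Model each half-window as a diagonal Gaussian.
        mu1, var1 = left.mean(axis=0), left.var(axis=0) + 1e-8
        mu2, var2 = right.mean(axis=0), right.var(axis=0) + 1e-8
        # Symmetric Kullback-Leibler divergence between the two Gaussians.
        kl = 0.5 * np.sum(var1 / var2 + var2 / var1 - 2
                          + (mu1 - mu2) ** 2 * (1 / var1 + 1 / var2))
        if kl > threshold:
            boundaries.append(t)  # in practice adjacent hits would be merged
    return boundaries
```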
6. A minutes generating device, characterized in that the device comprises:
an acquiring unit, configured to obtain a conference voice;
a segmenting unit, configured to segment the conference voice to obtain N speech segments, where N is a natural number greater than or equal to 2;
a clustering unit, configured to cluster the N speech segments to obtain speech segments of M categories, where M is a natural number greater than or equal to 2 and M ≤ N, the speech segments of the M categories respectively having a one-to-one correspondence with M speakers;
a first determining unit, configured to determine the speaker corresponding to the speech segments of each of the M categories;
a second determining unit, configured to determine the speech content of each of the M speakers according to the speech segments of the M categories;
a generating unit, configured to generate minutes according to the speech content of each of the M speakers.
7. The device according to claim 6, characterized in that the first determining unit comprises:
a first selecting subunit, configured to select at least one speech segment from the speech segments of each of the M categories and convert it into a text fragment, to obtain L text fragments, where L is a natural number and L ≥ M;
a first displaying subunit, configured to display the L text fragments and a speaker list to a user, the speaker list comprising information of each of the M speakers;
a first receiving subunit, configured to receive a matching instruction, the matching instruction being an instruction issued by the user for matching each of the L text fragments with a speaker;
a first determining subunit, configured to determine, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
8. The device according to claim 6, characterized in that the first determining unit comprises:
a second selecting subunit, configured to select at least one speech segment from the speech segments of each of the M categories, to obtain Z speech segments, where Z is a natural number and Z ≥ M;
a second displaying subunit, configured to play the selected Z speech segments to a user and display a speaker list, the speaker list comprising information of each of the M speakers;
a second receiving subunit, configured to receive a matching instruction, the matching instruction being an instruction issued by the user for matching each of the Z speech segments with a speaker;
a second determining subunit, configured to determine, according to the matching instruction, the speaker corresponding to the speech segments of each of the M categories.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, the device where the storage medium is located is controlled to perform the minutes generation method according to any one of claims 1 to 5.
10. A computer device, comprising a memory and a processor, the memory being configured to store information including program instructions and the processor being configured to control the execution of the program instructions, characterized in that the program instructions, when loaded and executed by the processor, implement the steps of the minutes generation method according to any one of claims 1 to 5.
CN201910038460.6A 2019-01-16 2019-01-16 A kind of minutes generation method and device Pending CN109767757A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910038460.6A CN109767757A (en) 2019-01-16 2019-01-16 A kind of minutes generation method and device
PCT/CN2019/118256 WO2020147407A1 (en) 2019-01-16 2019-11-14 Conference record generation method and apparatus, storage medium and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910038460.6A CN109767757A (en) 2019-01-16 2019-01-16 A kind of minutes generation method and device

Publications (1)

Publication Number Publication Date
CN109767757A true CN109767757A (en) 2019-05-17

Family

ID: 66452786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910038460.6A Pending CN109767757A (en) 2019-01-16 2019-01-16 A kind of minutes generation method and device

Country Status (2)

Country Link
CN (1) CN109767757A (en)
WO (1) WO2020147407A1 (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559882B (en) * 2013-10-14 2016-08-10 华南理工大学 A kind of meeting presider's voice extraction method based on speaker's segmentation
CN105810207A (en) * 2014-12-30 2016-07-27 富泰华工业(深圳)有限公司 Meeting recording device and method thereof for automatically generating meeting record
US10424317B2 (en) * 2016-09-14 2019-09-24 Nuance Communications, Inc. Method for microphone selection and multi-talker segmentation with ambient automated speech recognition (ASR)
CN109767757A (en) * 2019-01-16 2019-05-17 平安科技(深圳)有限公司 A kind of minutes generation method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102185702A (en) * 2011-04-27 2011-09-14 华东师范大学 Intelligent conference system terminal controller, and operating method and application thereof
CN103530432A (en) * 2013-09-24 2014-01-22 华南理工大学 Conference recorder with speech extracting function and speech extracting method
US20170270930A1 (en) * 2014-08-04 2017-09-21 Flagler Llc Voice tallying system
CN106487757A (en) * 2015-08-28 2017-03-08 华为技术有限公司 Carry out method, conference client and the system of voice conferencing
CN107689225A (en) * 2017-09-29 2018-02-13 福建实达电脑设备有限公司 A kind of method for automatically generating minutes
CN108986826A (en) * 2018-08-14 2018-12-11 中国平安人寿保险股份有限公司 Automatically generate method, electronic device and the readable storage medium storing program for executing of minutes

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020147407A1 (en) * 2019-01-16 2020-07-23 平安科技(深圳)有限公司 Conference record generation method and apparatus, storage medium and computer device
CN110265032A (en) * 2019-06-05 2019-09-20 平安科技(深圳)有限公司 Conferencing data analysis and processing method, device, computer equipment and storage medium
CN110543559A (en) * 2019-06-28 2019-12-06 谭浩 Method for generating interview report, computer-readable storage medium and terminal device
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110930984A (en) * 2019-12-04 2020-03-27 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN111933144A (en) * 2020-10-09 2020-11-13 融智通科技(北京)股份有限公司 Conference voice transcription method and device for post-creation of voiceprint and storage medium
CN112562682A (en) * 2020-12-02 2021-03-26 携程计算机技术(上海)有限公司 Identity recognition method, system, equipment and storage medium based on multi-person call
CN113674755A (en) * 2021-08-19 2021-11-19 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and medium
CN113674755B (en) * 2021-08-19 2024-04-02 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and medium
DE202022101429U1 (en) 2022-03-17 2022-04-06 Waseem Ahmad Intelligent system for creating meeting minutes using artificial intelligence and machine learning

Also Published As

Publication number Publication date
WO2020147407A1 (en) 2020-07-23

Similar Documents

Publication Publication Date Title
CN109767757A (en) A kind of minutes generation method and device
CN107680600B (en) Sound-groove model training method, audio recognition method, device, equipment and medium
CN109741754A (en) A kind of conference voice recognition methods and system, storage medium and terminal
CN107545897A (en) Conversation activity presumption method, conversation activity estimating device and program
CN106407178A (en) Session abstract generation method and device
CN110119673A (en) Noninductive face Work attendance method, device, equipment and storage medium
CN102486922B (en) Speaker recognition method, device and system
CN108536595B (en) Intelligent matching method and device for test cases, computer equipment and storage medium
CN108257594A (en) A kind of conference system and its information processing method
CN108597525A (en) Voice vocal print modeling method and device
CN109871762B (en) Face recognition model evaluation method and device
CN109214446A (en) Potentiality good performance personnel kind identification method, system, terminal and computer readable storage medium
CN108269122A (en) The similarity treating method and apparatus of advertisement
CN108764114B (en) Signal identification method and device, storage medium and terminal thereof
CN109800309A (en) Classroom Discourse genre classification methods and device
CN112287082A (en) Data processing method, device, equipment and storage medium combining RPA and AI
CN107195312B (en) Method and device for determining emotion releasing mode, terminal equipment and storage medium
CN109242106A (en) sample processing method, device, equipment and storage medium
CN111816170A (en) Training of audio classification model and junk audio recognition method and device
CN106844743B (en) Emotion classification method and device for Uygur language text
JP5083951B2 (en) Voice processing apparatus and program
CN109948718B (en) System and method based on multi-algorithm fusion
CN107506407A (en) A kind of document classification, the method and device called
CN104978395B (en) Visual dictionary building and application method and device
CN108228950A (en) A kind of information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination