CN110335612A - Minutes generation method, device and storage medium based on speech recognition - Google Patents
- Publication number
- CN110335612A CN110335612A CN201910627403.1A CN201910627403A CN110335612A CN 110335612 A CN110335612 A CN 110335612A CN 201910627403 A CN201910627403 A CN 201910627403A CN 110335612 A CN110335612 A CN 110335612A
- Authority
- CN
- China
- Prior art keywords
- audio
- minutes
- sentence
- voice segments
- converted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval of audio data
- G06F16/63—Querying
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/08—Speech classification or search
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
- G10L17/00—Speaker identification or verification
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L2015/088—Word spotting
- G10L2015/223—Execution procedure of a spoken command
Abstract
Disclosed herein is a meeting minutes generation method based on speech recognition. The method comprises: receiving a minutes generation instruction issued by a user and obtaining audio to be converted; performing sentence division on the audio to be converted to obtain the audio sentences of the audio to be converted; extracting a voiceprint feature from each identified audio sentence, comparing the voiceprint feature corresponding to each audio sentence with a preset voiceprint feature library to determine the speaker identity information corresponding to each audio sentence, dividing the audio sentences into voice segments according to the speaker identity information, and determining the voice segment set corresponding to the audio to be converted; calling the target speech recognition model corresponding to each voice segment and obtaining, in turn, the text corresponding to each voice segment; and generating the meeting minutes corresponding to the audio to be converted. The present invention also discloses an electronic device and a computer storage medium. With the present invention, the accuracy and efficiency of minutes generation can be improved.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a meeting minutes generation method based on speech recognition, an electronic device, and a computer-readable storage medium.
Background technique
At present, meeting minutes are mainly written as follows: first, keywords are recorded on site during the meeting; second, after the meeting, the keywords are located in the meeting recording, the nearby audio is listened to and transcribed, and the keywords are expanded to form the minutes. However, because there is no correspondence between the keywords and the recording, the recorder has to locate each keyword manually and repeatedly when searching the recording after the meeting, which is time-consuming and cumbersome. Furthermore, if the same keyword occurs several times in the meeting, locating it purely by listening is prone to mis-positioning, which in turn introduces errors into the minutes.
To solve the above problems, minutes products that rely on speech conversion technology to automatically generate minutes text have appeared on the market. However, such existing products are usually simple speech-to-text products: the accuracy of the speech conversion cannot be guaranteed, what the recorder obtains after use is one long text with no link back to the meeting recording, and, because speech-to-text technology is not yet mature enough, the recorder is often at a loss when the transcribed text contains too many errors and in the end can only complete the minutes by manually replaying the recording.
Therefore, how to generate meeting minutes conveniently and accurately has become a technical problem to be urgently solved.
Summary of the invention
In view of the foregoing, the present invention provides a meeting minutes generation method based on speech recognition, an electronic device, and a computer-readable storage medium, the main purpose of which is to improve the efficiency and accuracy of minutes generation.
To achieve the above object, the present invention provides a meeting minutes generation method based on speech recognition, the method comprising:
A receiving step: receiving a minutes generation instruction issued by a user and obtaining audio to be converted according to the minutes generation instruction, or obtaining audio to be converted from a preset storage path periodically or in real time;
A first division step: performing sentence division on the audio to be converted to obtain the audio sentences of the audio to be converted;
A second division step: extracting a voiceprint feature from each audio sentence, comparing and analyzing the voiceprint feature of each audio sentence against a preset voiceprint feature library, determining the speaker identity information corresponding to each audio sentence, dividing the audio sentences into voice segments according to the speaker identity information, and determining the voice segment set corresponding to the audio to be converted;
A speech recognition step: calling, according to the speaker identity information corresponding to each voice segment in the voice segment set, the target speech recognition model corresponding to that voice segment, feeding each voice segment in turn into its corresponding target speech recognition model, and obtaining the text fragment corresponding to each voice segment, wherein the target speech recognition model is obtained by update training based on an accent corpus and an industry corpus; and
A generation step: merging the text fragments corresponding to the voice segments to generate the target text corresponding to the audio to be converted, and associating each text fragment in the target text with its corresponding voice segment and speaker identity information, to generate the meeting minutes corresponding to the audio to be converted.
In addition, to achieve the above object, the present invention also provides an electronic device, the device comprising a memory and a processor, wherein a minutes generation program executable on the processor is stored in the memory, and the minutes generation program, when executed by the processor, can implement any step of the meeting minutes generation method based on speech recognition as described above.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium in which a minutes generation program is included, and the minutes generation program, when executed by a processor, can implement any step of the meeting minutes generation method based on speech recognition as described above.
With the meeting minutes generation method, electronic device and computer-readable storage medium based on speech recognition proposed by the present invention: 1. sentence division, voiceprint feature extraction and speaker identity matching are performed on the audio to be converted, the voice segment set corresponding to the audio to be converted is determined according to the matching result, and a different target speech recognition model is called to perform speech recognition on each voice segment, which improves the efficiency and accuracy of speech recognition and lays the foundation for subsequently generating complete and accurate minutes; 2. the models are update-trained with the speakers' accent corpora and an industry corpus, which improves the accuracy of speech recognition; 3. the minutes are generated by associating speaker identity information, voice segments, text fragments, keywords and so on, which improves the completeness and convenience of the minutes.
Detailed description of the invention
Fig. 1 is a flow chart of a preferred embodiment of the meeting minutes generation method based on speech recognition of the present invention;
Fig. 2 is a schematic diagram of a preferred embodiment of the electronic device of the present invention;
Fig. 3 is a schematic diagram of the program modules of a preferred embodiment of the minutes generation program in Fig. 2.
The realization of the object, the functional characteristics and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The present invention provides a meeting minutes generation method based on speech recognition. The method can be executed by a device, and the device can be realized by software and/or hardware.
Referring to Fig. 1, it is a flow chart of a preferred embodiment of the meeting minutes generation method based on speech recognition of the present invention.
In one embodiment of the meeting minutes generation method based on speech recognition of the present invention, the method comprises steps S1 to S5.
Step S1: receiving a minutes generation instruction issued by a user and obtaining audio to be converted according to the minutes generation instruction, or obtaining audio to be converted from a preset storage path periodically or in real time.
In the following description, the embodiments of the present invention are illustrated with an electronic device as the execution subject.
In this embodiment, the user issues a minutes generation instruction to the electronic device through a terminal, wherein the instruction contains the audio to be converted. The audio to be converted is the speech audio recorded during the meeting; it may be input and saved by the user through a speech device such as a microphone, or it may be a speech information file downloaded from the Internet or imported locally by the user. The preset storage path includes, but is not limited to, a database for storing minutes-related audio.
The step of obtaining the audio to be converted from the preset storage path periodically or in real time comprises: periodically (for example, at 9:00 every morning and 5:30 every afternoon) judging whether unconverted minutes-related audio exists in the storage path; if so, taking the unconverted minutes-related audio as the audio to be converted; if not, judging that no audio to be converted exists. Alternatively, whenever a piece of minutes-related audio is written into the preset storage path, it is taken as the audio to be converted and read out, so as to execute the subsequent steps.
Step S2: performing sentence division on the audio to be converted based on a preset sentence division rule, to obtain the audio sentences of the audio to be converted.
The purpose of performing sentence division on the audio to be converted is to obtain short sentences on which speech recognition is easier to perform, thereby improving the accuracy of the subsequent audio-to-text conversion. In this embodiment, performing sentence division on the audio to be converted based on the preset sentence division rule to obtain the audio sentences of the audio to be converted comprises:
a1. identifying the first pause in the audio to be converted, and recording the start time and the end time of the first pause;
a2. identifying the first sentence in the audio to be converted, and taking the end time of the first pause as the start time of the first sentence;
a3. identifying the second pause, recording the start time and the end time of the second pause, and taking the start time of the second pause as the end time of the first sentence, thereby realizing the division of the first sentence;
a4. executing the above steps in turn until the audio to be converted ends, to obtain all the audio sentences of the audio to be converted.
Here, the first pause and the second pause include silent segments and non-speech segments in the audio to be converted, and the first sentence is a speech segment of the audio to be converted. It should be noted that "first pause" and "second pause" are used only to distinguish pauses at different times.
It can be understood that the division result of the audio sentences is closely related to the accuracy of the subsequent audio conversion: the higher the accuracy of the audio sentence division, the higher the accuracy of the audio conversion. In this embodiment, each pause has a minimum length limit, used to ignore short sound gaps such as a speaker's momentary breath, so as to protect the integrity of a sentence; each sentence obtained by division has a minimum length limit, used to filter out transient invalid information in the audio, such as a speaker's cough; meanwhile, each sentence obtained by division also has a maximum length limit, used to restrict the sentence length and improve the accuracy of the subsequent audio conversion.
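Steps a1–a4 together with the three length limits above can be sketched as a minimal Python function. It assumes pause detection has already produced a time-ordered list of (start, end) intervals; the function name and all threshold values are illustrative, not part of the patent.

```python
def _emit(sentences, start, end, min_len, max_len):
    """Append the span [start, end) as one or more sentences."""
    length = end - start
    while length > max_len:            # cap overly long sentences
        sentences.append((start, start + max_len))
        start += max_len
        length = end - start
    if length >= min_len:              # drop cough-like transient blips
        sentences.append((start, end))

def split_sentences(pauses, audio_end, min_pause=0.3, min_len=0.5, max_len=15.0):
    """Derive audio-sentence boundaries (in seconds) from detected pauses."""
    # Ignore very short pauses (e.g. a speaker's quick breath) so that a
    # single utterance is not broken apart -- the minimum pause limit.
    pauses = [p for p in pauses if p[1] - p[0] >= min_pause]
    sentences = []
    cursor = 0.0
    for p_start, p_end in pauses:
        _emit(sentences, cursor, p_start, min_len, max_len)
        cursor = p_end                 # next sentence starts after the pause
    _emit(sentences, cursor, audio_end, min_len, max_len)
    return sentences
```

For instance, with pauses at (2.0, 2.5) and (5.0, 5.6) seconds and a 0.1 s breath at 2.6 s, an 8-second clip splits into three sentences; the breath is filtered out by the minimum pause limit.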
Step S3: extracting a voiceprint feature from each audio sentence, comparing and analyzing the voiceprint feature of each audio sentence against the preset voiceprint feature library, determining the speaker identity information corresponding to each audio sentence, dividing the audio sentences into voice segments according to the speaker identity information, and determining the voice segment set corresponding to the audio to be converted.
Taking company P as an example, the preset voiceprint feature library includes the voiceprint feature of each employee of company P and the corresponding employee information. The speaker identity information includes the speaker's name, native place, accent, and so on.
In this embodiment, the step of "dividing the audio sentences into voice segments according to the speaker identity information" comprises:
merging temporally adjacent audio sentences whose corresponding speaker identity information is identical to generate one voice segment, and determining the start and end times of the voice segment according to the start and end times of the at least one audio sentence included in the voice segment.
Here, the start and end times of each voice segment are determined by the start and end times of the at least one audio sentence it contains; for example, the start time of the first audio sentence of a voice segment is taken as the start time of the voice segment, and the end time of its last audio sentence is taken as the end time of the voice segment.
The voice segment set includes the voice segments and the speaker identity information corresponding to each voice segment. For example, suppose the audio sentence division result, in chronological order, is: sentence 1, sentence 2, sentence 3, sentence 4, sentence 5, and that the speakers corresponding to the audio sentences are respectively X, Y, Y, Z, Y. Then the final voice segment set includes: {voice segment 1 (sentence 1), X}, {voice segment 2 (sentence 2, sentence 3), Y}, {voice segment 3 (sentence 4), Z}, {voice segment 4 (sentence 5), Y}.
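The merging rule in the example above — temporally adjacent sentences with the same speaker collapse into one voice segment spanning from the first sentence's start to the last sentence's end — can be sketched as follows; the (start, end, speaker) tuple layout is an assumption for illustration.

```python
def merge_into_segments(sentences):
    """sentences: list of (start, end, speaker) tuples in time order.
    Adjacent sentences with identical speaker identity are merged into
    one voice segment; the segment's start/end times come from its
    first and last contained sentence."""
    segments = []
    for start, end, speaker in sentences:
        if segments and segments[-1][2] == speaker:
            # Same speaker as the previous segment: extend its end time.
            segments[-1] = (segments[-1][0], end, speaker)
        else:
            segments.append((start, end, speaker))
    return segments
```

Applied to the five sentences above with speakers X, Y, Y, Z, Y, this yields the four segments of the example: sentence 2 and sentence 3 merge, while the later Y sentence stays separate because Z speaks in between.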
It can be understood that the preset voiceprint feature library needs to be updated periodically to improve the efficiency of voiceprint feature comparison. In addition, extracting voiceprint features from audio and comparing voiceprints are relatively mature technologies, which are not repeated here.
In other embodiments, the audio to be converted may also be speech entered in real time through microphones. In that case, the microphone signal channels are numbered in advance, and the correspondence between the microphone signal channel numbers and the speaker identity information is predefined before the meeting. When the audio to be converted is speech entered in real time through microphones, the corresponding speaker identity information can also be confirmed by the microphone signal channel number, which is not repeated here.
By determining the identity of the speaker, the above steps on the one hand determine the speaker corresponding to each audio sentence, which contributes to the completeness of the minutes; on the other hand, they allow the optimal speech conversion model to be called subsequently according to the speaker's identity information, thereby improving the accuracy of the speech conversion.
Step S4: calling, according to the speaker identity information corresponding to each voice segment in the voice segment set, the target speech recognition model corresponding to that voice segment, feeding each voice segment in turn into its corresponding target speech recognition model, and obtaining the text fragment corresponding to each voice segment, wherein the target speech recognition model is obtained by update training based on an accent corpus and an industry corpus.
In order to improve the accuracy of speech recognition, the target speech recognition model has undergone two rounds of update training on the basis of a general speech recognition model:
1) The general speech recognition model is update-trained according to the speakers' accents (that is, their language characteristics) to obtain a first speech recognition model, which is determined by the following steps:
dividing accents into several major classes, for example no accent (that is, standard Mandarin), Beijing accent, Shandong accent, Cantonese accent, Hunan accent, Sichuan accent, and so on, and collecting the recorded audio corresponding to each accent class;
preprocessing the recorded audio corresponding to each accent class, deleting segments that are hard or inconvenient to understand, converting the remaining segments into written text, and obtaining the corpus of each accent class;
feeding the processed audio and written text into the general speech recognition model, so that the model is optimized for the specific accent;
in actual meeting scenarios, feeding any segments found to be transcribed incorrectly back into the model for re-optimization, thereby obtaining the first speech recognition model corresponding to each accent class.
2) The first speech recognition model corresponding to each accent class is update-trained according to the company and the nature of its industry, to obtain a second speech recognition model, which is determined by the following steps:
compiling the company's/industry's special-purpose word list, and saving it in text form;
having designated persons read the special-purpose words aloud in each accent class, to form the audio file corresponding to each accent class;
feeding the text paired with the audio file into the first speech recognition model corresponding to each accent class for training, so that each first speech recognition model is optimized for the specific company/industry;
in actual meeting scenarios, feeding more corpora involving the proper nouns into the models for re-optimization, thereby obtaining the second speech recognition model corresponding to each accent class.
For example, suppose that voiceprint recognition determines that the speaker of current voice segment 1 is speaker X, and that X's identity information shows a Shandong accent; then the second speech recognition model corresponding to the Shandong accent is obtained as the target speech recognition model.
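The model selection in this example reduces to a lookup from speaker identity to accent class to model. The sketch below illustrates that lookup only; the profile entries and model identifiers are invented, and in practice the registry would hold the accent-specific second speech recognition models described above.

```python
# Hypothetical registry: accent class -> second speech recognition model id.
MODEL_REGISTRY = {
    "mandarin": "asr-mandarin-v2",   # the no-accent (standard Mandarin) model
    "shandong": "asr-shandong-v2",
    "cantonese": "asr-cantonese-v2",
}

# Filled in from the preset voiceprint feature library after identification.
SPEAKER_PROFILES = {
    "speaker_x": {"accent": "shandong"},
}

def select_model(speaker_id):
    accent = SPEAKER_PROFILES.get(speaker_id, {}).get("accent", "mandarin")
    # Fall back to the standard-Mandarin model when the speaker's accent
    # class has no fine-tuned model of its own.
    return MODEL_REGISTRY.get(accent, MODEL_REGISTRY["mandarin"])
```

Keeping the mapping in data rather than code means new accent classes or retrained models can be added without touching the recognition loop.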
The above steps train the general speech conversion model in advance, before the audio conversion is performed, and update-train the speech recognition model according to the speaker's accent characteristics, so as to improve the model's ability to recognize the speaker's speech; at the same time, the speech conversion model is also update-trained according to the company/industry, improving its ability to recognize company-specific business speech.
Step S5: merging the text fragments corresponding to the voice segments to generate the target text corresponding to the audio to be converted, and associating each text fragment in the target text with its corresponding voice segment and speaker identity information, to generate the meeting minutes corresponding to the audio to be converted.
For example, suppose the texts corresponding to the voice segments in the above voice segment set are obtained in turn as text 1, text 2, text 3, text 4 and text 5; the obtained texts are merged and spliced to obtain the target text corresponding to the audio to be converted. Then, according to the start and end times of each voice segment, the corresponding voice segment is cut out of the audio to be converted and associated with the corresponding text fragment in the target text; that is to say, each text fragment in the target text is marked with its corresponding speaker information and its corresponding voice segment, so as to generate the minutes. The generated minutes are saved and pushed to the terminal corresponding to the minutes generation instruction.
In this embodiment, the speaker information and the voice segments are associated with the text fragments in the form of hyperlinks. By associating the above information in the minutes, it is convenient for the minutes administrator to listen to relevant segments and to adjust the minutes.
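One way the hyperlink association might look is a plain HTML rendering, sketched below. The clip file naming and the segment field layout are invented for illustration; a real product would link into its own audio player rather than to bare `.wav` files.

```python
def render_minutes(segments):
    """segments: dicts with 'speaker', 'start', 'end', 'text' keys.
    Each text fragment is labelled with its speaker and hyperlinked to
    the audio clip cut out by the segment's start/end times."""
    lines = ["<ol>"]
    for i, seg in enumerate(segments, 1):
        lines.append(
            '<li><b>{speaker}</b> ({start:.1f}s-{end:.1f}s): {text} '
            '<a href="clips/segment_{n}.wav">[audio]</a></li>'.format(
                n=i, **seg))
    lines.append("</ol>")
    return "\n".join(lines)
```

Because each list item carries both the speaker label and the clip link, a reader can jump from any sentence of the minutes straight to the matching audio.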
With the meeting minutes generation method based on speech recognition proposed by the above embodiment, sentence division, voiceprint feature extraction and matching are performed on the audio to be converted, the voice segments corresponding to the audio to be converted are determined according to the matching result, and a different target speech recognition model is called to perform speech recognition on each voice segment, which improves the efficiency and accuracy of speech recognition and thereby the accuracy of the generated minutes; meanwhile, the minutes are generated by associating speaker identity information, voice segments and text fragments, which improves the completeness and convenience of the minutes.
Further, in order to improve the conversion accuracy of the audio to be converted, in another embodiment of the meeting minutes generation method based on speech recognition of the present invention, before step S2 the method further comprises: preprocessing the audio to be converted to obtain preprocessed audio to be converted.
In a typical meeting, various noises are produced under the influence of the surrounding environment, so the audio to be converted that corresponds to the minutes needs to be preprocessed. The preprocessing includes but is not limited to:
b1. echo cancellation; for example, an echo cancellation method can be used, or the magnitude of the echo signal can be estimated and the estimate subtracted from the received signal to cancel the echo;
b2. beamforming; for example, the user's speech information is collected from different directions by multiple microphones, the direction of the sound source is determined, and a weighted summation is performed according to the weights of the different directions — the weight of the sound-source direction is larger than the weight of sounds from other directions, so as to enhance the speech input by the user and weaken the influence of other sounds;
b3. noise reduction; for example, the noise can first be cancelled by a sound with the same frequency and amplitude as the noise but opposite phase, and reverberation then eliminated using a dereverberation audio plug-in or a microphone array;
b4. gain enhancement; for example, the audio is amplified by means of AGC (automatic gain control).
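As a toy illustration of step b4 only, the snippet below applies a single fixed gain so that the clip's peak reaches a target level. Real AGC — like the echo cancellation, beamforming and noise reduction of b1–b3 — adapts continuously over streaming audio, so this is a sketch of the idea, not a production front end.

```python
def auto_gain(samples, target_peak=0.9):
    """Scale a mono clip (floats in [-1, 1]) so its peak amplitude
    reaches target_peak -- a one-shot stand-in for AGC."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)          # silence: nothing to amplify
    gain = target_peak / peak
    return [s * gain for s in samples]
```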
With the meeting minutes generation method based on speech recognition proposed by the above embodiment, the audio to be converted is preprocessed to reduce external interference, which can improve the accuracy of speech recognition and thus lay a good foundation for subsequently generating the minutes.
Further, in order to make the minutes clearer, in another embodiment of the meeting minutes generation method based on speech recognition of the present invention, the method further comprises:
segmenting the minutes into words to obtain the word list after segmentation, and identifying keywords from the list after segmentation;
determining the text fragment set corresponding to each keyword, classifying each text fragment set according to the speaker identity information corresponding to each text fragment, and sorting the keywords and the text fragments corresponding to each keyword in chronological order, to obtain the sorted text fragment set corresponding to each keyword.
Here, the step of "segmenting the minutes into words" comprises: a) matching the minutes against a preset vocabulary to obtain a first list after segmentation, wherein the vocabulary consists of the company-specific special-purpose words compiled for the company's business and the proprietary words of the company's industry; b) for the remaining text left over from step a), performing segmentation using an understanding-based word segmentation method and a statistics-based word segmentation method, to obtain a second list after segmentation; c) removing meaningless stop words (empty function words), to obtain a third list after segmentation; d) merging the first list, the second list and the third list to obtain the final list after segmentation. Word segmentation finally turns the text into a list containing multiple words.
The step of "identifying keywords from the list after segmentation" comprises: a) calculating the information value of each word in the list, for example its tf-idf value (term frequency–inverse document frequency); b) judging whether the information value of each word is greater than or equal to a preset threshold, and determining the words whose information value is greater than or equal to the preset threshold as keywords, wherein the preset threshold can be adjusted according to actual needs.
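A minimal sketch of the tf-idf scoring in steps a)–b), treating each set of minutes in an archive as one document. The token lists are assumed to come from the segmentation pipeline described above, and the threshold value here is only a placeholder for the adjustable preset threshold.

```python
import math
from collections import Counter

def keywords(docs, doc_index, threshold=0.05):
    """docs: list of token lists, one per minutes document.
    Returns the tokens of docs[doc_index] whose tf-idf score is
    greater than or equal to the threshold."""
    tf = Counter(docs[doc_index])
    n_tokens = len(docs[doc_index])
    picked = []
    for word, count in tf.items():
        df = sum(1 for d in docs if word in d)          # document frequency
        score = (count / n_tokens) * math.log(len(docs) / df)
        if score >= threshold:
            picked.append(word)
    return picked
```

A word that appears in every document gets an idf of zero and is never picked, which is exactly why common filler terms fall below the threshold.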
Suppose that keywords A, B and C are identified from the minutes. Taking keyword A as an example, in the above sorting result the text fragment set corresponding to keyword A includes: text fragment 1 corresponding to speaker X, text fragments 2 and 4 corresponding to speaker Y, and text fragment 3 corresponding to speaker Z.
Further, the sorted text fragment set corresponding to each keyword may also include the voice segment corresponding to each text fragment; by associating text fragments with voice segments, it is convenient for the minutes administrator and enquirers to listen to the relevant segments.
With the meeting minutes generation method based on speech recognition proposed by the above embodiment, the minutes are generated by associating speaker identity information, voice segments, text fragments, keywords and so on, which improves the completeness and convenience of the minutes.
In another embodiment of the speech-recognition-based minutes generation method of the present invention, the method further includes:
responding to a minutes viewing instruction issued by a user, and displaying the minutes to the user; and/or
responding to a click operation on a text fragment issued by a user, and displaying the associated information of the text fragment to the user; for example, the associated information includes keywords, speaker identity information, and a link to the corresponding voice segment, and when the user clicks the voice segment link, the corresponding voice segment is played; and/or
responding to a minutes modification instruction issued by a user, and updating and saving the minutes based on the modification instruction; and/or
responding to a query instruction carrying a query field issued by a user, querying the minutes for text fragments matching the query field, and feeding the matched text fragments and their associated information back to the user in a preset form. The query field may or may not be a keyword, and the matching may be fuzzy matching or semantic matching, which will not be detailed here. After the matching text fragments are found, all text fragments corresponding to the query field, together with their associated information (for example, keywords, speaker identity information, and links to the corresponding voice segments), are displayed to the user in a preset form (for example, a tree diagram, or in chronological order).
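Purely as an illustration of the query response described above (the record layout and field names are assumptions of this sketch, not part of the claimed method), a fuzzy match over in-memory text fragments could look like:

```python
def query_minutes(fragments, query_field):
    """Return fragments whose text fuzzily matches the query field.

    `fragments` is a list of dicts with hypothetical keys
    'text', 'speaker', and 'audio_link'. Fuzzy matching here is a
    simple case-insensitive substring test; semantic matching would
    replace this predicate.
    """
    q = query_field.lower()
    hits = [f for f in fragments if q in f["text"].lower()]
    # Feed back the matched fragments with their associated information.
    return [{"text": f["text"], "speaker": f["speaker"],
             "audio_link": f["audio_link"]} for f in hits]

fragments = [
    {"text": "Budget review for Q3", "speaker": "A", "audio_link": "seg1.wav"},
    {"text": "Hiring plan update", "speaker": "B", "audio_link": "seg2.wav"},
]
result = query_minutes(fragments, "budget")
```

The fragments are assumed to be stored in chronological order, so the returned list is already sorted by time.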
The present invention also proposes an electronic device. Referring to Fig. 2, which is a schematic diagram of a preferred embodiment of the electronic device of the present invention.
In this embodiment, the electronic device 1 may be a terminal device with data processing capability, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer; the server may be a rack server, a blade server, a tower server, or a cabinet server.
The electronic device 1 includes a memory 11, a processor 12, and a network interface 13.
The memory 11 includes at least one type of readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, or optical disk. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1.
The memory 11 can be used not only to store the application software installed on the electronic device 1 and various types of data, such as the minutes generation program 10, but also to temporarily store data that has been output or will be output.
In some embodiments, the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example, to run the minutes generation program 10.
The network interface 13 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic equipment, for example, the terminals used by the minutes administrator and minutes inquirers. The components 11-13 of the electronic device 1 communicate with each other through a communication bus.
Fig. 2 shows only the electronic device 1 with components 11-13. Those skilled in the art will understand that the structure shown in Fig. 2 does not constitute a limitation of the electronic device 1, which may include fewer or more components than shown, or combine certain components, or have a different component arrangement.
Optionally, the electronic device 1 may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface.
Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like. The display, which may also be called a display screen or display unit, is used to display information processed in the electronic device 1 and to display a visual user interface.
In the embodiment of the electronic device 1 shown in Fig. 2, the memory 11, as a computer storage medium, stores the program code of the minutes generation program 10. When the processor 12 executes the program code of the minutes generation program 10, the following steps are implemented:
Receiving step: receiving a minutes generation instruction issued by a user and obtaining the audio to be converted according to the minutes generation instruction; alternatively, obtaining the audio to be converted from a preset storage path periodically or in real time.
In this embodiment, the user issues the minutes generation instruction to the electronic device 1 through a terminal, and the instruction includes the audio to be converted. The audio to be converted is the speech audio recorded during the meeting; it may be input and saved by the user through a speech device such as a microphone, or it may be a speech file downloaded from the Internet or imported locally by the user. The preset storage path is not limited to a database for storing minutes-related audio.
The step of obtaining the audio to be converted from the preset storage path periodically or in real time includes: periodically (for example, at 9:00 every morning and 5:30 every afternoon) judging whether unconverted minutes-related audio exists in the storage path; if so, taking the unconverted minutes-related audio as the audio to be converted; if not, judging that there is no audio to be converted. Alternatively, whenever a piece of minutes-related audio is written into the preset storage path, it is taken as the audio to be converted and read out, so as to execute the subsequent steps.
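The periodic check of the preset storage path can be sketched as follows; the file-extension filter and the bookkeeping set of already-converted names are assumptions of this sketch, not part of the method:

```python
import os

def find_unconverted_audio(store_path, converted):
    """Scan a preset storage path and return audio files not yet converted.

    `converted` is the set of file names already processed; any other
    audio file found in the path is treated as audio to be converted.
    """
    audio_exts = (".wav", ".mp3", ".flac")
    pending = []
    for name in sorted(os.listdir(store_path)):
        if name.lower().endswith(audio_exts) and name not in converted:
            pending.append(os.path.join(store_path, name))
    return pending
```

In practice this function would be run by a scheduler at the preset times, or replaced by a file-system watcher for the real-time variant.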
First division step: dividing the audio to be converted into sentences based on a preset sentence division rule to obtain the audio sentences of the audio to be converted.
The purpose of dividing the audio to be converted into sentences is to obtain short sentences that are easier to recognize, improving the accuracy of the subsequent audio-to-text conversion. In this embodiment, dividing the audio to be converted into sentences based on the preset sentence division rule to obtain the audio sentences of the audio to be converted includes:
a1. identifying the first pause in the audio to be converted, and recording the start time and end time of the first pause;
a2. identifying the first sentence in the audio to be converted, and taking the end time of the first pause as the start time of the first sentence;
a3. identifying the second pause, recording its start time and end time, and taking the start time of the second pause as the end time of the first sentence, thereby completing the division of the first sentence;
a4. repeating the above steps until the audio to be converted ends, obtaining all audio sentences of the audio to be converted.
Here, the first pause and the second pause include silent segments and non-speech segments in the audio to be converted; the first sentence is a speech segment of the audio to be converted. Note that "first pause" and "second pause" serve only to distinguish pauses at different times.
It can be understood that the division of audio sentences is closely related to the accuracy of the subsequent audio conversion: the more accurate the sentence division, the higher the conversion accuracy. In this embodiment, each pause has a minimum length limit, used to ignore short sound events such as the speaker's momentary breath, so as to protect the integrity of a sentence; each divided sentence has a minimum length limit, used to filter out short invalid information in the audio, such as the speaker's cough; meanwhile, each divided sentence also has a maximum length limit, used to bound the sentence length and improve the accuracy of the subsequent audio conversion.
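The pause-based sentence division with the minimum-pause, minimum-sentence, and maximum-sentence limits described above can be sketched over frame-level voice-activity flags; the frame size and threshold values are illustrative defaults, not values fixed by the method:

```python
def split_sentences(speech_flags, frame_ms=10,
                    min_pause_ms=200, min_sent_ms=300, max_sent_ms=10000):
    """Split frame-level voice-activity flags into sentence intervals.

    `speech_flags[i]` is True if frame i contains speech. A run of
    non-speech frames counts as a pause only if it lasts at least
    `min_pause_ms` (short breaths are ignored); sentences shorter than
    `min_sent_ms` (e.g. a cough) are dropped, and sentences longer than
    `max_sent_ms` are cut. Returns (start_ms, end_ms) intervals.
    """
    min_pause = min_pause_ms // frame_ms
    sentences, start, silence = [], None, 0
    for i, voiced in enumerate(speech_flags + [False] * min_pause):
        if voiced:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_pause:          # a real pause ends the sentence
                end = i - silence + 1
                if (end - start) * frame_ms >= min_sent_ms:
                    sentences.append((start * frame_ms, end * frame_ms))
                start, silence = None, 0
    # enforce the maximum sentence length by cutting long sentences
    out = []
    for s, e in sentences:
        while e - s > max_sent_ms:
            out.append((s, s + max_sent_ms))
            s += max_sent_ms
        out.append((s, e))
    return out
```

A real implementation would obtain the flags from a voice activity detector rather than receive them directly.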
Second division step: extracting voiceprint features from the audio sentences respectively, comparing and analyzing the voiceprint feature of each audio sentence against a preset voiceprint feature library to determine the speaker identity information corresponding to each audio sentence, dividing the audio sentences into voice segments according to the speaker identity information, and determining the voice segment set corresponding to the audio to be converted.
Taking company P as an example, the preset voiceprint feature library includes the voiceprint feature of each employee of company P and the corresponding employee information. The speaker identity information includes the speaker's name, birthplace, accent, and the like.
In this embodiment, the step of "dividing the audio sentences into voice segments according to the speaker identity information" includes:
merging audio sentences that are adjacent in time and correspond to the same speaker identity information into one voice segment, and determining the start and end times of the voice segment from the start and end times of the at least one audio sentence it contains.
That is, the start and end times of each voice segment are determined by the audio sentences it contains; for example, the start time of the first audio sentence of a voice segment is taken as the start time of the voice segment, and the end time of the last audio sentence as the end time of the voice segment.
The voice segment set includes the voice segments and the speaker identity information corresponding to each voice segment. For example, suppose the audio sentences obtained by division are, in chronological order, sentence 1, sentence 2, sentence 3, sentence 4, and sentence 5, and the speakers corresponding to the audio sentences are A, B, B, C, and B respectively; then the final voice segment set is: {voice segment 1 (sentence 1), A}, {voice segment 2 (sentences 2 and 3), B}, {voice segment 3 (sentence 4), C}, {voice segment 4 (sentence 5), B}.
It can be understood that the preset voiceprint feature library needs to be updated periodically to improve the efficiency of voiceprint comparison. In addition, extracting voiceprint features from audio and comparing voiceprints are mature technologies, which will not be detailed here.
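The merging rule above can be sketched as follows, assuming each audio sentence is represented as a (start, end, speaker) tuple:

```python
def merge_into_segments(sentences):
    """Merge time-adjacent audio sentences with the same speaker.

    `sentences` is a chronological list of (start, end, speaker) tuples.
    Returns voice segments as (start, end, speaker), where the segment's
    start/end come from its first/last contained sentence.
    """
    segments = []
    for start, end, speaker in sentences:
        if segments and segments[-1][2] == speaker:
            prev_start, _, _ = segments[-1]
            segments[-1] = (prev_start, end, speaker)  # extend current segment
        else:
            segments.append((start, end, speaker))
    return segments
```

With the example from the description (speakers A, B, B, C, B), this yields the four voice segments listed above.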
In other embodiments, the audio to be converted may also be speech recorded in real time through microphones. In that case, the microphone signal channels are numbered in advance, and the correspondence between microphone signal channel numbers and speaker identity information is established before the meeting. When the audio to be converted is speech recorded in real time through microphones, the corresponding speaker identity information can also be confirmed by the microphone signal channel number, which will not be detailed here.
By determining the speaker's identity, the above steps on the one hand determine the speaker corresponding to each audio sentence, which helps the completeness of the minutes; on the other hand, they allow the subsequent steps to call the optimal speech recognition model according to the speaker's identity information, improving the accuracy of the speech conversion.
Speech recognition step: calling the target speech recognition model corresponding to each voice segment according to the speaker identity information corresponding to each voice segment in the voice segment set, inputting each voice segment into its corresponding target speech recognition model in turn, and obtaining the text fragment corresponding to each voice segment, wherein the target speech recognition model is obtained by update training based on an accent corpus and an industry corpus.
To improve the accuracy of speech recognition, the target speech recognition model is obtained from a general speech recognition model through two rounds of update training:
1) Updating the general speech recognition model according to the speaker's accent (that is, speech characteristics) to obtain a first speech recognition model. The first speech recognition model is determined by the following steps:
dividing accents into several major categories, for example, no accent (that is, standard Mandarin), Beijing accent, Shandong accent, Guangdong accent, Hunan accent, Sichuan accent, and so on, and collecting recorded audio for each accent category;
preprocessing the recorded audio of each accent category, deleting segments that are hard to understand, and converting the remaining segments to written text to obtain the corpus of each accent category;
feeding the processed audio and written text into the general speech recognition model, so that the model is optimized for the specific accent;
in actual meeting scenarios, feeding segments found to be transcribed incorrectly back into the model for re-optimization, thereby obtaining the first speech recognition model corresponding to each accent category.
2) Updating the first speech recognition model corresponding to each accent category according to the company and industry characteristics to obtain a second speech recognition model. The second speech recognition model is determined by the following steps:
compiling a company/industry special vocabulary list and saving it in text form;
having designated persons read the special vocabulary aloud in each accent category, forming the audio file corresponding to each accent category;
feeding the matched text and audio files into the first speech recognition model of each accent category for training, so that each first speech recognition model is optimized for the specific company/industry;
in actual meeting scenarios, feeding more corpora related to the proprietary vocabulary into the model for re-optimization, thereby obtaining the second speech recognition model corresponding to each accent category.
For example, suppose voiceprint recognition determines that the speaker of the current voice segment 1 is speaker A, and A's identity information indicates a Shandong accent; then the second speech recognition model corresponding to the Shandong accent is obtained as the target speech recognition model.
By training a general speech recognition model in advance and, before the audio conversion, updating the speech recognition model according to the speaker's accent characteristics, the above steps improve the model's ability to recognize the speaker's speech; at the same time, by also updating the model according to the company/industry characteristics, they improve the model's ability to recognize the company's specific business speech.
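Selecting the target speech recognition model from the speaker identity information can be sketched as a registry lookup; the model names and the 'accent' field are hypothetical, and the fallback to the no-accent (standard Mandarin) model is an assumption of this sketch:

```python
# Hypothetical registry of second-stage models, keyed by accent category.
MODEL_REGISTRY = {
    "mandarin": "asr-mandarin-industry-v2",
    "shandong": "asr-shandong-industry-v2",
    "guangdong": "asr-guangdong-industry-v2",
}

def select_target_model(speaker_info, registry=MODEL_REGISTRY,
                        default="asr-mandarin-industry-v2"):
    """Pick the target speech recognition model for a voice segment.

    `speaker_info` is the identity record returned by the voiceprint
    lookup; unknown accents fall back to the standard Mandarin model.
    """
    accent = speaker_info.get("accent", "mandarin")
    return registry.get(accent, default)
```

For the example above, a speaker whose identity record carries a Shandong accent resolves to the Shandong second-stage model.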
Generation step: merging the text fragments corresponding to the voice segments to generate the target text corresponding to the audio to be converted, associating the corresponding voice segment and speaker identity information with each text fragment in the target text, and generating the minutes corresponding to the audio to be converted.
For example, suppose the texts corresponding to the voice segments in the above voice segment set are text 1, text 2, text 3, text 4, and text 5 in turn; the obtained texts are merged and spliced into the target text corresponding to the audio to be converted. Then, according to the start and end times of each voice segment, the corresponding voice segment is cut out of the audio to be converted and associated with the corresponding text fragment in the target text. That is, each text fragment in the target text is annotated with its corresponding speaker information and voice segment, so as to generate the minutes; the generated minutes are saved and pushed to the terminal corresponding to the minutes generation instruction.
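The generation step's association of text fragments with speakers and audio spans can be sketched as follows; the record layout is an assumption of this sketch, not part of the claimed method:

```python
def build_minutes(segments, texts):
    """Assemble minutes records from voice segments and their transcripts.

    `segments` is a list of (start, end, speaker) voice segments and
    `texts` the corresponding recognized text fragments. Each minutes
    entry links the text fragment to its speaker and to the audio span
    used to cut the voice segment out of the audio to be converted.
    """
    minutes = []
    for (start, end, speaker), text in zip(segments, texts):
        minutes.append({
            "text": text,
            "speaker": speaker,
            "audio_span": (start, end),
        })
    target_text = " ".join(texts)  # merged target text
    return target_text, minutes
```

In the described embodiment, the speaker and audio-span fields would be rendered as hyperlinks when the minutes are displayed.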
In this embodiment, the speaker information and voice segments are associated with the text fragments in the form of hyperlinks.
By associating the above information in the minutes, the minutes administrator can conveniently look up and listen to the relevant segments and adjust the minutes. The electronic device 1 proposed in the above embodiment divides the audio to be converted into sentences, extracts and matches voiceprint features, determines the voice segments of the audio to be converted according to the matching result, and calls a different target speech recognition model for each voice segment, improving the efficiency and accuracy of speech recognition and thus the accuracy of the generated minutes; at the same time, by associating speaker identity information, voice segments, and text fragments, it generates minutes with improved completeness and usability.
Further, to improve the conversion accuracy of the audio to be converted, in another embodiment of the electronic device 1 of the present invention, before the first division step, the program code of the minutes generation program 10 executed by the processor 12 also implements a preprocessing step.
Preprocessing step: preprocessing the audio to be converted to obtain preprocessed audio to be converted.
In a typical meeting, the ambient environment introduces various kinds of noise, so the audio to be converted corresponding to the minutes needs to be preprocessed. The preprocessing includes, but is not limited to:
b1. echo cancellation; for example, an echo cancellation method can be used, or the echo can be offset by estimating the magnitude of the echo signal and subtracting the estimate from the received signal;
b2. beamforming; for example, multiple microphones collect the user's speech from different directions to determine the direction of the sound source, and a weighted sum is computed with weights depending on direction, where the weight of the sound-source direction is larger than that of other directions, so as to enhance the speech input by the user and weaken the influence of other sounds;
b3. noise reduction; for example, noise can first be cancelled using sound of the same frequency and amplitude but opposite phase, and reverberation can then be eliminated using a dereverberation audio plug-in or a microphone array;
b4. gain enhancement; for example, amplifying the audio using AGC (automatic gain control).
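As a toy illustration of item b4 only (real automatic gain control adapts the gain over time; this fixed peak rescaling is merely a sketch under that simplification):

```python
def apply_gain_control(samples, target_peak=0.9):
    """Scale float samples in [-1.0, 1.0] so the peak reaches target_peak.

    Silent input is returned unchanged to avoid division by zero. A
    production AGC would instead track signal level and adjust gain
    continuously rather than rescaling the whole buffer at once.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```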
The electronic device 1 proposed in the above embodiment preprocesses the audio to be converted, reducing external interference, which improves the accuracy of speech recognition and lays a foundation for the subsequent generation of the minutes.
Further, to make the minutes clearer, in another embodiment of the electronic device 1 of the present invention, when the processor 12 executes the program code of the minutes generation program 10, the following steps are also implemented:
segmenting the minutes into words to obtain a segmented word list, and identifying keywords from the segmented word list; and
determining the text fragment set corresponding to each keyword, classifying the text fragment set according to the speaker identity information corresponding to each text fragment, and sorting each keyword and its corresponding text fragments in chronological order to obtain the sorted text fragment set of each keyword.
The step of "segmenting the minutes into words" includes: a) matching the minutes against a preset vocabulary to obtain a first segmented list, where the vocabulary consists of the company-specific terms prepared for the company's business and the proprietary terms of the company's industry; b) segmenting the text remaining after step a) using an understanding-based segmentation method and a statistics-based segmentation method to obtain a second segmented list; c) removing meaningless stop words to obtain a third segmented list; d) merging the first, second, and third lists to obtain the final segmented list. Through segmentation, the text is finally converted into a list of words.
The step of "identifying keywords from the segmented list" includes: a) calculating the information value of each word in the list, for example, its tf-idf value; b) judging whether the information value of each word is greater than or equal to a preset threshold, and determining the words whose information value is greater than or equal to the preset threshold as keywords, where the preset threshold can be adjusted according to actual needs.
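The tf-idf-based keyword identification can be sketched as follows; the reference corpus, the smoothing scheme, and the threshold value are illustrative choices, not values fixed by the method:

```python
import math
from collections import Counter

def extract_keywords(doc_words, corpus, threshold=0.5):
    """Identify keywords by tf-idf against a small reference corpus.

    `doc_words` is the segmented word list of the minutes; `corpus` is a
    list of other segmented documents used for the idf statistics (the
    document itself is counted once via the +1 smoothing). Words whose
    tf-idf meets `threshold` are returned as keywords.
    """
    tf = Counter(doc_words)
    n_docs = len(corpus) + 1
    keywords = []
    for word, count in tf.items():
        df = 1 + sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / df) + 1.0
        score = (count / len(doc_words)) * idf
        if score >= threshold:
            keywords.append(word)
    return keywords
```

Frequent function words score low because they occur in every reference document, while document-specific terms score high.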
Suppose three keywords K1, K2, and K3 are identified from the minutes. Taking keyword K1 as an example, in the above sorting result the text fragment set corresponding to keyword K1 includes: text fragment 1 corresponding to speaker A, text fragment 2 corresponding to speaker B, text fragment 4, and text fragment 3 corresponding to speaker C.
Further, the sorted text fragment set of each keyword may also include the voice segment corresponding to each text fragment. By associating text fragments with voice segments, the minutes administrator and inquirers can conveniently look up and listen to the relevant speech segments.
The electronic device 1 proposed in the above embodiment generates minutes by associating speaker identity information, voice segments, text fragments, keywords, and the like, improving the completeness and usability of the minutes.
In another embodiment of the electronic device 1 of the present invention, when the processor 12 executes the program code of the minutes generation program 10, the following steps are also implemented:
responding to a minutes viewing instruction issued by a user, and displaying the minutes to the user; and/or
responding to a click operation on a text fragment issued by a user, and displaying the associated information of the text fragment to the user; for example, the associated information includes keywords, speaker identity information, and a link to the corresponding voice segment, and when the user clicks the voice segment link, the corresponding voice segment is played; and/or
responding to a minutes modification instruction issued by a user, and updating and saving the minutes based on the modification instruction; and/or
responding to a query instruction carrying a query field issued by a user, querying the minutes for text fragments matching the query field, and feeding the matched text fragments and their associated information back to the user in a preset form. The query field may or may not be a keyword, and the matching may be fuzzy matching or semantic matching, which will not be detailed here. After the matching text fragments are found, all text fragments corresponding to the query field, together with their associated information (for example, keywords, speaker identity information, and links to the corresponding voice segments), are displayed to the user in a preset form (for example, a tree diagram, or in chronological order).
Optionally, in other embodiments, the minutes generation program 10 may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors 12 to complete the present invention. A module as referred to in the present invention is a series of computer program instruction segments capable of completing a specific function.
For example, referring to Fig. 3, which is a schematic diagram of the program modules of the minutes generation program 10 in Fig. 2.
In one embodiment of the minutes generation program 10, the program includes modules 110-150, in which:
Receiving module 110, used to receive a minutes generation instruction issued by a user and obtain the audio to be converted according to the minutes generation instruction, or to obtain the audio to be converted from a preset storage path periodically or in real time;
First division module 120, used to divide the audio to be converted into sentences based on a preset sentence division rule to obtain the audio sentences of the audio to be converted;
Second division module 130, used to extract voiceprint features from the audio sentences respectively, compare and analyze the voiceprint feature of each audio sentence against a preset voiceprint feature library to determine the speaker identity information corresponding to each audio sentence, divide the audio sentences into voice segments according to the speaker identity information, and determine the voice segment set corresponding to the audio to be converted;
Speech recognition module 140, used to call the target speech recognition model corresponding to each voice segment according to the speaker identity information corresponding to each voice segment in the voice segment set, input each voice segment into its corresponding target speech recognition model in turn, and obtain the text fragment corresponding to each voice segment, wherein the target speech recognition model is obtained by update training based on an accent corpus and an industry corpus; and
Generation module 150, used to merge the text fragments corresponding to the voice segments to generate the target text corresponding to the audio to be converted, associate the corresponding voice segment and speaker identity information with each text fragment in the target text, and generate the minutes corresponding to the audio to be converted.
The functions or operation steps implemented by the modules 110-150 are similar to those described above and will not be detailed here.
In addition, an embodiment of the present invention also proposes a computer-readable storage medium, which includes a minutes generation program 10, and when the minutes generation program 10 is executed by a processor, the following operations are implemented:
Receiving step: receiving a minutes generation instruction issued by a user and obtaining the audio to be converted according to the minutes generation instruction, or obtaining the audio to be converted from a preset storage path periodically or in real time;
First division step: dividing the audio to be converted into sentences based on a preset sentence division rule to obtain the audio sentences of the audio to be converted;
Second division step: extracting voiceprint features from the audio sentences respectively, comparing and analyzing the voiceprint feature of each audio sentence against a preset voiceprint feature library to determine the speaker identity information corresponding to each audio sentence, dividing the audio sentences into voice segments according to the speaker identity information, and determining the voice segment set corresponding to the audio to be converted;
Speech recognition step: calling the target speech recognition model corresponding to each voice segment according to the speaker identity information corresponding to each voice segment in the voice segment set, inputting each voice segment into its corresponding target speech recognition model in turn, and obtaining the text fragment corresponding to each voice segment, wherein the target speech recognition model is obtained by update training based on an accent corpus and an industry corpus; and
Generation step: merging the text fragments corresponding to the voice segments to generate the target text corresponding to the audio to be converted, associating the corresponding voice segment and speaker identity information with each text fragment in the target text, and generating the minutes corresponding to the audio to be converted.
The specific embodiment of the computer-readable storage medium of the present invention is substantially the same as the specific embodiment of the above speech-recognition-based minutes generation method, and will not be detailed here.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium as described above (such as ROM/RAM, magnetic disk, or optical disk) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the protection scope of the present invention.
Claims (10)
1. A meeting minutes generation method based on speech recognition, applicable to an electronic device, the method comprising:
a receiving step: receiving a minutes generation instruction issued by a user and obtaining audio to be converted according to the minutes generation instruction, or obtaining the audio to be converted from a preset storage path periodically or in real time;
a first division step: performing sentence division on the audio to be converted based on a preset sentence division rule to obtain audio sentences of the audio to be converted;
a second division step: extracting a voiceprint feature from each of the audio sentences, comparing and analyzing the voiceprint feature of each audio sentence against a preset voiceprint feature library to determine speaker identity information corresponding to each audio sentence, dividing the audio sentences into voice segments according to the speaker identity information, and determining a voice segment set corresponding to the audio to be converted;
a speech recognition step: invoking, according to the speaker identity information corresponding to each voice segment in the voice segment set, a target speech recognition model corresponding to that voice segment, and inputting each voice segment into its corresponding target speech recognition model in turn to obtain a text fragment corresponding to each voice segment, wherein each target speech recognition model is obtained by update training based on an accent corpus and an industry corpus; and
a generation step: merging the text fragments corresponding to the voice segments to generate a target text corresponding to the audio to be converted, associating each text fragment in the target text with its corresponding voice segment and speaker identity information, and generating the meeting minutes corresponding to the audio to be converted.
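The end-to-end flow of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the sentence-division, speaker-identification, and recognition components are passed in as hypothetical callables, since the claim leaves their internals to the later claims.

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    start: float
    end: float
    speaker: str = ""

@dataclass
class Segment:
    speaker: str
    sentences: list
    text: str = ""

def generate_minutes(audio, split_into_sentences, identify_speaker, recognizers):
    # First division step: pause-based sentence division.
    sentences = split_into_sentences(audio)
    # Second division step: label each sentence with a speaker, then merge
    # temporally adjacent sentences sharing a speaker into voice segments.
    for s in sentences:
        s.speaker = identify_speaker(audio, s)
    segments = []
    for s in sentences:
        if segments and segments[-1].speaker == s.speaker:
            segments[-1].sentences.append(s)
        else:
            segments.append(Segment(speaker=s.speaker, sentences=[s]))
    # Speech recognition step: route each segment to the recognition model
    # keyed by its speaker identity (standing in for the per-speaker
    # "target speech recognition model").
    for seg in segments:
        seg.text = recognizers[seg.speaker](audio, seg.sentences)
    # Generation step: merge fragments while keeping the speaker association.
    return [(seg.speaker, seg.text) for seg in segments]
```

The speaker-keyed `recognizers` mapping mirrors the claim's idea that each voice segment is sent to the model trained for that speaker's accent and domain.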
2. The meeting minutes generation method based on speech recognition according to claim 1, wherein the first division step comprises:
identifying a first pause in the audio to be converted, and recording a start time and an end time of the first pause;
identifying a first sentence in the audio to be converted, and taking the end time of the first pause as a start time of the first sentence;
identifying a second pause, recording a start time and an end time of the second pause, and taking the start time of the second pause as an end time of the first sentence, thereby completing the division of the first sentence; and
repeating the above steps until the audio to be converted ends, to obtain all audio sentences of the audio to be converted.
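The pause-driven division of claim 2 can be sketched as below. This is an illustrative simplification assuming frame-level energies and a silence threshold; a production system would use a proper voice-activity detector rather than a raw energy gate.

```python
def divide_by_pauses(frame_energy, threshold, frame_s=0.01, min_pause_frames=3):
    """Return (start, end) times of sentences bounded by pauses."""
    sentences, pause_len, sent_start = [], 0, None
    for i, e in enumerate(frame_energy):
        if e < threshold:
            pause_len += 1
            # A sufficiently long pause closes the open sentence; the pause's
            # start time becomes the sentence's end time, as in the claim.
            if pause_len == min_pause_frames and sent_start is not None:
                sentences.append((sent_start, (i - min_pause_frames + 1) * frame_s))
                sent_start = None
        else:
            # The end of a pause becomes the start of the next sentence.
            if sent_start is None:
                sent_start = i * frame_s
            pause_len = 0
    if sent_start is not None:  # audio ended mid-sentence
        sentences.append((sent_start, len(frame_energy) * frame_s))
    return sentences
```

The `min_pause_frames` guard keeps brief intra-sentence silences from splitting a sentence, which matches the claim's notion of identifying a pause as a bounded interval rather than a single silent frame.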
3. The meeting minutes generation method based on speech recognition according to claim 1, wherein dividing the audio sentences into voice segments according to the speaker identity information comprises:
merging temporally adjacent audio sentences whose corresponding speaker identity information is identical into one voice segment, and determining a start time and an end time of the voice segment according to the start and end times of the at least one audio sentence that the voice segment comprises.
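Claim 3's merging rule can be sketched directly: adjacent same-speaker sentences collapse into one voice segment whose start and end times come from the first and last sentence it contains. The tuple/dict shapes here are illustrative, not from the patent.

```python
def merge_into_segments(sentences):
    """sentences: list of (start, end, speaker) tuples in time order."""
    segments = []
    for start, end, speaker in sentences:
        if segments and segments[-1]["speaker"] == speaker:
            # Same speaker as the open segment: extend its end time.
            segments[-1]["end"] = end
        else:
            # Speaker change: open a new voice segment.
            segments.append({"speaker": speaker, "start": start, "end": end})
    return segments
```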
4. The meeting minutes generation method based on speech recognition according to any one of claims 1 to 3, wherein, before the first division step, the method further comprises:
a preprocessing step: preprocessing the audio to be converted to obtain preprocessed audio to be converted, the preprocessing comprising: echo cancellation, beamforming, noise reduction, and gain amplification.
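Of the preprocessing operations named in claim 4, echo cancellation, beamforming, and noise reduction are normally delegated to DSP libraries; the gain-amplification part alone can be sketched as a simple peak normalization. This is an assumed stand-in, not the patent's method.

```python
def normalize_gain(samples, target_peak=0.9):
    """Scale samples so the loudest one reaches target_peak."""
    peak = max(abs(s) for s in samples) or 1.0  # avoid dividing by zero on silence
    return [s * target_peak / peak for s in samples]
```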
5. The meeting minutes generation method based on speech recognition according to claim 4, wherein the method further comprises:
performing word segmentation on the meeting minutes to obtain a segmented word list, and identifying keywords from the segmented word list; and
determining a text fragment set corresponding to each keyword, classifying each text fragment set according to the speaker identity information corresponding to each text fragment, and sorting each keyword and its corresponding text fragments in chronological order to obtain a sorted text fragment set corresponding to each keyword.
6. The meeting minutes generation method based on speech recognition according to claim 5, wherein the method further comprises:
responding to a minutes viewing instruction issued by the user by displaying the meeting minutes to the user; and/or
responding to a click operation by the user on a text fragment by displaying associated information corresponding to the text fragment to the user, the associated information comprising a keyword, speaker identity information, and a link to the corresponding voice segment, wherein when the user clicks the voice segment link, the corresponding voice segment is played; and/or
responding to a minutes modification instruction issued by the user by updating and saving the meeting minutes based on the modification instruction; and/or
responding to a query instruction carrying a query field issued by the user by querying the meeting minutes for text fragments matching the query field and feeding back the matched text fragments and their associated information to the user in a preset form.
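The query branch of claim 6 reduces to matching a query field against the minutes' text fragments and returning each hit with its associated information. A minimal sketch, with assumed field names (`text`, `speaker`, `segment_link`):

```python
def query_minutes(minutes, query_field):
    """minutes: list of dicts with 'text', 'speaker', 'segment_link' keys."""
    q = query_field.lower()
    # Return matched fragments together with their associated information,
    # including the voice-segment link used for playback on click.
    return [
        {"text": f["text"], "speaker": f["speaker"], "link": f["segment_link"]}
        for f in minutes
        if q in f["text"].lower()
    ]
```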
7. An electronic device, comprising a memory and a processor, the memory storing a minutes generation program runnable on the processor, wherein the minutes generation program, when executed by the processor, implements the following steps:
a receiving step: receiving a minutes generation instruction issued by a user and obtaining audio to be converted according to the minutes generation instruction, or obtaining the audio to be converted from a preset storage path periodically or in real time;
a first division step: performing sentence segmentation on the audio to be converted based on a preset audio sentence segmentation rule to obtain audio sentences of the audio to be converted;
a second division step: extracting a voiceprint feature from each of the audio sentences, comparing and analyzing the voiceprint feature of each audio sentence against a preset voiceprint feature library to determine speaker identity information corresponding to each audio sentence, dividing the audio sentences into voice segments according to the speaker identity information, and determining a voice segment set corresponding to the audio to be converted;
a speech recognition step: invoking, according to the speaker identity information corresponding to each voice segment in the voice segment set, a target speech recognition model corresponding to that voice segment, and inputting each voice segment into its corresponding target speech recognition model in turn to obtain a text fragment corresponding to each voice segment, wherein each target speech recognition model is obtained by update training based on an accent corpus and an industry corpus; and
a generation step: merging the text fragments corresponding to the voice segments to generate a target text corresponding to the audio to be converted, associating each text fragment in the target text with its corresponding voice segment and speaker identity information, and generating the meeting minutes corresponding to the audio to be converted.
8. The electronic device according to claim 7, wherein the minutes generation program, when executed by the processor, further implements the following steps:
performing word segmentation on the meeting minutes to obtain a segmented word list, and identifying keywords from the segmented word list; and
determining a text fragment set corresponding to each keyword, classifying each text fragment set according to the speaker identity information corresponding to each text fragment, and sorting each keyword and its corresponding text fragments in chronological order to obtain a sorted text fragment set corresponding to each keyword.
9. The electronic device according to claim 7, wherein the minutes generation program, when executed by the processor, further implements the following steps:
responding to a minutes viewing instruction issued by the user by displaying the meeting minutes to the user; and/or
responding to a click operation by the user on a text fragment by displaying associated information corresponding to the text fragment to the user, the associated information comprising a keyword, speaker identity information, and a link to the corresponding voice segment, wherein when the user clicks the voice segment link, the corresponding voice segment is played; and/or
responding to a minutes modification instruction issued by the user by updating and saving the meeting minutes based on the modification instruction; and/or
responding to a query instruction carrying a query field issued by the user by querying the meeting minutes for text fragments matching the query field and feeding back the matched text fragments and their associated information to the user in a preset form.
10. A computer-readable storage medium, wherein the computer-readable storage medium comprises a minutes generation program which, when executed by a processor, implements the steps of the meeting minutes generation method based on speech recognition according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910627403.1A CN110335612A (en) | 2019-07-11 | 2019-07-11 | Minutes generation method, device and storage medium based on speech recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910627403.1A CN110335612A (en) | 2019-07-11 | 2019-07-11 | Minutes generation method, device and storage medium based on speech recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110335612A true CN110335612A (en) | 2019-10-15 |
Family
ID=68146486
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910627403.1A Pending CN110335612A (en) | 2019-07-11 | 2019-07-11 | Minutes generation method, device and storage medium based on speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110335612A (en) |
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120072211A1 (en) * | 2010-09-16 | 2012-03-22 | Nuance Communications, Inc. | Using codec parameters for endpoint detection in speech recognition |
CN104427292A (en) * | 2013-08-22 | 2015-03-18 | 中兴通讯股份有限公司 | Method and device for extracting a conference summary |
CN105632498A (en) * | 2014-10-31 | 2016-06-01 | 株式会社东芝 | Method, device and system for generating conference record |
CN106683662A (en) * | 2015-11-10 | 2017-05-17 | 中国电信股份有限公司 | Speech recognition method and device |
CN105632484A (en) * | 2016-02-19 | 2016-06-01 | 上海语知义信息技术有限公司 | Voice synthesis database pause information automatic marking method and system |
CN105719642A (en) * | 2016-02-29 | 2016-06-29 | 黄博 | Continuous and long voice recognition method and system and hardware equipment |
CN105845129A (en) * | 2016-03-25 | 2016-08-10 | 乐视控股(北京)有限公司 | Method and system for dividing sentences in audio and automatic caption generation method and system for video files |
CN107039035A (en) * | 2017-01-10 | 2017-08-11 | 上海优同科技有限公司 | A kind of detection method of voice starting point and ending point |
CN108447471A (en) * | 2017-02-15 | 2018-08-24 | 腾讯科技(深圳)有限公司 | Audio recognition method and speech recognition equipment |
CN107689225A (en) * | 2017-09-29 | 2018-02-13 | 福建实达电脑设备有限公司 | A kind of method for automatically generating minutes |
CN108335697A (en) * | 2018-01-29 | 2018-07-27 | 北京百度网讯科技有限公司 | Minutes method, apparatus, equipment and computer-readable medium |
CN108363765A (en) * | 2018-02-06 | 2018-08-03 | 深圳市鹰硕技术有限公司 | The recognition methods of audio paragraph and device |
CN108763338A (en) * | 2018-05-14 | 2018-11-06 | 山东亿云信息技术有限公司 | A kind of News Collection&Edit System based on power industry |
CN108986826A (en) * | 2018-08-14 | 2018-12-11 | 中国平安人寿保险股份有限公司 | Automatically generate method, electronic device and the readable storage medium storing program for executing of minutes |
CN109388701A (en) * | 2018-08-17 | 2019-02-26 | 深圳壹账通智能科技有限公司 | Minutes generation method, device, equipment and computer storage medium |
CN109325737A (en) * | 2018-09-17 | 2019-02-12 | 态度国际咨询管理(深圳)有限公司 | A kind of enterprise intelligent virtual assistant system and its method |
CN109754808A (en) * | 2018-12-13 | 2019-05-14 | 平安科技(深圳)有限公司 | Method, apparatus, computer equipment and the storage medium of voice conversion text |
CN109767757A (en) * | 2019-01-16 | 2019-05-17 | 平安科技(深圳)有限公司 | A kind of minutes generation method and device |
Non-Patent Citations (1)
Title |
---|
ZENG Chao et al.: "A real-time speaker-dependent Chinese speech recognition system with a limited command set", Proceedings of the Third National Conference on Man-Machine Speech Communication (NCMMSC1994) *
Cited By (67)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021073116A1 (en) * | 2019-10-18 | 2021-04-22 | 平安科技(深圳)有限公司 | Method and apparatus for generating legal document, device and storage medium |
CN110691258A (en) * | 2019-10-30 | 2020-01-14 | 中央电视台 | Program material manufacturing method and device, computer storage medium and electronic equipment |
CN110837557A (en) * | 2019-11-05 | 2020-02-25 | 北京声智科技有限公司 | Abstract generation method, device, equipment and medium |
CN110837557B (en) * | 2019-11-05 | 2023-02-17 | 北京声智科技有限公司 | Abstract generation method, device, equipment and medium |
CN110875036A (en) * | 2019-11-11 | 2020-03-10 | 广州国音智能科技有限公司 | Voice classification method, device, equipment and computer readable storage medium |
CN110767235A (en) * | 2019-11-14 | 2020-02-07 | 北京中电慧声科技有限公司 | Voice transcription processing device with role separation function and control method |
CN110992958B (en) * | 2019-11-19 | 2021-06-22 | 深圳追一科技有限公司 | Content recording method, content recording apparatus, electronic device, and storage medium |
CN110992958A (en) * | 2019-11-19 | 2020-04-10 | 深圳追一科技有限公司 | Content recording method, content recording apparatus, electronic device, and storage medium |
CN110930984A (en) * | 2019-12-04 | 2020-03-27 | 北京搜狗科技发展有限公司 | Voice processing method and device and electronic equipment |
CN110995943A (en) * | 2019-12-25 | 2020-04-10 | 携程计算机技术(上海)有限公司 | Multi-user streaming voice recognition method, system, device and medium |
CN110995943B (en) * | 2019-12-25 | 2021-05-07 | 携程计算机技术(上海)有限公司 | Multi-user streaming voice recognition method, system, device and medium |
CN111192587A (en) * | 2019-12-27 | 2020-05-22 | 拉克诺德(深圳)科技有限公司 | Voice data matching method and device, computer equipment and storage medium |
CN111177353A (en) * | 2019-12-27 | 2020-05-19 | 拉克诺德(深圳)科技有限公司 | Text record generation method and device, computer equipment and storage medium |
CN111625614A (en) * | 2020-01-20 | 2020-09-04 | 全息空间(深圳)智能科技有限公司 | Live broadcast platform voice collection method, system and storage medium |
CN111312216A (en) * | 2020-02-21 | 2020-06-19 | 厦门快商通科技股份有限公司 | Voice marking method containing multiple speakers and computer readable storage medium |
CN113408996A (en) * | 2020-03-16 | 2021-09-17 | 上海博泰悦臻网络技术服务有限公司 | Schedule management method, schedule management device and computer readable storage medium |
CN111405235A (en) * | 2020-04-20 | 2020-07-10 | 杭州大轶科技有限公司 | Video conference method and system based on artificial intelligence recognition and extraction |
CN111629267A (en) * | 2020-04-30 | 2020-09-04 | 腾讯科技(深圳)有限公司 | Audio labeling method, device, equipment and computer readable storage medium |
CN111739536A (en) * | 2020-05-09 | 2020-10-02 | 北京捷通华声科技股份有限公司 | Audio processing method and device |
CN111353038A (en) * | 2020-05-25 | 2020-06-30 | 深圳市友杰智新科技有限公司 | Data display method and device, computer equipment and storage medium |
CN111739553B (en) * | 2020-06-02 | 2024-04-05 | 深圳市未艾智能有限公司 | Conference sound collection, conference record and conference record presentation method and device |
CN111739553A (en) * | 2020-06-02 | 2020-10-02 | 深圳市未艾智能有限公司 | Conference sound acquisition method, conference recording method, conference record presentation method and device |
CN111785275A (en) * | 2020-06-30 | 2020-10-16 | 北京捷通华声科技股份有限公司 | Voice recognition method and device |
CN111785260B (en) * | 2020-07-08 | 2023-10-27 | 泰康保险集团股份有限公司 | Clause method and device, storage medium and electronic equipment |
CN111785260A (en) * | 2020-07-08 | 2020-10-16 | 泰康保险集团股份有限公司 | Sentence dividing method and device, storage medium and electronic equipment |
CN111968657A (en) * | 2020-08-17 | 2020-11-20 | 北京字节跳动网络技术有限公司 | Voice processing method and device, electronic equipment and computer readable medium |
WO2022037388A1 (en) * | 2020-08-17 | 2022-02-24 | 北京字节跳动网络技术有限公司 | Voice generation method and apparatus, device, and computer readable medium |
CN114079695A (en) * | 2020-08-18 | 2022-02-22 | 北京有限元科技有限公司 | Method, device and storage medium for recording voice call content |
CN112017632A (en) * | 2020-09-02 | 2020-12-01 | 浪潮云信息技术股份公司 | Automatic conference record generation method |
CN112165599A (en) * | 2020-10-10 | 2021-01-01 | 广州科天视畅信息科技有限公司 | Automatic conference summary generation method for video conference |
CN112270918A (en) * | 2020-10-22 | 2021-01-26 | 北京百度网讯科技有限公司 | Information processing method, device, system, electronic equipment and storage medium |
CN113010704A (en) * | 2020-11-18 | 2021-06-22 | 北京字跳网络技术有限公司 | Interaction method, device, equipment and medium for conference summary |
WO2022105861A1 (en) * | 2020-11-20 | 2022-05-27 | 北京有竹居网络技术有限公司 | Method and apparatus for recognizing voice, electronic device and medium |
CN112562682A (en) * | 2020-12-02 | 2021-03-26 | 携程计算机技术(上海)有限公司 | Identity recognition method, system, equipment and storage medium based on multi-person call |
CN112837690A (en) * | 2020-12-30 | 2021-05-25 | 科大讯飞股份有限公司 | Audio data generation method, audio data transcription method and device |
CN112820297A (en) * | 2020-12-30 | 2021-05-18 | 平安普惠企业管理有限公司 | Voiceprint recognition method and device, computer equipment and storage medium |
CN112839195A (en) * | 2020-12-30 | 2021-05-25 | 深圳市皓丽智能科技有限公司 | Method and device for consulting meeting record, computer equipment and storage medium |
CN112839195B (en) * | 2020-12-30 | 2023-10-10 | 深圳市皓丽智能科技有限公司 | Conference record consulting method and device, computer equipment and storage medium |
CN112837690B (en) * | 2020-12-30 | 2024-04-16 | 科大讯飞股份有限公司 | Audio data generation method, audio data transfer method and device |
CN112395420A (en) * | 2021-01-19 | 2021-02-23 | 平安科技(深圳)有限公司 | Video content retrieval method and device, computer equipment and storage medium |
CN112800269A (en) * | 2021-01-20 | 2021-05-14 | 上海明略人工智能(集团)有限公司 | Conference record generation method and device |
CN112887659B (en) * | 2021-01-29 | 2023-06-23 | 深圳前海微众银行股份有限公司 | Conference recording method, device, equipment and storage medium |
CN112887659A (en) * | 2021-01-29 | 2021-06-01 | 深圳前海微众银行股份有限公司 | Conference recording method, device, equipment and storage medium |
CN113327619A (en) * | 2021-02-26 | 2021-08-31 | 山东大学 | Conference recording method and system based on cloud-edge collaborative architecture |
CN113327619B (en) * | 2021-02-26 | 2022-11-04 | 山东大学 | Conference recording method and system based on cloud-edge collaborative architecture |
CN113051426A (en) * | 2021-03-18 | 2021-06-29 | 深圳市声扬科技有限公司 | Audio information classification method and device, electronic equipment and storage medium |
CN113055529B (en) * | 2021-03-29 | 2022-12-13 | 深圳市艾酷通信软件有限公司 | Recording control method and recording control device |
CN113055529A (en) * | 2021-03-29 | 2021-06-29 | 深圳市艾酷通信软件有限公司 | Recording control method and recording control device |
CN113113018A (en) * | 2021-04-16 | 2021-07-13 | 钦州云之汇大数据科技有限公司 | Enterprise intelligent management system and method based on big data |
CN112995572A (en) * | 2021-04-23 | 2021-06-18 | 深圳市黑金工业制造有限公司 | Remote conference system and physical display method in remote conference |
CN113207032A (en) * | 2021-04-29 | 2021-08-03 | 读书郎教育科技有限公司 | System and method for increasing subtitles by recording videos in intelligent classroom |
CN113299279A (en) * | 2021-05-18 | 2021-08-24 | 上海明略人工智能(集团)有限公司 | Method, apparatus, electronic device and readable storage medium for associating voice data and retrieving voice data |
CN113595868A (en) * | 2021-06-28 | 2021-11-02 | 深圳云之家网络有限公司 | Voice message processing method and device based on instant messaging and computer equipment |
CN113488025B (en) * | 2021-07-14 | 2024-05-14 | 维沃移动通信(杭州)有限公司 | Text generation method, device, electronic equipment and readable storage medium |
CN113488025A (en) * | 2021-07-14 | 2021-10-08 | 维沃移动通信(杭州)有限公司 | Text generation method and device, electronic equipment and readable storage medium |
CN113539269A (en) * | 2021-07-20 | 2021-10-22 | 上海明略人工智能(集团)有限公司 | Audio information processing method, system and computer readable storage medium |
CN113409774A (en) * | 2021-07-20 | 2021-09-17 | 北京声智科技有限公司 | Voice recognition method and device and electronic equipment |
CN113658599A (en) * | 2021-08-18 | 2021-11-16 | 平安普惠企业管理有限公司 | Conference record generation method, device, equipment and medium based on voice recognition |
CN114125368A (en) * | 2021-11-30 | 2022-03-01 | 北京字跳网络技术有限公司 | Conference audio participant association method and device and electronic equipment |
CN114125368B (en) * | 2021-11-30 | 2024-01-30 | 北京字跳网络技术有限公司 | Conference audio participant association method and device and electronic equipment |
CN114330369A (en) * | 2022-03-15 | 2022-04-12 | 深圳文达智通技术有限公司 | Local production marketing management method, device and equipment based on intelligent voice analysis |
CN115174285B (en) * | 2022-07-26 | 2024-02-27 | 中国工商银行股份有限公司 | Conference record generation method and device and electronic equipment |
CN115174285A (en) * | 2022-07-26 | 2022-10-11 | 中国工商银行股份有限公司 | Conference record generation method and device and electronic equipment |
CN115906781A (en) * | 2022-12-15 | 2023-04-04 | 广州文石信息科技有限公司 | Method, device and equipment for audio identification and anchor point addition and readable storage medium |
CN115906781B (en) * | 2022-12-15 | 2023-11-24 | 广州文石信息科技有限公司 | Audio identification anchor adding method, device, equipment and readable storage medium |
CN115828907A (en) * | 2023-02-16 | 2023-03-21 | 南昌航天广信科技有限责任公司 | Intelligent conference management method, system, readable storage medium and computer equipment |
CN117456984A (en) * | 2023-10-26 | 2024-01-26 | 杭州捷途慧声科技有限公司 | Voice interaction method and system based on voiceprint recognition |
Similar Documents
Publication | Title
---|---
CN110335612A (en) | Minutes generation method, device and storage medium based on speech recognition
CN112804400B (en) | Customer service call voice quality inspection method and device, electronic equipment and storage medium
CN107038220B (en) | Method, intelligent robot and system for generating memorandum
US20170300487A1 (en) | System and method for enhancing voice-enabled search based on automated demographic identification
CN104969288B (en) | Method and system for providing a voice recognition system based on voice recording logs
US7983910B2 (en) | Communicating across voice and text channels with emotion preservation
US11189277B2 (en) | Dynamic gazetteers for personalized entity recognition
CN108305626A (en) | Voice control method and device for application programs
CN108829765A (en) | Information query method, device, computer equipment and storage medium
CN110349564A (en) | Cross-language speech recognition method and device
CN109256150A (en) | Speech emotion recognition system and method based on machine learning
CN109256136A (en) | Speech recognition method and device
CN110134756A (en) | Minutes generation method, electronic device and storage medium
CN109801638B (en) | Voice verification method, device, computer equipment and storage medium
CN110047481A (en) | Speech recognition method and device
CN110933225B (en) | Call information acquisition method and device, storage medium and electronic equipment
CN104252464A (en) | Information processing method and information processing device
CN109190124A (en) | Method and apparatus for word segmentation
CN112925945A (en) | Conference summary generation method, device, equipment and storage medium
CN109754808B (en) | Method, device, computer equipment and storage medium for converting voice into text
US20210118464A1 (en) | Method and apparatus for emotion recognition from speech
KR102312993B1 (en) | Method and apparatus for implementing interactive message using artificial neural network
CN113920986A (en) | Conference record generation method, device, equipment and storage medium
KR20150041592A (en) | Method for updating contact information in callee electronic device, and the electronic device
CN112468665A (en) | Method, device, equipment and storage medium for generating conference summary
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191015