CN108847241A - Method, electronic device, and storage medium for recognizing conference speech as text - Google Patents

Method, electronic device, and storage medium for recognizing conference speech as text - Download PDF

Info

Publication number
CN108847241A
CN108847241A (application CN201810581922.4A)
Authority
CN
China
Prior art keywords
text
word
speech recognition
default
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810581922.4A
Other languages
Chinese (zh)
Other versions
CN108847241B (en)
Inventor
王健宗
于夕畔
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Priority to CN201810581922.4A
Priority to PCT/CN2018/108113 (published as WO2019232991A1)
Publication of CN108847241A
Application granted
Publication of CN108847241B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A method for recognizing conference speech as text, comprising: converting the conference speech to be recognized into text by speech recognition, to obtain an initial speech recognition text; matching the initial speech recognition text against preset text databases to obtain a matched speech recognition text; generating, from the matched speech recognition text, a speech recognition text draft in an editable state; and, when it is detected that an edit operation has been received on the draft, generating from the edited text a speech recognition text in a non-editable state as the final speech recognition text. The present invention also provides an electronic device and a storage medium for recognizing conference speech as text. By first matching the preliminarily recognized speech against the preset text databases and then having a person perform a second confirmation, the invention effectively ensures the correctness of the output text, reduces the proofreading workload for conference content, and improves efficiency.

Description

Method, electronic device, and storage medium for recognizing conference speech as text
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a method, an electronic device, and a storage medium for recognizing conference speech as text.
Background technique
Automatic speech recognition (ASR) is a core technology in fields such as machine translation, robot control, and next-generation human-computer interfaces. It lets a computer "take dictation" of continuous speech from different speakers, converting "sound" into "text".
With the continuous development of speech recognition technology, applications based on speech recognition have become increasingly widespread, and the technology has penetrated family life, the office, entertainment, and other domains. A user inputs speech through an external or built-in microphone of a terminal such as a personal computer, notebook computer, tablet computer, dedicated learning terminal, or smartphone, and a speech recognition device completes the speech-to-text conversion.
Many speech recognition devices already exist, for example the widely used speech recognition services of Nuance and Google internationally and of iFlytek in China. The biggest problem they face, however, is recognition accuracy: even Nuance, with the highest accuracy among existing systems, cannot avoid the following problems. Frequent occurrences of irrelevant words such as filler particles make text analysis harder, some domain-specific keywords are recognized inaccurately, and taboo or sensitive words cannot be identified, all of which hurt the readability and analyzability of the meeting text.
Summary of the invention
In view of the foregoing, it is necessary to propose a method, an electronic device, and a storage medium for recognizing conference speech as text. Through the two-step process of matching against preset text databases and manual confirmation, the correctness of the output text is effectively ensured, unreasonable wording in conventional speech-to-text output is improved, the proofreading workload for conference content is effectively reduced, and efficiency is improved.
The first aspect of the present invention provides a method for recognizing conference speech as text, the method comprising:
converting the conference speech to be recognized into text by speech recognition, as an initial speech recognition text;
matching the initial speech recognition text against preset text databases to obtain a matched speech recognition text;
generating, from the matched speech recognition text, a speech recognition text draft in an editable state;
when it is detected that an edit operation has been received on the speech recognition text draft, generating, from the edited speech recognition text, a speech recognition text in a non-editable state as the final speech recognition text.
Preferably, matching the initial speech recognition text against the preset text databases comprises:
matching the initial speech recognition text against a preset first text database to obtain a first matching result;
matching the first matching result against a preset second text database to obtain a second matching result;
matching the second matching result against a preset third text database;
wherein the preset first text database stores a plurality of filler words (modal particles), the preset second text database stores a plurality of technical terms and their corresponding pinyin, and the preset third text database stores a plurality of taboo or sensitive words.
Preferably, matching the initial speech recognition text against the preset first text database comprises:
judging whether the initial speech recognition text contains a first word that matches a word in the preset first text database;
when it is determined that the initial speech recognition text contains a first word matching a word in the preset first text database, judging, with a pre-trained modal-particle model based on a deep learning network, whether the matched first word is a filler word to be deleted;
when the matched first word is determined to be a filler word to be deleted, removing the matched first word from the initial speech recognition text;
when the matched first word is determined not to be a filler word to be deleted, retaining the matched first word in the initial speech recognition text.
Preferably, matching the first matching result against the preset second text database comprises:
converting the words in the first matching result into first pinyin;
judging whether the preset second text database contains a second pinyin identical to the first pinyin;
when the preset second text database is determined to contain a second pinyin identical to the first pinyin, extracting the word corresponding to the second pinyin and using it as the word corresponding to the first pinyin.
Preferably, matching the second matching result against the preset third text database comprises:
judging whether the second matching result contains a third word that matches a word in the preset third text database;
when the second matching result is determined to contain a third word matching a word in the preset third text database, removing the matched third word from the second matching result.
Preferably, generating, from the edited speech recognition text, a speech recognition text in a non-editable state comprises:
when the received edit operation is a confirmation operation, directly generating the speech recognition text in the non-editable state;
when the received edit operation is a modification operation, receiving the user's manual modification and saving the modified content, and, when a confirmation operation is subsequently received, generating the speech recognition text in the non-editable state.
Preferably, the method further comprises:
storing, in association, the original word at each modified position and the user's new word;
in subsequent speech recognition, converting the conference speech to be recognized into text in accordance with the stored new words.
Preferably, the method further comprises:
storing in advance a plurality of forms of each word, the plurality of forms comprising simplified and traditional character forms, spaced forms, and near-homograph forms;
matching the initial speech recognition text against the plurality of forms of each word in the preset text databases.
The second aspect of the present invention provides an electronic device comprising a processor and a memory, the processor being configured to implement the method for recognizing conference speech as text when executing a computer program stored in the memory.
The third aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the method for recognizing conference speech as text.
According to the method, electronic device, and storage medium for recognizing conference speech as text of the present invention, the conference speech to be recognized is converted into text by speech recognition as an initial speech recognition text; the initial speech recognition text is matched against preset text databases to obtain a matched speech recognition text; a speech recognition text draft in an editable state is generated from the matched speech recognition text; and when it is detected that an edit operation has been received on the draft, a speech recognition text in a non-editable state is generated from the edited text as the final speech recognition text. After preliminary recognition of the speech to be recognized, the invention performs a first matching against the preset text databases and then a second, manual confirmation. This two-pass process effectively ensures the correctness of the output text, improves unreasonable wording in conventional speech-to-text output, effectively reduces the proofreading workload for conference content, and improves efficiency.
Detailed description of the invention
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below are only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for recognizing conference speech as text provided by Embodiment 1 of the present invention.
Fig. 2 is a functional block diagram of the apparatus for recognizing conference speech as text provided by Embodiment 2 of the present invention.
Fig. 3 is a schematic diagram of the electronic device provided by Embodiment 3 of the present invention.
The present invention is further described in detail below with reference to the above drawings and specific embodiments.
Specific embodiment
To make the objects, features, and advantages of the present invention easier to understand, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, in the absence of conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.
Numerous specific details are set forth in the following description to facilitate a full understanding of the present invention. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present invention. The terms used in this specification are intended only to describe specific embodiments and are not intended to limit the present invention.
The embodiments of the present invention apply the method for recognizing conference speech as text to one or more electronic devices. The method can also be applied to a hardware environment composed of an electronic device and a server connected to the electronic device through a network. The network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network. The method for recognizing conference speech as text in the embodiments of the present invention can be executed by the server, by the electronic device, or jointly by the server and the electronic device.
For an electronic device that needs the method for recognizing conference speech as text, the function for recognizing conference speech as text provided by the method of the present invention can be integrated directly on the electronic device, or a client implementing the method of the present invention can be installed on it. Alternatively, the method provided by the present invention can run on a device such as a server in the form of a Software Development Kit (SDK), which provides an interface for the conference-speech-to-text function; an electronic device or other device can then recognize conference speech as text through the provided interface.
Embodiment one
Fig. 1 is a flowchart of the method for recognizing conference speech as text provided by Embodiment 1 of the present invention. Depending on requirements, the order of the steps in the flowchart may change, and certain steps may be omitted.
S11: convert the conference speech to be recognized into text by speech recognition, as an initial speech recognition text.
In this embodiment, the detailed process of converting the conference speech to be recognized into text by speech recognition includes:
1) extracting audio features of the conference speech to be recognized and converting them into acoustic feature vectors of a preset length;
2) decoding the feature vectors into a word sequence according to a decoding algorithm;
3) obtaining the sub-words corresponding to the word sequence through HMM phoneme models, the sub-words being Chinese initials and finals;
4) splicing the sub-words into text according to a preset pronunciation dictionary;
5) decoding with language-model grammar rules to obtain the optimal sequence, thereby obtaining the text.
The decoding uses the Viterbi algorithm. For example, for the conference speech to be recognized "你好" ("hello"), feature extraction yields 39-dimensional acoustic feature vectors; the HMM phoneme models produce the sub-words /n/ /i/ /h/ /ao/; according to the preset pronunciation dictionary, the sub-words are spliced into candidate words such as 你 ("you") / 尼 and 好 ("good") / 号 ("number"); Viterbi decoding then selects the optimal sequence "你好" ("hello"), which is output as text. A toy sketch of this final decoding step is given below.
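The following is a minimal sketch of the language-model decoding step described above (step 5). The feature extraction and HMM acoustic scoring are not reproduced; the per-position word candidates, acoustic scores, and bigram probabilities are illustrative assumptions rather than the patent's actual models.

```python
import math

# Step 4 output (assumed): per-position candidate words with acoustic log scores.
# For the utterance "ni hao", the pronunciation dictionary offers homophones.
candidates = [
    [("你", -0.2), ("尼", -0.3)],   # from sub-words /n/ /i/
    [("好", -0.1), ("号", -0.4)],   # from sub-words /h/ /ao/
]

# Toy bigram language model: log P(next | prev); unseen pairs get a floor score.
bigram = {("你", "好"): -0.1, ("你", "号"): -2.0,
          ("尼", "好"): -1.5, ("尼", "号"): -2.5}
FLOOR = -5.0

def viterbi(candidates, bigram):
    """Step 5: pick the word sequence with the best combined score."""
    paths = [(score, [word]) for word, score in candidates[0]]
    for slot in candidates[1:]:
        new_paths = []
        for word, acoustic in slot:
            best = max(
                (prev_score + acoustic + bigram.get((seq[-1], word), FLOOR),
                 seq + [word])
                for prev_score, seq in paths
            )
            new_paths.append(best)
        paths = new_paths
    return max(paths)

score, words = viterbi(candidates, bigram)
print("".join(words))   # -> 你好 ("hello")
```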
S12: match the initial speech recognition text against the preset text databases to obtain a matched speech recognition text.
In this embodiment, at least three text databases may be preset, for example a first text database, a second text database, and a third text database. The first text database may be dedicated to storing filler words (modal particles), such as "uh", "um", or "right"; filler words are irrelevant to the conference content and easily harm the readability of the converted text. The second text database may be dedicated to storing technical terms and their corresponding pinyin, such as "feature vector", "feature matrix", or "tensor analysis"; technical terms are relatively complex and are therefore prone to recognition errors. The third text database may be dedicated to storing taboo or sensitive words, such as politically sensitive names, pornography/gambling/drugs, firearms and ammunition, insults and satire, superstition and cults, and illegal information; the appearance of sensitive words easily causes adverse effects. The present invention may also preset, according to the actual situation, a fourth text database or more, dedicated to storing names of people or places and the like. The number of preset text databases and their contents are not specifically limited here.
Matching the initial speech recognition text against the preset text databases specifically includes:
1) matching the initial speech recognition text against the preset first text database to obtain a first matching result;
2) matching the first matching result against the preset second text database to obtain a second matching result;
3) matching the second matching result against the preset third text database.
Three independently running threads may be preset: a first thread, a second thread, and a third thread. The first thread executes the program instructions that match the initial speech recognition text against the preset first text database to obtain the first matching result; the second thread executes the program instructions that match the first matching result against the preset second text database to obtain the second matching result; and the third thread executes the program instructions that match the second matching result against the preset third text database. After the first thread finishes, the second thread runs immediately; after the second thread finishes, the third thread runs immediately. Setting up three independently running threads, each executing different program instructions, helps improve matching speed and saves matching time.
In other embodiments, a single thread may instead be set, which sequentially executes the program instructions for matching the initial speech recognition text against the preset first text database to obtain the first matching result, matching the first matching result against the preset second text database to obtain the second matching result, and matching the second matching result against the preset third text database, as sketched below.
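A minimal single-thread sketch of chaining the three matching stages follows. The three databases are small in-memory collections standing in for the preset text databases, and all names and contents are illustrative assumptions, not the patent's implementation.

```python
FILLER_DB = {"um", "uh"}                    # assumed preset first text database
TERM_DB = {"juzhen": "matrix"}              # assumed second database: pinyin -> term
TABOO_DB = {"banned-term"}                  # assumed preset third text database

def stage1_remove_fillers(text: str) -> str:
    """First match: drop words found in the filler-word database."""
    return " ".join(w for w in text.split() if w not in FILLER_DB)

def stage2_fix_terms(text: str) -> str:
    """Second match: replace homophones of stored technical terms.
    A real system would convert each word to pinyin first; here the lookup
    key is assumed to be available directly."""
    return " ".join(TERM_DB.get(w, w) for w in text.split())

def stage3_remove_taboo(text: str) -> str:
    """Third match: strip taboo or sensitive words."""
    return " ".join(w for w in text.split() if w not in TABOO_DB)

def match_pipeline(initial_text: str) -> str:
    first_result = stage1_remove_fillers(initial_text)
    second_result = stage2_fix_terms(first_result)
    return stage3_remove_taboo(second_result)

print(match_pipeline("um this is an original juzhen"))
# -> "this is an original matrix"
```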
Specifically, matching the initial speech recognition text against the preset first text database includes: judging whether the initial speech recognition text contains a first word that matches a word in the preset first text database; and, when it does, processing the matched first word in the initial speech recognition text.
Preferably, processing the matched first word in the initial speech recognition text may further include: judging, with the pre-trained modal-particle model based on a deep learning network, whether the matched first word is a filler word to be deleted; when the matched first word is determined to be a filler word to be deleted, removing it from the initial speech recognition text; and, when the matched first word is determined not to be a filler word to be deleted, retaining it in the initial speech recognition text.
For example, assume the initial speech recognition text is "this one is pretty good" and the preset first text database stores the filler word "this" (这个). Matching the initial speech recognition text against the preset first text database finds the matching word "this"; the pre-trained modal-particle model based on a deep learning network then judges whether the matched first word is a filler word to be deleted. In "this one is pretty good" the model determines that the word is not a filler word to be deleted, so it is retained, and the resulting first matching result is "this one is pretty good".
As another example, assume the initial speech recognition text is "this, we are going to have a meeting" and the preset first text database stores the filler word "this". Matching finds the matching word "this", and the model determines that in this sentence it is a filler word to be deleted, so it is removed from the initial speech recognition text, and the resulting first matching result is "we are going to have a meeting". A minimal sketch of this stage follows.
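The sketch below illustrates this first matching stage, assuming the filler word is rendered in English as "well". The `is_filler` function is a trivial stand-in for the deep-learning modal-particle model; its rule is purely illustrative.

```python
FILLER_DB = {"well"}   # assumed first-text-database entry, rendered in English

def is_filler(sentence_words, index) -> bool:
    """Placeholder for the trained modal-particle model: here a word counts
    as a filler only when it opens the sentence and is followed by a comma."""
    word = sentence_words[index]
    return index == 0 and word.rstrip(",") in FILLER_DB and word.endswith(",")

def stage1(text: str) -> str:
    """Keep a matched word unless the model marks it as a filler to delete."""
    words = text.split()
    kept = [w for i, w in enumerate(words)
            if w.rstrip(",") not in FILLER_DB or not is_filler(words, i)]
    return " ".join(kept)

print(stage1("well, we are going to have a meeting"))  # filler removed
print(stage1("this one works well"))                    # "well" kept
```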
Preferably, the training method of the modal-particle model based on a deep learning network may include:
1) obtaining a large number of texts containing the words in the first text database;
2) dividing the texts into positive sample texts and negative sample texts, the positive sample texts being texts in which the filler word needs to be retained and the negative sample texts being texts in which the filler word needs to be deleted;
For example, if the word in the first text database is "this" (这个), a large number of texts containing "this" can be obtained, such as "this project is currently ongoing", "who is this person", "this, uh, is still being checked", and "this can be done like this". In "this project is currently ongoing" and "who is this person", "this" needs to be retained; in "this, uh, is still being checked" and "this can be done like this", "this" is a filler word that needs to be deleted.
3) labeling the positive sample texts with a first identifier and the negative sample texts with a second identifier;
The first identifier marks words in a sample that need to be retained and may be, for example, "1". The second identifier marks words in a sample that need to be deleted and may be, for example, "0".
4) feeding the positive sample texts into the deep learning network for training, and judging whether the similarity between the output text and the input positive sample is greater than a preset similarity threshold; if it is, the training of the modal-particle model based on the deep learning network ends.
The similarity between the output text and the input positive sample can be calculated by template matching; template matching is prior art and is not described in detail here. A toy sketch of this training setup is given below.
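The following toy sketch shows a keep/delete training setup of this kind, with a small hand-rolled logistic regression standing in for the deep-learning network and the template-matching check. The sample texts, features, and label convention (1 = keep, 0 = delete) are assumptions for illustration only.

```python
import math

samples = [  # (text, label): 1 = keep the word "well", 0 = delete it
    ("the project is going well",      1),
    ("he plays the piano well",        1),
    ("well, we still have to check",   0),
    ("well, let us start the meeting", 0),
]

def features(text: str):
    """Two simple context features plus a bias term."""
    words = text.split()
    starts_sentence = 1.0 if words[0].rstrip(",") == "well" else 0.0
    followed_by_comma = 1.0 if any(w == "well," for w in words) else 0.0
    return [1.0, starts_sentence, followed_by_comma]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=500, lr=0.5):
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for text, y in samples:
            x = features(text)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

weights = train(samples)
test = features("well, one more point")
keep_prob = sigmoid(sum(wi * xi for wi, xi in zip(weights, test)))
print("keep" if keep_prob > 0.5 else "delete")   # -> delete
```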
Specifically, matching the first matching result against the preset second text database includes:
1) converting the words in the first matching result into first pinyin;
2) judging whether the preset second text database contains a second pinyin identical to the first pinyin;
3) when the preset second text database is determined to contain a second pinyin identical to the first pinyin, extracting the word corresponding to the second pinyin and using it as the word corresponding to the first pinyin.
For example, assume the first matching result is "这是一个原始巨震" ("this is an original ju-zhen", with the technical term mis-recognized as a homophone). Converting the words of the first matching result into first pinyin gives "zhe shi yige yuanshi juzhen". The preset second text database stores the technical term "矩阵" ("matrix") and its corresponding second pinyin "juzhen"; since the preset second text database contains a second pinyin identical to the first pinyin, the word "矩阵" corresponding to the second pinyin "juzhen" is extracted and used as the word corresponding to the first pinyin, and the resulting second matching result is "这是一个原始矩阵" ("this is an original matrix"). A minimal sketch follows.
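The sketch below illustrates this pinyin-based correction with a tiny character-to-pinyin table; a real system would use a full pinyin converter (for example the pypinyin package) and the second text database in place of the small dicts assumed here.

```python
PINYIN = {"巨": "ju", "震": "zhen", "矩": "ju", "阵": "zhen"}   # assumed char -> pinyin table
TERM_DB = {"juzhen": "矩阵"}                                    # second database: pinyin -> stored term

def to_pinyin(word: str) -> str:
    return "".join(PINYIN.get(ch, ch) for ch in word)

def stage2(words):
    """Replace any word whose pinyin matches a stored technical term."""
    return [TERM_DB.get(to_pinyin(w), w) for w in words]

print(stage2(["这是", "一个", "原始", "巨震"]))
# -> ['这是', '一个', '原始', '矩阵']   (homophone corrected to "matrix")
```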
Specifically, matching the second matching result against the preset third text database includes: judging whether the second matching result contains a third word that matches a word in the preset third text database; and, when the second matching result is determined to contain a third word matching a word in the preset third text database, removing the matched third word from the second matching result, as in the sketch below.
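A minimal sketch of this third matching stage; the database contents are placeholders.

```python
TABOO_DB = {"forbidden-term", "banned-name"}   # placeholder third text database

def stage3(words):
    """Drop any word that appears in the taboo / sensitive-word database."""
    return [w for w in words if w not in TABOO_DB]

print(stage3(["the", "report", "mentions", "banned-name", "twice"]))
# -> ['the', 'report', 'mentions', 'twice']
```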
S13: generate, from the matched speech recognition text, a speech recognition text draft in an editable state.
After the initial speech recognition text has been automatically revised according to the matching results to obtain the matched speech recognition text, a speech recognition text draft in an editable state may first be generated. The editable state means that the user can perform edit operations on the generated speech recognition text draft. The edit operations may include a confirmation operation and a modification operation.
The confirmation operation means that the user confirms that the speech recognition text draft is correct, i.e., the revised speech recognition text needs no further modification. The modification operation means that the user finds errors in the speech recognition text draft and individual words or a small number of words need to be adjusted, i.e., the revised speech recognition text still needs further manual modification.
S14: when it is detected that an edit operation has been received on the speech recognition text draft, generate, from the edited speech recognition text, a speech recognition text in a non-editable state as the final speech recognition text.
Whether an edit operation has been received on the speech recognition text draft can be detected as follows: detect whether a touch operation has been received on a preset button on the speech recognition text draft; when a touch operation on the preset button is detected, it is considered that an edit operation has been received on the speech recognition text draft; when no touch operation on the preset button is detected, it is considered that no edit operation has been received on the speech recognition text draft.
The preset button may be a confirm button or a modify button. The button may be a virtual icon or a physical button. The edit operation corresponding to the confirm button is the confirmation operation, and the edit operation corresponding to the modify button is the modification operation.
Generating, from the edited speech recognition text, a speech recognition text in a non-editable state includes:
when the received edit operation is a confirmation operation, directly generating the speech recognition text in the non-editable state;
when the received edit operation is a modification operation, receiving the user's manual modification and saving the modified content, and, when a confirmation operation is subsequently received, generating the speech recognition text in the non-editable state. A minimal sketch of this flow follows.
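A minimal sketch of the confirm/modify flow, using an in-memory draft object; the class and method names are illustrative assumptions, not the patent's implementation.

```python
class RecognitionDraft:
    """Editable draft that becomes a final, non-editable text on confirmation."""

    def __init__(self, text: str):
        self.text = text
        self.editable = True

    def modify(self, new_text: str):
        if not self.editable:
            raise RuntimeError("final text can no longer be edited")
        self.text = new_text            # save the user's manual correction

    def confirm(self) -> str:
        self.editable = False           # freeze into the final speech recognition text
        return self.text

draft = RecognitionDraft("we will have a meting")
draft.modify("we will have a meeting")   # user fixes a recognition error
final_text = draft.confirm()
print(final_text, draft.editable)        # -> we will have a meeting False
```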
Preferably, after the edit operation received from the user is a modification operation, the method may further include: storing, in association, the original word at the modified position and the user's new word; and, in subsequent speech recognition, converting the conference speech to be recognized into text in accordance with the stored new words.
Recording each of the user's modifications and storing the original word at each modified position in association with the user's new word helps later recognition runs use the user's new words directly, thereby reducing the recognition error rate, improving recognition accuracy, and in particular reducing the user's modification burden. A sketch of such a correction store is given below.
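A minimal sketch of such a correction store, using an in-memory dict; a real system would persist the associations, and all names here are illustrative.

```python
corrections = {}   # original word -> user's replacement word

def record_correction(original: str, replacement: str):
    corrections[original] = replacement

def apply_corrections(words):
    """Applied to later recognition output before it is shown to the user."""
    return [corrections.get(w, w) for w in words]

record_correction("meting", "meeting")
print(apply_corrections(["next", "meting", "is", "on", "friday"]))
# -> ['next', 'meeting', 'is', 'on', 'friday']
```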
Preferably, with respect to the preset text databases, the method may further include storing in advance a plurality of forms of each word; the plurality of forms may include, but are not limited to, simplified and traditional character forms, spaced forms, and near-homograph forms. Matching the initial speech recognition text against the preset text databases may further include: matching, according to the environment in which the meeting takes place, the initial speech recognition text against the plurality of forms of each word in the preset text databases, to obtain a speech recognition text that suits the meeting environment.
The meeting environment may include the meeting participants and the place where the meeting is held.
By storing the different forms of each word and matching the initial speech recognition text against every stored form according to the meeting environment, failures to recognize filler words, taboo words, or sensitive words that contain spaces can be avoided. The matching can also adapt to different occasions: for example, when the meeting participants are from Taiwan, where traditional characters are customary, the initial speech recognition text may contain traditional and/or simplified characters, so both the simplified and traditional forms of each word need to be matched to obtain a speech recognition text in the traditional characters the participants are used to. As another example, when the meeting is held on the mainland, matching the initial speech recognition text against every stored form of each word according to mainland habits yields a speech recognition text in the simplified characters mainland participants are used to. A minimal sketch follows.
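A minimal sketch of form-variant matching, assuming a tiny table of simplified/traditional variants; the table and locale keys are illustrative, not a real lexicon.

```python
VARIANTS = {
    "数据库": {"simplified": "数据库", "traditional": "資料庫"},
    "软件":   {"simplified": "软件",   "traditional": "軟體"},
}
# Reverse index: any stored form -> canonical key.
FORM_INDEX = {form: key for key, forms in VARIANTS.items() for form in forms.values()}

def normalise(words, locale: str):
    """Map every recognised word to the stored form for the meeting's locale."""
    return [VARIANTS[FORM_INDEX[w]][locale] if w in FORM_INDEX else w
            for w in words]

print(normalise(["会议", "数据库", "软件"], "traditional"))
# -> ['会议', '資料庫', '軟體']   ("会议" has no stored variants and passes through)
```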
According to the method for recognizing conference speech as text of the present invention, the conference speech to be recognized is converted into text by speech recognition as an initial speech recognition text; the initial speech recognition text is matched against preset text databases to obtain a matched speech recognition text; a speech recognition text draft in an editable state is generated from the matched speech recognition text; and when it is detected that an edit operation has been received on the draft, a speech recognition text in a non-editable state is generated from the edited text as the final speech recognition text. After the conference speech has been converted into text by ASR, the stored dictionary contents are used to search the text and to perform the corresponding replacement and deletion operations. Because there is a small probability that replacement or deletion turns originally correct text into erroneous text, the system presents the modified text to the user again in a modifiable form, marking the modified places for the user to confirm, so that the user can correct any place where the system operated wrongly. In other words, after preliminary recognition of the speech to be recognized, a first matching against the preset text databases is performed, followed by a second, manual confirmation. This two-pass process effectively ensures the correctness of the output text, improves unreasonable wording in conventional speech-to-text output, effectively reduces the proofreading workload for conference content, and improves efficiency.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Those skilled in the art can make improvements without departing from the concept of the present invention, and such improvements all fall within the protection scope of the present invention.
With reference to Figs. 2 and 3, the functional modules and the hardware structure of the electronic device that implements the above method for recognizing conference speech as text are introduced below.
Embodiment two
Fig. 2 is a functional block diagram of a preferred embodiment of the apparatus for recognizing conference speech as text of the present invention.
In some embodiments, the apparatus 20 for recognizing conference speech as text runs in an electronic device. The apparatus 20 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the apparatus 20 can be stored in a memory and executed by at least one processor, so as to perform the function of recognizing conference speech as text (see Fig. 1 and its description for details).
In this embodiment, the apparatus 20 for recognizing conference speech as text can be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an identification module 201, a matching module 202, a generation module 203, a detecting module 204, a relating module 205, and a setup module 206. A module in the present invention refers to a series of computer program segments that can be executed by at least one processor, can complete a fixed function, and are stored in the memory. The functions of the modules are described in detail in the following embodiments.
The identification module 201 is configured to convert the conference speech to be recognized into text by speech recognition, as an initial speech recognition text.
In this embodiment, the detailed process by which the identification module 201 converts the conference speech to be recognized into text by speech recognition includes:
1) extracting audio features of the conference speech to be recognized and converting them into acoustic feature vectors of a preset length;
2) decoding the feature vectors into a word sequence according to a decoding algorithm;
3) obtaining the sub-words corresponding to the word sequence through HMM phoneme models, the sub-words being Chinese initials and finals;
4) splicing the sub-words into text according to a preset pronunciation dictionary;
5) decoding with language-model grammar rules to obtain the optimal sequence, thereby obtaining the text.
The decoding uses the Viterbi algorithm. For example, for the conference speech to be recognized "你好" ("hello"), feature extraction yields 39-dimensional acoustic feature vectors; the HMM phoneme models produce the sub-words /n/ /i/ /h/ /ao/; according to the preset pronunciation dictionary, the sub-words are spliced into candidate words such as 你 ("you") / 尼 and 好 ("good") / 号 ("number"); Viterbi decoding then selects the optimal sequence "你好" ("hello"), which is output as text.
The matching module 202 is configured to match the initial speech recognition text against the preset text databases to obtain a matched speech recognition text.
In this embodiment, at least three text databases may be preset, for example a first text database, a second text database, and a third text database. The first text database may be dedicated to storing filler words, such as "uh", "um", or "right"; filler words are irrelevant to the conference content and easily harm the readability of the converted text. The second text database may be dedicated to storing technical terms, such as "feature vector", "feature matrix", or "tensor analysis"; technical terms are relatively complex and are therefore prone to recognition errors. The third text database may be dedicated to storing taboo or sensitive words, such as politically sensitive names, pornography/gambling/drugs, firearms and ammunition, insults and satire, superstition and cults, and illegal information; the appearance of taboo or sensitive words easily causes adverse effects. The present invention may also preset, according to the actual situation, a fourth text database or more, dedicated to storing names of people or places and the like. The number of preset text databases and their contents are not specifically limited here.
The matching performed by the matching module 202 on the initial speech recognition text against the preset text databases specifically includes:
1) matching the initial speech recognition text against the preset first text database to obtain a first matching result;
2) matching the first matching result against the preset second text database to obtain a second matching result;
3) matching the second matching result against the preset third text database to obtain a third matching result.
Three independently running threads may be preset: a first thread, a second thread, and a third thread. The first thread executes the program instructions that match the initial speech recognition text against the preset first text database to obtain the first matching result; the second thread executes the program instructions that match the first matching result against the preset second text database to obtain the second matching result; and the third thread executes the program instructions that match the second matching result against the preset third text database. After the first thread finishes, the second thread runs immediately; after the second thread finishes, the third thread runs immediately. Setting up three independently running threads, each executing different program instructions, helps improve matching speed and saves matching time.
In other embodiments, a single thread may instead be set, which sequentially executes the program instructions for matching the initial speech recognition text against the preset first text database to obtain the first matching result, matching the first matching result against the preset second text database to obtain the second matching result, and matching the second matching result against the preset third text database.
The matching module 202 may further include: a first matching sub-module 2020, a second matching sub-module 2022, and a third matching sub-module 2024.
Specifically, the first matching sub-module 2020 matches the initial speech recognition text against the preset first text database by: judging whether the initial speech recognition text contains a first word that matches a word in the preset first text database; and, when the initial speech recognition text is determined to contain such a first word, processing the matched first word in the initial speech recognition text.
Preferably, the first matching sub-module 2020 is further configured to: judge, with the pre-trained modal-particle model based on a deep learning network, whether the matched first word is a filler word to be deleted; when the matched first word is determined to be a filler word to be deleted, remove it from the initial speech recognition text; and, when the matched first word is determined not to be a filler word to be deleted, retain it in the initial speech recognition text.
For example, assume the initial speech recognition text is "this one is pretty good" and the preset first text database stores the filler word "this" (这个). Matching finds the matching word "this", and the pre-trained modal-particle model based on a deep learning network judges whether it is a filler word to be deleted. In "this one is pretty good" the model determines that the word is not a filler word to be deleted, so it is retained, and the resulting first matching result is "this one is pretty good".
As another example, assume the initial speech recognition text is "this, we are going to have a meeting" and the preset first text database stores the filler word "this". Matching finds the matching word "this", and the model determines that in this sentence it is a filler word to be deleted, so it is removed, and the resulting first matching result is "we are going to have a meeting".
Preferably, the process by which the first matching sub-module 2020 trains the modal-particle model based on a deep learning network may include:
1) obtaining a large number of texts containing the words in the first text database;
2) dividing the texts into positive sample texts and negative sample texts, the positive sample texts being texts in which the filler word needs to be retained and the negative sample texts being texts in which the filler word needs to be deleted;
For example, if the word in the first text database is "this" (这个), a large number of texts containing "this" can be obtained, such as "this project is currently ongoing", "who is this person", "this, uh, is still being checked", and "this can be done like this". In "this project is currently ongoing" and "who is this person", "this" needs to be retained; in "this, uh, is still being checked" and "this can be done like this", "this" is a filler word that needs to be deleted.
3) labeling the positive sample texts with a first identifier and the negative sample texts with a second identifier;
The first identifier marks words in a sample that need to be retained and may be, for example, "1". The second identifier marks words in a sample that need to be deleted and may be, for example, "0".
4) feeding the positive sample texts into the deep learning network for training, and judging whether the similarity between the output text and the input positive sample is greater than a preset similarity threshold; if it is, the training of the modal-particle model based on the deep learning network ends.
The similarity between the output text and the input positive sample can be calculated by template matching; template matching is prior art and is not described in detail here.
Specifically, the second matching sub-module 2022 matches the first matching result against the preset second text database by: converting the words in the first matching result into first pinyin; judging whether the preset second text database contains a second pinyin identical to the first pinyin; and, when it does, extracting the word corresponding to the second pinyin and using it as the word corresponding to the first pinyin.
For example, assume the first matching result is "这是一个原始巨震" ("this is an original ju-zhen", with the technical term mis-recognized as a homophone). Converting the words of the first matching result into first pinyin gives "zhe shi yige yuanshi juzhen". The preset second text database stores the technical term "矩阵" ("matrix") and its corresponding second pinyin "juzhen", so the word "矩阵" corresponding to the second pinyin "juzhen" is extracted and used as the word corresponding to the first pinyin, and the resulting second matching result is "这是一个原始矩阵" ("this is an original matrix").
Specifically, the third matching sub-module 2024 matches the second matching result against the preset third text database by: judging whether the second matching result contains a third word that matches a word in the preset third text database; and, when it does, removing the matched third word from the second matching result.
The generation module 203 is configured to generate, from the matched speech recognition text, a speech recognition text draft in an editable state.
After the initial speech recognition text has been automatically revised according to the matching results to obtain the matched speech recognition text, a speech recognition text draft in an editable state may first be generated. The editable state means that the user can perform edit operations on the generated speech recognition text draft. The edit operations may include a confirmation operation and a modification operation.
The confirmation operation means that the user confirms that the speech recognition text draft is correct, i.e., the revised speech recognition text needs no further modification. The modification operation means that the user finds errors in the speech recognition text draft and individual words or a small number of words need to be adjusted, i.e., the revised speech recognition text still needs further manual modification.
The detecting module 204 is configured to detect whether an edit operation has been received on the speech recognition text draft.
The detecting module 204 detects whether an edit operation has been received on the speech recognition text draft by: detecting whether a touch operation has been received on a preset button on the speech recognition text draft; when a touch operation on the preset button is detected, it is considered that an edit operation has been received on the speech recognition text draft; when no touch operation on the preset button is detected, it is considered that no edit operation has been received on the speech recognition text draft.
The preset button may be a confirm button or a modify button. The button may be a virtual icon or a physical button. The edit operation corresponding to the confirm button is the confirmation operation, and the edit operation corresponding to the modify button is the modification operation.
The generation module 203 is further configured to, when the detecting module detects that an edit operation has been received on the speech recognition text draft, generate, from the edited speech recognition text, a speech recognition text in a non-editable state as the final speech recognition text.
The generation module 203 generates the speech recognition text in the non-editable state from the edited speech recognition text as follows:
when the received edit operation is a confirmation operation, directly generating the speech recognition text in the non-editable state;
when the received edit operation is a modification operation, receiving the user's manual modification and saving the modified content, and, when a confirmation operation is subsequently received, generating the speech recognition text in the non-editable state.
Preferably, after the edit operation received from the user is a modification operation, the apparatus 20 for recognizing conference speech as text may further include: a relating module 205 configured to store, in association, the original word at the modified position and the user's new word. The identification module 201 is further configured to, in subsequent speech recognition, convert the conference speech to be recognized into text in accordance with the stored new words.
Recording each of the user's modifications and storing the original word at each modified position in association with the user's new word helps later recognition runs use the user's new words directly, thereby reducing the recognition error rate, improving recognition accuracy, and in particular reducing the user's modification burden.
Preferably, with respect to the preset text databases, the apparatus 20 for recognizing conference speech as text may further include: a setup module 206 configured to store in advance a plurality of forms of each word; the plurality of forms may include, but are not limited to, simplified and traditional character forms, spaced forms, and near-homograph forms.
Matching the initial speech recognition text against the preset text databases may further include: matching, according to the environment in which the meeting takes place, the initial speech recognition text against the plurality of forms of each word in the preset text databases, to obtain a speech recognition text that suits the meeting environment.
The meeting environment may include the meeting participants and the place where the meeting is held.
By storing the different forms of each word and matching the initial speech recognition text against every stored form according to the meeting environment, failures to recognize filler words, taboo words, or sensitive words that contain spaces can be avoided. The matching can also adapt to different occasions: for example, when the meeting participants are from Taiwan, where traditional characters are customary, the initial speech recognition text may contain traditional and/or simplified characters, so both the simplified and traditional forms of each word need to be matched to obtain a speech recognition text in the traditional characters the participants are used to. As another example, when the meeting is held on the mainland, matching the initial speech recognition text against every stored form of each word according to mainland habits yields a speech recognition text in the simplified characters mainland participants are used to.
According to the apparatus for recognizing conference speech as text of the present invention, the conference speech to be recognized is converted into text by speech recognition as an initial speech recognition text; the initial speech recognition text is matched against preset text databases to obtain a matched speech recognition text; a speech recognition text draft in an editable state is generated from the matched speech recognition text; and when it is detected that an edit operation has been received on the draft, a speech recognition text in a non-editable state is generated from the edited text as the final speech recognition text. After the conference speech has been converted into text by ASR, the stored dictionary contents are used to search the text and to perform the corresponding replacement and deletion operations. Because there is a small probability that replacement or deletion turns originally correct text into erroneous text, the system presents the modified text to the user again in a modifiable form, marking the modified places for the user to confirm, so that the user can correct any place where the system operated wrongly. In other words, after preliminary recognition of the speech to be recognized, a first matching against the preset text databases is performed, followed by a second, manual confirmation. This two-pass process effectively ensures the correctness of the output text, improves unreasonable wording in conventional speech-to-text output, effectively reduces the proofreading workload for conference content, and improves efficiency.
The above integrated units implemented in the form of software function modules can be stored in a computer-readable storage medium. The software function modules are stored in a storage medium and include instructions that cause a computer device (which may be a personal computer, a dual-screen device, a network device, or the like) or a processor to execute parts of the methods of the embodiments of the present invention.
Embodiment three
Fig. 3 is a schematic diagram of the electronic device provided by Embodiment three of the present invention.
The electronic device 3 includes a memory 31, at least one processor 32, a computer program 33 stored in the memory 31 and executable on the at least one processor 32, and at least one communication bus 34.
The at least one processor 32, when executing the computer program 33, implements the steps in the above embodiments of the method for recognizing conference speech as text.
Illustratively, the computer program 33 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 31 and executed by the at least one processor 32 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 33 in the electronic device 3.
The electronic device 3 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that Fig. 3 is merely an example of the electronic device 3 and does not constitute a limitation on the electronic device 3; it may include more or fewer components than shown, or a combination of certain components, or different components. For example, the electronic device 3 may further include input and output devices, network access devices, buses, and the like.
The at least one processor 32 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 32 may be a microprocessor, or the processor 32 may be any conventional processor; the processor 32 is the control center of the electronic device 3 and connects the various parts of the entire electronic device 3 through various interfaces and lines.
The memory 31 may be used to store the computer program 33 and/or the modules/units. The processor 32 implements the various functions of the electronic device 3 by running or executing the computer program and/or modules/units stored in the memory 31 and by calling data stored in the memory 31. The memory 31 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (for example, a sound playing function, an image playing function, and the like), and the data storage area may store data created according to the use of the electronic device 3 (such as audio data, a phone book, and the like). In addition, the memory 31 may include a high-speed random access memory and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
If the integrated modules/units of the electronic device 3 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the processes of the above method embodiments by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above method embodiments can be implemented. The computer program includes computer program code, and the computer program code may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or recording medium capable of carrying the computer program code, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed electronic device and method may be implemented in other ways. For example, the electronic device embodiment described above is only schematic; for example, the division of the units is only a logical function division, and there may be other division manners in actual implementation.
In addition, each functional unit in each embodiment of the present invention may be integrated in the same processing unit, or each unit may exist alone physically, or two or more units may be integrated in the same unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software function module.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from the spirit or essential attributes of the present invention. Therefore, from whatever point of view, the embodiments should be regarded as exemplary and non-restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; it is therefore intended that all changes falling within the meaning and scope of equivalency of the claims are included in the present invention. Any reference sign in the claims should not be construed as limiting the claim involved. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices stated in the system claims may also be implemented by one unit or device through software or hardware. Words such as "first" and "second" are used to indicate names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention can be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for recognizing conference speech as text, characterized in that the method comprises:
converting the conference speech to be recognized into text through speech recognition technology, as an initial speech recognition text;
matching the initial speech recognition text with a preset text database to obtain a matched speech recognition text;
generating a speech recognition text draft in an editable state according to the matched speech recognition text;
when it is detected that an editing operation has been received on the speech recognition text draft, generating a speech recognition text in a non-editable state according to the edited speech recognition text, as a final speech recognition text.
2. The method according to claim 1, characterized in that matching the initial speech recognition text with the preset text database comprises:
matching the initial speech recognition text with a preset first text database to obtain a first matching result;
matching the first matching result with a preset second text database to obtain a second matching result;
matching the second matching result with a preset third text database;
wherein a plurality of modal particles are stored in the preset first text database, a plurality of professional words and their corresponding pinyin are stored in the preset second text database, and a plurality of taboo and sensitive words are stored in the preset third text database.
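Purely as an illustration of the three-stage chain in claim 2, the sketch below models the three preset databases as plain in-memory collections; all names are placeholders, the first pass would in practice consult the trained model of claim 3, and the second pass would compare pinyin as in claim 4 (both sketched separately below).

MODAL_PARTICLES = {"嗯", "啊", "那个"}        # preset first text database: modal particles
PROFESSIONAL_FIXES = {"收益绿": "收益率"}      # preset second text database, keyed here directly by the misheard word
TABOO_WORDS = {"内部代号"}                     # preset third text database: taboo / sensitive words

def match_with_databases(tokens: list[str]) -> list[str]:
    first = [t for t in tokens if t not in MODAL_PARTICLES]   # first matching result
    second = [PROFESSIONAL_FIXES.get(t, t) for t in first]    # second matching result
    return [t for t in second if t not in TABOO_WORDS]        # final matched text

print(match_with_databases(["嗯", "今年", "收益绿", "上升"]))  # -> ['今年', '收益率', '上升']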
3. The method according to claim 2, characterized in that matching the initial speech recognition text with the preset first text database comprises:
judging whether there is a first word in the initial speech recognition text that matches a word in the preset first text database;
when it is determined that there is a first word in the initial speech recognition text that matches a word in the preset first text database, judging, according to a pre-trained modal particle model based on a deep learning network, whether the matched first word is a modal particle to be deleted;
when it is determined that the matched first word is a modal particle to be deleted, removing the matched first word from the initial speech recognition text;
when it is determined that the matched first word is not a modal particle to be deleted, retaining the matched first word in the initial speech recognition text.
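The following sketch mirrors the control flow of claim 3; the is_deletable_modal_particle stub merely stands in for the pre-trained deep-learning modal particle model, which is not reproduced here, and the fixed filler set exists only so the example runs.

def is_deletable_modal_particle(word: str, context: str) -> bool:
    """Stand-in for the pre-trained deep-learning modal-particle model.
    A real system would run the trained classifier on the word in its context;
    here a fixed set of fillers is used purely for illustration."""
    return word in {"嗯", "啊", "呃"}

def filter_with_first_database(tokens: list[str], first_db: set[str]) -> list[str]:
    kept = []
    for i, word in enumerate(tokens):
        context = "".join(tokens[max(0, i - 2): i + 3])          # a small window around the word
        if word in first_db and is_deletable_modal_particle(word, context):
            continue                                              # reject the matched first word
        kept.append(word)                                         # otherwise retain it
    return kept

print(filter_with_first_database(["嗯", "方案", "啊", "通过"], {"嗯", "啊"}))  # -> ['方案', '通过']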
4. The method according to claim 2, characterized in that matching the first matching result with the preset second text database comprises:
converting a word in the first matching result into a first pinyin;
judging whether there is a second pinyin in the preset second text database that is identical to the first pinyin;
when it is determined that there is a second pinyin in the preset second text database identical to the first pinyin, extracting the word corresponding to the second pinyin as the word corresponding to the first pinyin.
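A minimal prototype of the pinyin comparison in claim 4 could use the third-party pypinyin package, as below; the word list and function names are assumptions, and a production system would presumably rely on the invention's own pronunciation dictionary rather than pypinyin.

from pypinyin import lazy_pinyin   # third-party package (pip install pypinyin), used here only for illustration

PROFESSIONAL_WORDS = ["收益率", "保额"]                                 # preset second text database
SECOND_DB = {" ".join(lazy_pinyin(w)): w for w in PROFESSIONAL_WORDS}   # second pinyin -> professional word

def correct_by_pinyin(word: str) -> str:
    """Convert the word to its first pinyin; if an identical second pinyin exists
    in the database, extract and return the corresponding professional word."""
    first_pinyin = " ".join(lazy_pinyin(word))
    return SECOND_DB.get(first_pinyin, word)

print(correct_by_pinyin("收益绿"))  # a homophone misrecognition should map back to 收益率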
5. The method according to claim 2, characterized in that matching the second matching result with the preset third text database comprises:
judging whether there is a third word in the second matching result that matches a word in the preset third text database;
when it is determined that there is a third word in the second matching result that matches a word in the preset third text database, removing the matched third word from the second matching result.
6. The method according to claim 1, characterized in that generating the speech recognition text in a non-editable state according to the edited speech recognition text comprises:
when the received editing operation is a confirmation operation, directly generating a speech recognition text in a non-editable state;
when the received editing operation is a modification operation, receiving the user's manual modification and saving the new modified content, and when a confirmation operation is received again, generating a speech recognition text in a non-editable state.
7. The method according to any one of claims 1 to 6, characterized in that the method further comprises:
storing, in association, the original word at the modified place and the new word modified by the user;
subsequently converting the conference speech to be recognized into text through speech recognition technology according to the modified new word.
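Claim 7's associated storage can be sketched as a simple correction memory that is consulted on later recognitions; the dictionary-based store and function names below are assumptions made for illustration only.

CORRECTION_STORE: dict[str, str] = {}   # original word -> user-modified new word, stored in association

def remember_correction(original_word: str, new_word: str) -> None:
    """Store the original word at the modified place together with the user's new word."""
    CORRECTION_STORE[original_word] = new_word

def apply_learned_corrections(tokens: list[str]) -> list[str]:
    """On later recognitions, convert the text according to the remembered new words."""
    return [CORRECTION_STORE.get(t, t) for t in tokens]

remember_correction("收益绿", "收益率")
print(apply_learned_corrections(["下季度", "收益绿", "预测"]))  # -> ['下季度', '收益率', '预测']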
8. The method according to any one of claims 1 to 6, characterized in that the method further comprises:
storing in advance multiple forms corresponding to each word, the multiple forms including: simplified and traditional character forms, forms with added spaces, and visually similar (near-form) words;
matching the initial speech recognition text with the multiple forms corresponding to each word in the preset text database.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to implement, when executing a computer program stored in the memory, the method for recognizing conference speech as text according to any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the method for recognizing conference speech as text according to any one of claims 1 to 8 is implemented.
CN201810581922.4A 2018-06-07 2018-06-07 Method for recognizing conference voice as text, electronic device and storage medium Active CN108847241B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810581922.4A CN108847241B (en) 2018-06-07 2018-06-07 Method for recognizing conference voice as text, electronic device and storage medium
PCT/CN2018/108113 WO2019232991A1 (en) 2018-06-07 2018-09-27 Method for recognizing conference voice as text, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810581922.4A CN108847241B (en) 2018-06-07 2018-06-07 Method for recognizing conference voice as text, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN108847241A true CN108847241A (en) 2018-11-20
CN108847241B CN108847241B (en) 2022-09-13

Family

ID=64211364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810581922.4A Active CN108847241B (en) 2018-06-07 2018-06-07 Method for recognizing conference voice as text, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN108847241B (en)
WO (1) WO2019232991A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114666618B (en) * 2022-03-15 2023-10-13 广州欢城文化传媒有限公司 Audio auditing method, device, equipment and readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7668718B2 (en) * 2001-07-17 2010-02-23 Custom Speech Usa, Inc. Synchronized pattern recognition source data processed by manual or automatic means for creation of shared speaker-dependent speech user profile
CN103902629B (en) * 2012-12-28 2017-09-29 联想(北京)有限公司 The electronic equipment and method for operating help are provided using voice
CN105976818B (en) * 2016-04-26 2020-12-25 Tcl科技集团股份有限公司 Instruction recognition processing method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120143605A1 (en) * 2010-12-01 2012-06-07 Cisco Technology, Inc. Conference transcription based on conference data
CN102956231A (en) * 2011-08-23 2013-03-06 上海交通大学 Voice key information recording device and method based on semi-automatic correction
CN104679729A (en) * 2015-02-13 2015-06-03 广州市讯飞樽鸿信息技术有限公司 Recorded message effective processing method and system
CN105206272A (en) * 2015-09-06 2015-12-30 上海智臻智能网络科技股份有限公司 Voice transmission control method and system
CN105206274A (en) * 2015-10-30 2015-12-30 北京奇艺世纪科技有限公司 Voice recognition post-processing method and device as well as voice recognition system
CN107590121A (en) * 2016-07-08 2018-01-16 科大讯飞股份有限公司 Text-normalization method and system

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109708256A (en) * 2018-12-06 2019-05-03 珠海格力电器股份有限公司 Voice determination method and device, storage medium and air conditioner
CN109708256B (en) * 2018-12-06 2020-07-03 珠海格力电器股份有限公司 Voice determination method and device, storage medium and air conditioner
CN111368506B (en) * 2018-12-24 2023-04-28 阿里巴巴集团控股有限公司 Text processing method and device
CN110335509A (en) * 2019-07-09 2019-10-15 南阳理工学院 A kind of primary school teaching apparatus for demonstrating
CN110459224A (en) * 2019-07-31 2019-11-15 北京百度网讯科技有限公司 Speech recognition result processing method, device, computer equipment and storage medium
CN110459224B (en) * 2019-07-31 2022-02-25 北京百度网讯科技有限公司 Speech recognition result processing method and device, computer equipment and storage medium
CN110619879A (en) * 2019-08-29 2019-12-27 深圳市梦网科技发展有限公司 Voice recognition method and device
CN110969026A (en) * 2019-11-27 2020-04-07 北京欧珀通信有限公司 Translation output method and device, electronic equipment and storage medium
US11303464B2 (en) * 2019-12-05 2022-04-12 Microsoft Technology Licensing, Llc Associating content items with images captured of meeting content
CN111177353A (en) * 2019-12-27 2020-05-19 拉克诺德(深圳)科技有限公司 Text record generation method and device, computer equipment and storage medium
CN111651960A (en) * 2020-06-01 2020-09-11 杭州尚尚签网络科技有限公司 Optical character joint training and recognition method for moving from contract simplified form to traditional form
CN111651960B (en) * 2020-06-01 2023-05-30 杭州尚尚签网络科技有限公司 Optical character joint training and recognition method for transferring contract simplified body to complex body
CN111710328A (en) * 2020-06-16 2020-09-25 北京爱医声科技有限公司 Method, device and medium for selecting training samples of voice recognition model
CN111710328B (en) * 2020-06-16 2024-01-12 北京爱医声科技有限公司 Training sample selection method, device and medium for speech recognition model
WO2022037600A1 (en) * 2020-08-18 2022-02-24 深圳市万普拉斯科技有限公司 Abstract recording method and apparatus, and computer device and storage medium
CN112037792A (en) * 2020-08-20 2020-12-04 北京字节跳动网络技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112468665A (en) * 2020-11-05 2021-03-09 中国建设银行股份有限公司 Method, device, equipment and storage medium for generating conference summary
CN112951210A (en) * 2021-02-02 2021-06-11 虫洞创新平台(深圳)有限公司 Speech recognition method and device, equipment and computer readable storage medium
CN113360623A (en) * 2021-06-25 2021-09-07 达闼机器人有限公司 Text matching method, electronic device and readable storage medium
CN113470644B (en) * 2021-06-29 2023-09-26 读书郎教育科技有限公司 Intelligent voice learning method and device based on voice recognition
CN113470644A (en) * 2021-06-29 2021-10-01 读书郎教育科技有限公司 Intelligent voice learning method and device based on voice recognition
CN114822527A (en) * 2021-10-11 2022-07-29 北京中电慧声科技有限公司 Error correction method and device for converting voice into text, electronic equipment and storage medium
CN114120977A (en) * 2021-11-23 2022-03-01 四川虹美智能科技有限公司 Speech recognition new word self-learning method and device
CN115623134A (en) * 2022-10-08 2023-01-17 中国电信股份有限公司 Conference audio processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2019232991A1 (en) 2019-12-12
CN108847241B (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN108847241A (en) It is method, electronic equipment and the storage medium of text by meeting speech recognition
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN109918680B (en) Entity identification method and device and computer equipment
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN110110041A (en) Wrong word correcting method, device, computer installation and storage medium
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN108549637A (en) Method for recognizing semantics, device based on phonetic and interactive system
CN113836277A (en) Machine learning system for digital assistant
EP3405912A1 (en) Analyzing textual data
CN109523989A (en) Phoneme synthesizing method, speech synthetic device, storage medium and electronic equipment
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN114580382A (en) Text error correction method and device
CN108682420A (en) A kind of voice and video telephone accent recognition method and terminal device
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN109448704A (en) Construction method, device, server and the storage medium of tone decoding figure
CN111145720A (en) Method, system, device and storage medium for converting text into voice
CN110473527B (en) Method and system for voice recognition
WO2002029783A1 (en) Method and system for using rule-based knowledge to build a class-based domain specific statistical language model
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN114625855A (en) Method, apparatus, device and medium for generating dialogue information
CN114999463B (en) Voice recognition method, device, equipment and medium
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN110020429A (en) Method for recognizing semantics and equipment
Bangalore et al. Balancing data-driven and rule-based approaches in the context of a multimodal conversational system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant