CN105159870A - Processing system for precisely completing continuous natural speech textualization and method for precisely completing continuous natural speech textualization - Google Patents


Info

Publication number
CN105159870A
CN105159870A (application CN201510364578.XA)
Authority
CN
China
Prior art keywords
audio
voice
video
speech
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510364578.XA
Other languages
Chinese (zh)
Other versions
CN105159870B (en)
Inventor
徐信
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Mosi Technology Co.,Ltd.
Original Assignee
徐信
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 徐信
Priority to CN201510364578.XA
Publication of CN105159870A
Application granted
Publication of CN105159870B
Legal status: Active
Anticipated expiration


Abstract

The invention belongs to the technical field of speech textualization, and particularly relates to a processing system and a method for precisely completing continuous natural speech textualization. The processing system comprises a cloud speech recognition engine and a speech recognition post-modification platform, and the post-modification platform is connected with the cloud speech recognition engine. The system and the method achieve the following effects: audio and video speech information is collected in real time; collection is uninterrupted, with millisecond-level timing resolution; the collection rate reaches 100 percent and the information loss rate is 0; and the accuracy of the text recorded for the speech approaches 100 percent. A scientific, convenient and user-friendly human-computer interaction platform for speech textualization is built, a 100-percent conversion rate of speech textualization and an accuracy of 99.7 percent or higher are achieved, and an electronic integrated document combining the three dimensions of audio, video and text is created.

Description

A processing system and method for precisely completing continuous natural speech textualization
Technical field
The invention belongs to the technical field of speech textualization, and specifically relates to a processing system and method for precisely completing continuous natural speech textualization.
Background art
With the development of information processing technology, human-computer interaction in natural language has become a reality. The key to such interaction is to accurately understand the natural language instructions issued or received by the user and to act on them accordingly. After an instruction is issued or received, it is converted into text. For more than a hundred years in China, how to convert continuous natural speech into text in real time, that is, shorthand, has been a problem that people have continuously explored and studied.
The main carrier of present-day shorthand is the computer, but years of practice with computer stenography have shown that the current model of the socialized shorthand service system cannot meet the deeper demands of the market.
Traditional shorthand, whether handwritten stenography, professional stenographic machines, or shorthand on an ordinary computer keyboard, suffers from long training cycles, low qualification rates, and difficulty of popularization. Stenographers work under high pressure and heavy labor intensity. Most stenographers also lack basic knowledge of each professional domain, making it difficult for the shorthand profession to meet the needs of different industries and to guarantee the quality of the work.
Therefore, a speech textualization working platform centered on speech recognition technology needs to be designed and implemented to replace traditional shorthand based on manual keyboard skills. By converting the socialized shorthand service model into a self-service model within an organization, highly qualified professionals no longer need to undergo intensive, long-term professional shorthand training; the organization itself can complete the professional work of converting speech into text.
It is precisely with the above considerations in mind that the present system builds a speech textualization working platform based on speech recognition technology, thereby reducing the labor intensity of shorthand staff, improving work quality, and shifting traditional professional shorthand from a service provided by professional stenographers to a self-service activity carried out by all kinds of personnel within an organization, which is what the deepening development of the market demands.
While traditional computer shorthand based on manual keyboard skills has the above defects, speech recognition has shown its advantages. Replacing professional stenographic machines and manual keyboard shorthand with new technology based on computer speech recognition is an inevitable outcome of the development of computer science and technology.
For Chinese speech, under the conditions of reasonably standard Mandarin and clear articulation, the current Mandarin speech recognition rate in China can reach 90 percent or higher. At the same time, speech recognition still has the following defects:
Mandarin speech recognition still faces many challenges today, and its accuracy is limited by a variety of factors.
(1) The Chinese homophone problem is very serious
Chinese characters are a very ancient writing system, and Chinese uses non-phonetic characters to record speech, so the homophone problem in Chinese is very serious.
(2) Chinese has many local dialects and dialect families
Because of dialect differences, the Han Chinese language is divided into eight major dialect groups: the Northern dialects, Xiang, Wu, Gan, Yue (Cantonese), Min Nan (including the Minnan, Hainan, Chaozhou and Leizhou varieties), Min Bei, and Hakka.
At present, Mandarin speech recognition is basically confined to relatively standard Mandarin. Recognition of local dialects and dialect-accented Mandarin will take time to reach a practical level. For now, this problem can be addressed by simultaneous re-speaking: a person who understands the dialect repeats the speech in Mandarin that the system can recognize.
(3) Mandarin proficiency varies from person to person, so the accuracy of speech recognition also varies from person to person and is never 100 percent.
(4) The influence of the recording environment: speech recognition recognizes the speech picked up by the microphone to complete the textualization task. Background noise accompanying the speech, physical noise from the transmission equipment, and input volume that is too high or too low all affect recognition accuracy.
Summary of the invention
In order to effectively solve the above problems, the invention provides a processing system and method for precisely completing continuous natural speech textualization. The technical problem to be solved by the invention is to collect audio and video speech information in real time, build a speech textualization working platform based on speech recognition technology, achieve a 100-percent conversion rate of speech textualization with an accuracy of 99.7 percent or higher, precisely complete the process of continuous natural speech textualization, and create an electronic integrated document combining the three dimensions of audio, video and text.
The specific technical solution of the invention is as follows: a processing system for precisely completing continuous natural speech textualization, comprising a cloud speech recognition engine and a speech recognition post-modification platform, the post-modification platform being connected with the cloud speech recognition engine.
Further, the speech recognition post-modification platform comprises a display unit, a modification operation unit, a control unit and a three-in-one integration generation unit; the display unit, the modification operation unit and the three-in-one integration generation unit are all connected to the control unit.
Further, the three-in-one integration generation unit generates an electronic integrated document (i.e. the target file) combining speech, image and text, in which the speech, image and text are associated in one-to-one correspondence;
The display unit simultaneously displays an audio-video document view comprising an operation toolbar, an audio waveform graph, a list of audio information and text content, and a video display frame;
The modification operation unit supports modification by speech (re-speaking), by keyboard, by mouse, and by keyboard plus mouse.
Further, the control unit comprises an audio extraction module, a segmentation judgment module, an audio waveform conversion module, a three-in-one association module and a central processing module;
The audio extraction module, the segmentation judgment module, the audio waveform conversion module and the three-in-one association module are all connected to the central processing module, the central processing module is logically connected to the display unit, and the modification operation unit is connected to the central processing module.
Further, the cloud speech recognition engine comprises a Chinese speech segmentation processing module and a Mandarin speech recognition module.
Further, the Chinese speech segmentation processing module cuts the input speech into short segments, placing cut points at pauses in the speech or at the end of a sentence; each cut point is a low point of speech energy, and the module outputs the split-time information for the input speech (a sketch of such energy-based cutting follows).
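As a rough illustration of how such energy-based cutting could work, the following sketch (not from the patent; a minimal Python example assuming 16 kHz, 16-bit PCM mono audio loaded as a NumPy array, with thresholds chosen arbitrarily) finds low-energy points that could serve as candidate cut points:

```python
import numpy as np

def candidate_cut_points(samples, sr=16000, frame_ms=25, hop_ms=10,
                         min_seg_s=2.0, energy_percentile=10):
    """Return sample indices of low-energy frames usable as cut points.

    samples: 1-D mono speech array at sample rate sr.
    Frames whose short-time energy falls below the given percentile are
    treated as pauses; one cut point is placed at the energy minimum of
    each pause, keeping segments at least min_seg_s long.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    x = samples.astype(np.float64)
    n_frames = 1 + max(0, (len(x) - frame) // hop)
    energy = np.array([np.sum(x[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    threshold = np.percentile(energy, energy_percentile)

    cuts, last_cut, i = [], 0, 0
    while i < n_frames:
        if energy[i] < threshold:                      # start of a pause
            j = i
            while j < n_frames and energy[j] < threshold:
                j += 1
            k = i + int(np.argmin(energy[i:j]))        # deepest point of the pause
            pos = k * hop + frame // 2
            if (pos - last_cut) / sr >= min_seg_s:     # keep segments long enough
                cuts.append(pos)
                last_cut = pos
            i = j
        else:
            i += 1
    return cuts
```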
Further, the Mandarin speech recognition module comprises: a Chinese speech feature extraction unit, a Chinese speech-to-text conversion recognition unit, a Chinese speech-text association unit, a Chinese forced alignment unit, a Chinese pinyin annotation unit, a Chinese everyday vocabulary unit, a Chinese acoustic model unit, a Chinese language model unit and a new-word adaptive training unit;
Chinese speech feature extraction unit: its input is the segmented Chinese speech data recorded through a microphone and USB sound card at a 16 kHz sampling rate in 16-bit linear PCM, and its output is the Mel-cepstral features of the input speech segments (see the sketch after this list);
Chinese speech-to-text conversion recognition core unit: its input is the Mel-cepstral features of the 16 kHz, 16-bit linear PCM speech to be recognized, and its output is the text content of that speech segment;
Chinese speech-text-image association unit: establishes the temporal correspondence between the text output by the recognition module, the original 16 kHz, 16-bit linear PCM speech recorded through the microphone and USB sound card, and the synchronously captured images;
Chinese forced alignment unit: its input is the 16 kHz, 16-bit linear PCM speech and the standard text transcript recognized for that speech segment, and its output is the information on the time correspondence between text and speech;
Chinese pinyin annotation unit: annotates text entered by the user with pinyin according to the requirements of the language model, for use in language model recognition;
Chinese everyday vocabulary unit: used for standard Chinese pinyin annotation and provides guiding knowledge for the language model;
Chinese acoustic model unit: provides acoustic guiding knowledge for the speech recognition engine;
Chinese language model unit: provides linguistic guiding knowledge for the speech recognition engine;
New-word adaptive training unit: regenerates the language model for newly added words; the system computes the text and pinyin of a specialized word the first time it is entered, and when that word appears again in later speech the system can recognize it.
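For illustration only, the sketch below shows one common way to obtain Mel-cepstral (MFCC) features from 16 kHz, 16-bit PCM audio; it uses the third-party librosa library and frame parameters of my own choosing, neither of which is specified by the patent:

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load a mono 16 kHz recording and return MFCC features.

    Returns an array of shape (n_frames, n_mfcc); the 25 ms / 10 ms
    frame and hop sizes are conventional choices, not patent values.
    """
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(
        y=audio, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.T  # one feature vector per frame
```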
A method for precisely completing continuous natural speech textualization, using the above processing system, comprises the following steps (a pipeline sketch follows the list):
A. obtaining an audio/video stream or an audio/video file from an on-site audio/video information acquisition system, or an audio/video file whose collection has already been completed;
B. preprocessing the audio/video stream or file;
C. uploading the preprocessed audio/video stream or file to the cloud speech recognition engine for cutting and recognition;
D. the cloud speech recognition engine feeds back the cutting and recognition results;
E. adjusting the cut points fed back by the cloud speech recognition engine;
F. modifying the recognized text after the cut points have been adjusted, where the modification modes include: re-recognition by manual re-speaking, direct re-recognition, and keyboard modification;
G. performing basic proofreading on the modified text;
H. after basic proofreading, performing full-text proofreading;
I. after full-text proofreading, editing and typesetting;
J. generating and storing the target file.
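To make the overall flow concrete, here is a highly simplified sketch of steps A–J as a linear pipeline; every stage name below is a hypothetical placeholder passed in by the caller, not an interface defined by the patent:

```python
from typing import Callable

def textualize(av_source,
               acquire: Callable, preprocess: Callable, recognize: Callable,
               adjust_cuts: Callable, revise: Callable, proofread: Callable,
               typeset: Callable, pack: Callable):
    """Steps A-J as a pipeline; each stage is supplied as a callable."""
    av = acquire(av_source)              # A: stream or file from the acquisition system
    segments = preprocess(av)            # B: separate audio, resample, segment
    result = recognize(segments)         # C/D: cloud engine returns cut points + text
    result = adjust_cuts(result)         # E: operator moves cut points to energy low points
    result = revise(result)              # F: re-speak / re-run recognition / keyboard
    result = proofread(result, "basic")  # G: phrase- or sentence-level proofreading
    result = proofread(result, "full")   # H: paragraph- or full-text proofreading
    doc = typeset(result)                # I: punctuation, headings, layout
    return pack(doc, av)                 # J: integrated audio/video/text target file
```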
Further, the on-site audio/video information acquisition system in step A comprises audio/video input devices connected to a computer; the audio/video inputs carry out on-site collection of audio/video speech information and then transmit it to the processing system for textualization in real time;
On-site audio/video information collection has two modes, local collection and off-site remote collection, and both modes support two acquisition methods: file acquisition and streaming-media acquisition;
I. File acquisition
I1: on start-up, the audio/video acquisition devices are initialized, and audio/video images and speech are collected automatically by the speech capture device (microphone) and the video capture device (camera);
I2: an arbitrary collection duration can be set; the system automatically saves a clip file of the set duration and automatically uploads it to the processing system;
I3: collection can also be started and stopped manually, and manual control can be combined with automatic control to complete the collection of clip files;
I4: a file prefix can be entered when collection is started manually; the prefix of clip files generated during automatic collection remains unchanged, and a new prefix is entered when collection is stopped manually and started again;
I5: collected clip files can be merged into one file automatically according to their prefix (see the merge sketch after this list);
I6: any set of collected clip files can be selected manually and merged into one file;
I7: the system automatically merges the clip files that have been processed;
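As a rough illustration of items I5/I6, the following sketch (not part of the patent; the file naming scheme and the use of ffmpeg's concat demuxer are assumptions) merges clip files that share a prefix into one file:

```python
import glob
import subprocess
import tempfile

def merge_clips_by_prefix(prefix, out_path):
    """Concatenate clip files named <prefix>_*.mp4 into one file.

    Assumes the clips share codec settings, so ffmpeg's concat demuxer
    can join them without re-encoding; the naming scheme is illustrative.
    """
    clips = sorted(glob.glob(f"{prefix}_*.mp4"))
    if not clips:
        raise FileNotFoundError(f"no clips found for prefix {prefix!r}")
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for clip in clips:
            f.write(f"file '{clip}'\n")
        list_path = f.name
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0",
         "-i", list_path, "-c", "copy", out_path],
        check=True)
```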
II. Streaming-media acquisition
II1: audio/video images and speech are collected automatically by the speech capture device (microphone) and the video capture device (camera);
II2: the audio/video stream collected in real time is uploaded to the processing system;
II3: a hard-disk backup is made while the audio/video stream is being uploaded.
In step B, preprocessing of an audio/video file comprises the following steps (an audio extraction and resampling sketch follows these steps):
B11 Audio/video file encoding and recording: the recording is stored as a temporary file that is invisible outside the system, so that a large amount of memory is not consumed during on-site collection and the system does not crash; at the same time a time index is created for later processing. The recording time the system can sustain is determined by the remaining hard-disk space of the computer hosting the audio/video information processing system; under the default recording format, disk consumption is about 5 GB per hour;
B12 Audio/video separation: the audio stream is separated out of the combined file while the video stream keeps its original form;
B13 Audio sample-rate conversion: different audio sampling frequencies and bit rates are applied for different collected files;
B14 Speech segmentation: the audio signal is decomposed into speech segments of specific time intervals while the timing correspondence with the video is retained. The cut point of each segment should be at the end of a sentence or at a pause within a sentence; the segmented speech audio is passed to the processing system, whose waveform display module displays its waveform.
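The separation and resampling of B12/B13 could be done with a tool such as ffmpeg; the sketch below is an assumption-laden example (codec, sample rate and file names are my own choices, not requirements of the patent):

```python
import subprocess

def extract_and_resample(av_path, wav_path, sr=16000):
    """Pull the audio track out of an A/V file as 16-bit mono PCM WAV.

    -vn drops the video stream (the original video file is left untouched),
    -ac 1 downmixes to mono, -ar sets the target sample rate.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", av_path,
         "-vn", "-ac", "1", "-ar", str(sr),
         "-acodec", "pcm_s16le", wav_path],
        check=True)
```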
Preprocessing of an audio/video stream in step B specifically comprises the following steps:
B21 The acquisition device passes the audio/video stream directly to the processing system, and the waveform display module of the processing system updates the displayed waveform in real time from the received audio stream;
B22 The streaming content is played as it is processed;
B23 The received speech stream is sent directly to the cloud speech recognition engine for recognition; the engine cuts the speech stream into paragraphs and returns the recognition results;
B24 For the returned recognition results, the cut points are readjusted on the speech recognition post-modification platform and the stream is sent to the cloud speech recognition engine again for recognition.
In step F, the direct re-recognition mode is: each audio segment produced by speech segmentation is sent in turn to the speech recognition engine and converted directly into the corresponding text;
For input speech that is clear and spoken in reasonably good Mandarin, the speech recognition engine is used directly for textualization; the cloud speech recognition engine converts the speech directly into text from the input audio information and preserves the timing information linking the text to the audio;
In step F, the manual re-speaking mode is: the system plays the recording, the operator repeats what is played, and the speech recognition engine recognizes the repetition; the re-spoken audio not only serves as interpretation information during processing but is also recorded in the output file;
In step F, the keyboard modification mode is: for the input speech, the operator takes dictation sentence by sentence using a keyboard input method and converts it into text manually.
Further, the target-file proofreading in step G is: proofreading is the process of further revising the generated text, and it is divided into two steps, basic proofreading and full-text proofreading; the proofreading process generates plain text only and contains no typesetting information;
Basic proofreading is proofreading based on phrases or sentences: one or several phrases or speech segments are combined at a time, the combined speech is played, the corresponding converted text is displayed, and proofreading is done against the played speech and the displayed text;
Full-text proofreading is proofreading based on paragraphs or the whole text: one or several speech segments are combined according to the proofreading basis, the combined speech is played, the corresponding text is displayed, and proofreading is done against the played speech and the displayed text;
The editing and typesetting in step I specifically comprise: adding punctuation, numbers, line breaks and paragraph breaks to the finalized text, and adding headings, subheadings, spaces and similar elements;
Generating and storing the target file in step J is: the proofread and typeset document, the video and the speech information are packed together to generate a three-in-one electronic integrated document (i.e. the target file), which is output and stored. This file can be copied and transmitted, and can be opened, browsed and queried by the audio/video information retrieval system, but cannot be modified (a packing sketch follows).
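One plausible, purely illustrative way to realize such an integrated target file is an archive holding the media plus a manifest that records the word-to-time alignment; the JSON layout and zip packaging below are assumptions, not the format defined by the patent:

```python
import json
import zipfile

def pack_target_file(out_path, video_path, audio_path, words):
    """Pack video, audio and time-aligned text into one archive.

    words: list of dicts like {"text": "...", "start": 1.23, "end": 1.71},
    giving the one-to-one correspondence between text and audio/video time.
    """
    manifest = {
        "video": "media/video.mp4",
        "audio": "media/audio.wav",
        "words": words,
    }
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(video_path, "media/video.mp4")
        z.write(audio_path, "media/audio.wav")
        z.writestr("manifest.json",
                   json.dumps(manifest, ensure_ascii=False, indent=2))
```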
Beneficial effects of the invention: the invention solves the remaining problems caused by the limitations of speech recognition, with the following effects: 1. audio/video speech information is collected in real time, collection is uninterrupted with millisecond-level resolution, the collection rate reaches 100 percent and the information loss rate is 0; 2. a scientific, convenient and feasible method has been designed and implemented, and a user-friendly post-modification platform for speech textualization has been built, so that operators can easily revise the recognition results and the text recorded for the speech approaches 100 percent accuracy, establishing a scientific, convenient and user-friendly human-computer interaction system for speech textualization; 3. a 100-percent conversion rate of speech textualization and an accuracy of 99.7 percent or higher are achieved; 4. an electronic integrated document combining audio, video and text is created.
Brief description of the drawings
Fig. 1 is a flowchart of precisely completing continuous natural speech textualization;
Fig. 2 is a flowchart of the file acquisition mode;
Fig. 3 is a flowchart of the streaming-media acquisition mode;
Fig. 4 is a processing flowchart of streaming-media acquisition in the remote acquisition mode;
Fig. 5 is a processing flowchart of file acquisition in the remote acquisition mode;
Fig. 6 is a structural diagram of the audio/video speech information textualization processing system;
Fig. 7 is a flowchart of audio/video speech preprocessing;
Fig. 8 is a flowchart of the speech textualization post-modification platform.
Detailed description of the embodiments
The technical solution of the invention is described in detail below with reference to examples. Obviously, the examples described are only a small part of the invention rather than all of them. Based on the examples in the invention, all other examples obtained by those skilled in the art without creative work fall within the scope of protection of the invention.
As shown in Fig. 1, an embodiment of the invention provides a flowchart of a method for precisely completing continuous natural speech textualization; the method specifically comprises the following steps:
A. obtaining an audio/video stream or an audio/video file from an on-site audio/video information acquisition system, or an audio/video file whose collection has already been completed;
B. preprocessing the audio/video stream or file;
C. uploading the preprocessed audio/video stream or file to the cloud speech recognition engine for cutting and recognition;
D. the cloud speech recognition engine feeds back the cutting and recognition results;
E. adjusting the cut points fed back by the cloud speech recognition engine;
F. modifying the recognized text after the cut points have been adjusted, where the modification modes include: re-recognition by manual re-speaking, direct re-recognition, and keyboard modification;
G. performing basic proofreading on the modified text;
H. after basic proofreading, performing full-text proofreading;
I. after full-text proofreading, editing and typesetting;
J. storing the target file or saving it as a Word-compatible document.
In applying the above steps, the invention provides targeted solutions to the errors that easily arise in textualizing audio/video files, as explained in detail below.
The on-site audio/video information acquisition system in step A comprises audio/video input devices connected to a computer; they carry out on-site collection of audio/video speech information and then transmit it to the processing system for textualization in real time.
On-site audio/video information collection has two modes, local collection and off-site remote collection, and both modes support two acquisition methods, file acquisition and streaming-media acquisition; Fig. 2 is the flowchart of the file acquisition mode and Fig. 3 is the flowchart of the streaming-media acquisition mode.
I. File acquisition
I1: on start-up, the audio/video acquisition devices are initialized, and audio/video images and speech are collected automatically by the speech capture device (microphone) and the video capture device (camera).
I2: an arbitrary collection duration (in minutes) can be set; the system automatically saves a clip file of the set duration and automatically uploads it to the processing system.
I3: collection can also be started and stopped manually, and manual control can be combined with automatic control to complete the collection of clip files.
I4: a file prefix can be entered when collection is started manually; the prefix of clip files generated during automatic collection remains unchanged, and a new prefix is entered when collection is stopped manually and started again.
I5: collected clip files can be merged into one file automatically according to their prefix.
I6: any set of collected clip files can be selected manually and merged into one file.
I7: the system automatically merges the clip files that have been processed.
II. Streaming-media acquisition
II1: audio/video images and speech are collected automatically by the speech capture device (microphone) and the video capture device (camera);
II2: the audio/video stream collected in real time is uploaded to the processing system.
II3: a hard-disk backup is made while the audio/video stream is being uploaded.
Fig. 4 is the acquisition processing flowchart of the remote acquisition system.
Fig. 5 is the processing flowchart of file acquisition in the remote acquisition mode. In the remote acquisition setup, the acquisition system and the processing system are located on two computers in different places and are connected by a wired or wireless computer network; the collected audio/video files or streaming media are sent to the processing system at the remote site, so the local processing system acts as the receiving end of the remote acquisition system. During remote collection, the collected audio/video files or streams are also retained locally on the acquisition system as a backup, ensuring that the collected audio/video information cannot be lost. The processing system remotely controls the acquisition system over the network, forming an unattended acquisition terminal whose functions, such as starting collection, stopping collection and starting the transmission of collected data, can all be controlled remotely (a control-protocol sketch follows).
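As one possible, entirely illustrative realization of this remote control, the acquisition terminal could expose a small text command protocol over TCP; the commands and port below are invented for the sketch and are not specified by the patent:

```python
import socketserver

COMMANDS = {"START", "STOP", "SEND"}   # start collection, stop, start transmission

class AcquisitionControlHandler(socketserver.StreamRequestHandler):
    """Handle one-line text commands from the remote processing system."""
    def handle(self):
        cmd = self.rfile.readline().decode().strip().upper()
        if cmd in COMMANDS:
            # In a real terminal this would start/stop the capture devices
            # or begin uploading the backed-up clips.
            self.wfile.write(b"OK " + cmd.encode() + b"\n")
        else:
            self.wfile.write(b"ERR unknown command\n")

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 9000), AcquisitionControlHandler) as srv:
        srv.serve_forever()
```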
In step F, the direct re-recognition mode is specifically: each audio segment produced by speech segmentation is sent in turn to the speech recognition engine and converted directly into the corresponding text.
For input speech that is clear and spoken in reasonably good Mandarin, the speech recognition engine is used directly for textualization. The speech recognition engine converts the speech directly into text from the input audio information and preserves the timing information linking the text to the audio.
In step F, the manual re-speaking mode is specifically: for input speech in a local dialect, in unclear Mandarin, or otherwise unrecognizable by the computer, the operator repeats the speech and then sends it to the speech recognition engine for textualization. The flow is: the system plays the recording, the operator repeats what is played, and the speech recognition engine recognizes the repetition. The re-spoken audio not only serves as interpretation information during processing but is also recorded in the output file.
Note: when textualizing a live recording and the input device is a microphone, this function requires a computer with two sound cards (i.e. two or more MIC inputs); if the computer has only one sound card (only one MIC input port), this function cannot be used for text generation.
In step F, the keyboard modification mode is specifically: for the input speech, the operator takes dictation sentence by sentence using a keyboard input method and converts it into text manually.
The target-file proofreading in step G is specifically:
Target-file proofreading is the process of further revising the generated text, divided into two steps: basic proofreading and full-text proofreading. The proofreading process generates plain text only and contains no typesetting information.
i. Basic proofreading:
Basic proofreading is proofreading based on phrases or sentences: one or several phrases or sentences are combined at a time, played back, and the corresponding converted text is displayed. The operator proofreads against the played speech and the displayed text, adding punctuation at the same time.
During proofreading, the speech information is replayed in the following ways:
● the speech is replayed automatically at an interval set by the user (from 1 to n seconds);
● the speech is replayed by a specific hotkey (a start/stop toggle) defined by the system.
When proofreading the text, the following modes can be chosen:
● proofreading in a dedicated proofreading row, separate from the display row of the segmented text;
● proofreading and modifying in place in the display row of the segmented text;
● when a given segment of text is being proofread, a large window pops up in real time showing that row of text, and proofreading is done in that window.
ii. The full-text proofreading in step H is specifically:
Full-text proofreading is proofreading based on paragraphs or the whole text: one or several speech segments are combined according to the proofreading basis, the combined speech is played, and the corresponding text is displayed. The operator proofreads against the speech and the displayed text, adding or modifying punctuation.
During proofreading, speech is replayed using a combination of timing and hotkeys: it is replayed automatically at the interval set by the user, or replayed by a specific hotkey defined by the system.
Speech, image and text are played back synchronously, keeping the precise temporal association between the text and the speech and image. When the speech and image are played, the text corresponding to the speech is automatically highlighted (see the highlighting sketch below).
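A minimal sketch of such word-level highlighting, assuming the word/time alignment produced earlier is available as a list of (start, end, text) entries; the data layout and function are illustrative only:

```python
import bisect

def word_at(alignment, t):
    """Return the index of the word being spoken at playback time t (seconds).

    alignment: list of (start, end, text) tuples sorted by start time.
    Returns None if t falls in a pause between words.
    """
    starts = [w[0] for w in alignment]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and alignment[i][0] <= t <= alignment[i][1]:
        return i          # the UI would highlight alignment[i][2]
    return None

# Example: the highlight follows the playback clock
alignment = [(0.00, 0.42, "语音"), (0.42, 0.90, "识别"), (1.10, 1.55, "平台")]
print(word_at(alignment, 0.5))   # -> 1
```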
The target-file editing and typesetting in steps I–J are specifically:
What is obtained after conversion and proofreading is plain text containing no typesetting information. The system provides simple editing and typesetting functions.
Under the editing function, content can be added to or deleted from the original text, and punctuation, numbers, line breaks, paragraph breaks, spaces and similar information can be added. The typesetting function provides predefined combinations of font size and typeface and allows text to be marked as headline, subheading and other document elements. The user can select a combination as needed and apply the format it defines to the selected text. A page-preview mode is not provided. The user can export the text as plain text or rich text as needed and import it into another typesetting system for layout and printing.
The printing function allows the document to be printed according to the typeset format. It depends on the printing functionality of the .NET Framework and offers printer selection, paper selection and, depending on the printer, optional duplex printing.
For advanced document editing, the system exports a Microsoft Word compatible document (RTF) and then calls Microsoft Office Word to perform advanced typesetting and printing. This function requires Microsoft Office Word 2003 or later to be installed on the user's terminal.
The generation and output of the target file are specifically:
After the operator completes proofreading and typesetting, the system packs the proofread document, the video and the speech information into an output file in a format supported by the system. This file can be copied and transmitted, and can be opened, browsed and queried by the audio/video information retrieval system, but cannot be modified.
The processing speed of target-file output depends on the length of the audio/video information; under the recommended configuration, more than 2.5 seconds of audio/video information are compressed per second of processing.
Audio/video stream compression coding is specifically:
After the speech signal has been analyzed and the related documents generated, the system packs the related documents together with the audio/video file and compresses the audio/video file with a fixed codec and compression format. The audio/video information is stored in the file in compressed form to save hard-disk space, and the generated related files are available to the retrieval engine but cannot be modified. The compression speed depends on the chosen format and the speed of the computer. Throughout this processing, the association between audio and video is kept consistent.
Fig. 6 is a schematic diagram of the audio/video speech information textualization processing system of the invention.
Fig. 7 is the preprocessing flowchart of the audio/video speech information textualization processing system.
An embodiment of the invention provides an audio/video speech information textualization processing system comprising a cloud speech recognition engine and a speech recognition post-modification platform.
The cloud speech recognition engine comprises a Chinese speech segmentation processing module and a Mandarin speech recognition module, and it specifically implements steps B–D of the above method.
The Chinese speech segmentation processing module cuts the long input speech into short segments, placing cut points at pauses in the speech or at the end of a sentence; each cut point is a low point of speech energy. The length of each segment varies with the speaker and the content, generally around 10 to 20 characters. The "speech data" input to the module is the Mandarin Chinese speech data required by the engine's recognizer, and the output of the engine is the split-time information for the input speech.
The Mandarin speech recognition module comprises: a Chinese speech feature extraction unit, a Chinese speech-to-text conversion recognition unit, a Chinese speech-text association unit, a Chinese forced alignment unit, a Chinese pinyin annotation unit, a Chinese everyday vocabulary unit, a Chinese acoustic model unit, a Chinese language model unit and a new-word adaptive training unit.
a. Chinese speech feature extraction unit: its input is the segmented Chinese speech data recorded through a microphone and USB sound card at a 16 kHz sampling rate in 16-bit linear PCM, and its output is the Mel-cepstral features of the input speech segments.
b. Chinese speech-to-text conversion recognition core unit: its input is the Mel-cepstral features of the 16 kHz, 16-bit linear PCM speech to be recognized, and its output is the text content of that speech segment.
c. Chinese speech-text association unit: establishes the temporal correspondence between the text output by the recognition module and the original 16 kHz, 16-bit linear PCM speech recorded through the microphone and USB sound card.
d. Chinese forced alignment unit: its input is the 16 kHz, 16-bit linear PCM speech and the standard text transcript recognized for that speech segment, and its output is the information on the time correspondence between text and speech.
e. Chinese pinyin annotation unit: annotates text entered by the user with pinyin according to the requirements of the language model, for use in language model recognition.
f. Chinese everyday vocabulary unit: used for standard Chinese pinyin annotation and provides guiding knowledge for the language model.
g. Chinese acoustic model unit: provides acoustic guiding knowledge for the speech recognition engine.
The acoustic model is created by the following steps: a plurality of standard teachers' voices are obtained; balanced speech parameters are selected from these voices, i.e. values averaged over the voices of all the standard-pronunciation teachers, where the speech parameters include acoustic parameters, pitch and rhythm information;
Tone-matched speech is synthesized from the balanced speech parameters of the standard teachers' voices, and the Chinese acoustic model unit is built from the tone-matched speech in combination with the TD-PSOLA algorithm.
h. Chinese language model unit: provides linguistic guiding knowledge for the speech recognition engine.
The Chinese language model may be any prior-art language model that provides linguistic guiding knowledge and a corresponding corpus.
j. New-word adaptive training unit: provides tools with which new words can be added and the language model regenerated. The system computes the text and pinyin of a specialized word the first time it is entered, and when that word appears again in later speech the system can recognize it (a toy sketch follows).
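A toy illustration of such new-word adaptation; everything here, from the plain-text lexicon layout and the pypinyin helper to retraining a 3-gram model with KenLM's lmplz, is an assumption made for the sketch rather than the patent's actual tooling:

```python
import subprocess
from pypinyin import lazy_pinyin   # third-party; assumed available

def add_new_word(word, lexicon_path, corpus_path, lm_path):
    """Add a specialized word to the lexicon and rebuild the language model.

    The lexicon is assumed to be a plain-text file of 'word pinyin...' lines,
    and the LM is rebuilt with KenLM's lmplz as a 3-gram ARPA model; both
    choices are illustrative, not mandated by the patent.
    """
    pinyin = " ".join(lazy_pinyin(word))
    with open(lexicon_path, "a", encoding="utf-8") as lex:
        lex.write(f"{word} {pinyin}\n")
    # Also expose the word to the LM training corpus so it gets probability mass.
    with open(corpus_path, "a", encoding="utf-8") as corpus:
        corpus.write(word + "\n")
    with open(corpus_path, "rb") as src, open(lm_path, "wb") as dst:
        subprocess.run(["lmplz", "-o", "3"], stdin=src, stdout=dst, check=True)
```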
After the speech stream has been uploaded to the speech recognition engine, recognition is completed automatically by the above modules and units.
As shown in Fig. 8, in order to solve the remaining problems caused by the limitations of speech recognition, the invention proposes a user-friendly speech recognition post-modification platform, so that operators can easily revise the recognition results and the text recorded for the speech reaches 100 percent accuracy. The results of speech recognition and of post-recognition revision are further integrated into a three-in-one electronic integrated document of speech, image and text (i.e. audio, video and text). This electronic integrated document combines the speech, the images and the text recording the speech, and keeps the speech, images and text in one-to-one correspondence, so that the original audio/video information can later be browsed and retrieved conveniently through the associated text.
The speech recognition post-modification platform comprises a display unit, a modification operation unit, a control unit and a three-in-one integration generation unit; the display unit, the modification operation unit and the three-in-one integration generation unit are all connected to the control unit.
The control unit is connected to the cloud speech recognition engine and can send audio/video streams or files to, and receive them from, the cloud speech recognition engine. The display unit displays the audio/video document and simultaneously shows an audio/video document view comprising an operation toolbar, an audio waveform graph, a list of audio information and text content, and a video display frame. The modification operation unit supports modification by keyboard, by mouse, and by keyboard plus mouse.
The control unit comprises an audio extraction module, a segmentation judgment module, an audio waveform conversion module, a three-in-one association module and a central processing module; the audio extraction module, the segmentation judgment module, the audio waveform conversion module and the three-in-one association module are all connected to the central processing module, the central processing module is logically connected to the display unit, and the modification operation unit is connected to the central processing module.
The central processing module receives instructions from the modification operation unit and displays its operation steps on the display unit; the display unit also includes a video waveform bar and a processing feedback display bar.
The central processing module receives and extracts the audio/video file; the three-in-one association module verifies the audio/video file, extracts its mapping relations, and judges whether the file uses a one-to-one mapping between speech, audio and video.
The audio extraction module extracts the audio from the audio/video file and sends the extracted audio to the audio waveform conversion module.
The audio waveform conversion module converts the audio into a waveform graph and sends it through the central processing module to the display unit, which displays the converted waveform. The waveform has energy low points, and the segmentation judgment module judges whether a manually adjusted cut point lies on an energy low point; if it does, no prompt is given, and if it does not, the cut point is highlighted in red (a sketch of this check follows).
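The cut-point check performed by the segmentation judgment module could be sketched as follows; this is a minimal example assuming per-frame energies like those computed earlier, with a tolerance window and percentile threshold of my own choosing:

```python
import numpy as np

def is_energy_low_point(energy, frame_idx, window=5, percentile=10):
    """Return True if the adjusted cut point sits in a local energy trough.

    energy: per-frame short-time energies of the audio.
    frame_idx: frame index of the manually adjusted cut point.
    The cut point is accepted if its neighborhood minimum is below the
    global low-energy threshold; otherwise the UI would mark it in red.
    """
    lo = max(0, frame_idx - window)
    hi = min(len(energy), frame_idx + window + 1)
    threshold = np.percentile(energy, percentile)
    return float(np.min(energy[lo:hi])) <= threshold

# Usage: mark_red = not is_energy_low_point(energy, adjusted_frame)
```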

Claims (10)

1. A processing system for precisely completing continuous natural speech textualization, characterized in that the processing system comprises a cloud speech recognition engine and a speech recognition post-modification platform, the speech recognition post-modification platform being connected with the cloud speech recognition engine.
2. The processing system for precisely completing continuous natural speech textualization according to claim 1, characterized in that the speech recognition post-modification platform comprises a display unit, a modification operation unit, a control unit and a three-in-one integration generation unit; the display unit, the modification operation unit and the three-in-one integration generation unit are all connected to the control unit.
3. The processing system for precisely completing continuous natural speech textualization according to claim 2, characterized in that the three-in-one integration generation unit generates an electronic integrated document combining speech, image and text, in which the speech, image and text are associated and mapped in one-to-one correspondence;
the display unit simultaneously displays an audio-video document view comprising an operation toolbar, an audio waveform graph, a list of audio information and text content, and a video display frame;
the modification operation unit supports modification by speech (re-speaking), by keyboard, by mouse, and by keyboard plus mouse.
4. The processing system for precisely completing continuous natural speech textualization according to claim 3, characterized in that the control unit comprises an audio extraction module, a segmentation judgment module, an audio waveform conversion module, a three-in-one association module and a central processing module;
the audio extraction module, the segmentation judgment module, the audio waveform conversion module and the three-in-one association module are all connected to the central processing module, the central processing module is logically connected to the display unit, and the modification operation unit is connected to the central processing module.
5. The processing system for precisely completing continuous natural speech textualization according to claim 3, characterized in that the cloud speech recognition engine comprises a Chinese speech segmentation processing module and a Mandarin speech recognition module;
the Chinese speech segmentation processing module cuts the input speech into short segments, placing cut points at pauses in the speech or at the end of a sentence; each cut point is a low point of speech energy, and the Chinese speech segmentation processing module outputs the split-time information for the input speech.
6. The processing system for precisely completing continuous natural speech textualization according to claim 5, characterized in that the Mandarin speech recognition module comprises: a Chinese speech feature extraction unit, a Chinese speech-to-text conversion recognition unit, a Chinese speech-text association unit, a Chinese forced alignment unit, a Chinese pinyin annotation unit, a Chinese everyday vocabulary unit, a Chinese acoustic model unit, a Chinese language model unit and a new-word adaptive training unit;
the Chinese speech feature extraction unit: its input is the segmented Chinese speech data recorded through a microphone and USB sound card at a 16 kHz sampling rate in 16-bit linear PCM, and its output is the Mel-cepstral features of the input speech segments;
the Chinese speech-to-text conversion recognition core unit: its input is the Mel-cepstral features of the 16 kHz, 16-bit linear PCM speech to be recognized, and its output is the text content of that speech segment;
the Chinese speech-text association unit: establishes the temporal correspondence between the text output by the recognition module and the original 16 kHz, 16-bit linear PCM speech recorded through the microphone and USB sound card;
the Chinese forced alignment unit: its input is the 16 kHz, 16-bit linear PCM speech and the standard text transcript recognized for that speech segment, and its output is the information on the time correspondence between text and speech;
the Chinese pinyin annotation unit: annotates text entered by the user with pinyin according to the requirements of the language model, for use in language model recognition;
the Chinese everyday vocabulary unit: used for standard Chinese pinyin annotation and provides guiding knowledge for the language model;
the Chinese acoustic model unit: provides acoustic guiding knowledge for the speech recognition engine;
the Chinese language model unit: provides linguistic guiding knowledge for the speech recognition engine;
the new-word adaptive training unit: regenerates the language model for newly added words, and computes for the system the text and pinyin of a specialized word the first time it is entered.
7. A method for precisely completing continuous natural speech textualization, using the processing system according to any one of claims 4 to 6, characterized in that the method comprises the following steps:
A. obtaining an audio/video stream or an audio/video file from an on-site audio/video information acquisition system, or an audio/video file whose collection has already been completed;
B. preprocessing the audio/video stream or file;
C. uploading the speech stream of the preprocessed audio/video stream, or the speech segments of the preprocessed audio/video file, to the cloud speech recognition engine for cutting and recognition;
D. the cloud speech recognition engine feeds back the cutting and recognition results;
E. adjusting the cut points fed back by the cloud speech recognition engine;
F. modifying the recognized text after the cut points have been adjusted, where the modification modes include: re-recognition by manual re-speaking, direct re-recognition, and keyboard modification;
G. performing basic proofreading on the modified text;
H. after basic proofreading, performing full-text proofreading;
I. after full-text proofreading, editing and typesetting;
J. generating and storing the target file.
8. The method for precisely completing continuous natural speech textualization according to claim 7, characterized in that the on-site audio/video information acquisition system in step A comprises audio/video input devices connected to a computer; the audio/video inputs carry out on-site collection of audio/video speech information and then transmit it to the processing system for textualization in real time;
on-site audio/video information collection has two modes, local collection and off-site remote collection, and both modes support two acquisition methods: file acquisition and streaming-media acquisition;
I. File acquisition
I1: on start-up, the audio/video acquisition devices are initialized, and audio/video images and speech are collected automatically by the speech capture device (microphone) and the video capture device (camera);
I2: an arbitrary collection duration is set; a clip file of the set duration is saved automatically and automatically uploaded to the cloud speech recognition engine;
I3: collection is started and stopped manually, or the collection of clip files is completed by combining manual and automatic control;
I4: a file prefix can be entered when collection is started manually; the prefix of clip files generated during automatic collection remains unchanged, and a new prefix is entered when collection is stopped manually and started again;
I5: collected clip files can be merged into one file automatically according to their prefix;
I6: any set of collected clip files can be selected manually and merged into one file;
I7: the speech recognition post-modification platform automatically merges the clip files that have been processed;
II. Streaming-media acquisition
II1: audio/video images and speech are collected automatically by the speech capture device (microphone) and the video capture device (camera);
II2: the audio/video stream collected in real time is preprocessed by the processing system and the speech stream is uploaded to the cloud speech recognition engine;
II3: a hard-disk backup is made while the audio/video stream is being uploaded;
in step B, preprocessing of an audio/video file comprises the following steps:
B11 audio/video file encoding and recording: the recording is stored as a temporary file that is invisible outside the system, so that a large amount of memory is not consumed during on-site collection and the system does not crash; at the same time a time index is created for later processing; the recording time the system can sustain is determined by the remaining hard-disk space of the computer hosting the audio/video information processing system, and under the default recording format, disk consumption is about 5 GB per hour;
B12 audio/video separation: the audio stream is separated out of the combined file while the video stream keeps its original form;
B13 audio sample-rate conversion: different audio sampling frequencies and bit rates are applied for different collected files;
B14 speech segmentation: the audio signal is decomposed into audio segments of specific time intervals while the timing correspondence with the video is retained;
the cut point of each segment should be at the end of a sentence or at a pause within a sentence;
preprocessing of an audio/video stream in step B specifically comprises the following steps:
B21 the acquisition device passes the audio/video stream directly to the processing system, and the waveform display module of the processing system updates the displayed waveform in real time from the received audio stream;
B22 the streaming content is played as it is processed;
B23 the received speech stream is sent directly to the cloud speech recognition engine for recognition, and the engine cuts the speech stream into paragraphs and returns the recognition results;
B24 for the returned recognition results, the cut points are readjusted on the speech recognition post-modification platform and sent to the cloud speech recognition engine again for recognition.
9. The method for precisely completing continuous natural speech textualization according to claim 7, characterized in that:
in step F, the direct re-recognition mode is: each speech segment produced by speech segmentation is sent in turn to the speech recognition engine and converted into the corresponding text;
the speech recognition engine is selected for textualization; the cloud speech recognition engine converts the speech directly into text from the input audio information and preserves the timing information linking the text to the audio;
in step F, the manual re-speaking mode is: the system plays the recording, the operator repeats what is played, and the speech recognition engine recognizes the repetition; the re-spoken audio not only serves as interpretation information during processing but is also recorded in the output file;
in step F, the keyboard modification mode is: for the input speech, the operator takes dictation sentence by sentence using a keyboard input method and converts it into text manually.
10. The method for precisely completing continuous natural speech textualization according to claim 7, characterized in that: the text proofreading in said step G is as follows: proofreading is the process of revising the generated text again and is divided into two steps, basic proofreading and full-text proofreading; the proofreading process generates plain text only and contains no layout information;
The basic proofreading is proofreading based on phrases or sentences: one or several phrases or sentences are combined each time, the combined phrases or sentences are played back, and the corresponding converted text information is displayed;
Proofreading is then performed against the played voice and the displayed text;
The full-text proofreading is proofreading based on paragraphs or the full text: one or several pieces of voice information from the basic proofreading are combined each time, the combined speech is played back, and the corresponding text information is displayed;
Proofreading is then performed against the played voice and the displayed text (a sketch of this grouped playback is given below);
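Both proofreading passes amount to grouping recognized units (phrases or sentences for basic proofreading, paragraphs for full-text proofreading), playing back the combined audio span and showing the converted text. A minimal, assumed sketch of that grouping follows; play_span and the unit dictionaries are placeholders, not the claimed platform's interface.

def proofread_groups(units, group_size, play_span):
    # units: time-ordered list of {"start": s, "end": s, "text": str};
    # group_size: phrases/sentences per pass (basic) or paragraphs per pass (full text);
    # play_span: placeholder callback that plays the audio between two offsets.
    for i in range(0, len(units), group_size):
        group = units[i:i + group_size]
        play_span(group[0]["start"], group[-1]["end"])    # play the combined audio
        shown = "".join(u["text"] for u in group)         # display the converted text
        corrected = input(f"[{shown}] correction (Enter keeps it): ") or shown
        yield {"start": group[0]["start"], "end": group[-1]["end"], "text": corrected}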
The editing and typesetting in said step I are specifically: adding punctuation, titles, line-break segmentation and spaces to the confirmed text information;
The generation and storage of the target file in said step J are as follows: the proofread and typeset document is packed together with the video and voice information to generate a three-dimensional integrated file for output and storage; and the target file preserves the corresponding timing relationship between the text and the video and voice information (a sketch of one possible packaging follows).
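Step J binds text, audio and video through their timing. A minimal sketch of one possible packaging is given below, using a ZIP container with a JSON manifest; the layout and field names are assumptions for illustration, not the patented three-dimensional file format.

import json
import zipfile

def pack_target_file(out_path, video_file, audio_file, sentences):
    # sentences: list of {"start": s, "end": s, "text": str} after proofreading
    # and typesetting; each entry keeps its audio/video time span.
    manifest = {"video": video_file, "audio": audio_file, "sentences": sentences}
    with zipfile.ZipFile(out_path, "w") as z:
        z.write(video_file)                 # original-form video stream
        z.write(audio_file)                 # separated audio stream
        z.writestr("manifest.json",
                   json.dumps(manifest, ensure_ascii=False, indent=2))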
CN201510364578.XA 2015-06-26 2015-06-26 A kind of accurate processing system and method for completing continuous natural-sounding textual Active CN105159870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510364578.XA CN105159870B (en) 2015-06-26 2015-06-26 A kind of accurate processing system and method for completing continuous natural-sounding textual

Publications (2)

Publication Number Publication Date
CN105159870A true CN105159870A (en) 2015-12-16
CN105159870B CN105159870B (en) 2018-06-29

Family

ID=54800729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510364578.XA Active CN105159870B (en) 2015-06-26 2015-06-26 A kind of accurate processing system and method for completing continuous natural-sounding textual

Country Status (1)

Country Link
CN (1) CN105159870B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1371090A (en) * 2002-03-25 2002-09-25 苏州孔雀电器集团有限责任公司 Method of converting phonetic file into text file
CN101382937A (en) * 2008-07-01 2009-03-11 深圳先进技术研究院 Multimedia resource processing method based on speech recognition and on-line teaching system thereof
CN101539929A (en) * 2009-04-17 2009-09-23 无锡天脉聚源传媒科技有限公司 Method for indexing TV news by utilizing computer system
US20110054901A1 (en) * 2009-08-28 2011-03-03 International Business Machines Corporation Method and apparatus for aligning texts
CN103336773A (en) * 2012-05-18 2013-10-02 徐信 System and method for audio and video speech processing and retrieval
CN103902531A (en) * 2012-12-30 2014-07-02 上海能感物联网有限公司 Audio and video recording and broadcasting method for Chinese and foreign language automatic real-time voice translation and subtitle annotation
CN104252861A (en) * 2014-09-11 2014-12-31 百度在线网络技术(北京)有限公司 Video voice conversion method, video voice conversion device and server

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106997764A (en) * 2016-01-26 2017-08-01 阿里巴巴集团控股有限公司 A kind of instant communicating method and instantaneous communication system based on speech recognition
CN106997764B (en) * 2016-01-26 2021-07-27 阿里巴巴集团控股有限公司 Instant messaging method and instant messaging system based on voice recognition
CN105845129A (en) * 2016-03-25 2016-08-10 乐视控股(北京)有限公司 Method and system for dividing sentences in audio and automatic caption generation method and system for video files
CN105957531A (en) * 2016-04-25 2016-09-21 上海交通大学 Speech content extracting method and speech content extracting device based on cloud platform
CN106205632A (en) * 2016-07-18 2016-12-07 广州视睿电子科技有限公司 Voice converts the method and apparatus of handwriting
CN106205632B (en) * 2016-07-18 2019-07-09 广州视睿电子科技有限公司 The method and apparatus of voice conversion handwriting
CN106601254A (en) * 2016-12-08 2017-04-26 广州神马移动信息科技有限公司 Information inputting method, information inputting device and calculation equipment
US10796699B2 (en) 2016-12-08 2020-10-06 Guangzhou Shenma Mobile Information Technology Co., Ltd. Method, apparatus, and computing device for revision of speech recognition results
CN107632720A (en) * 2017-03-08 2018-01-26 深圳市中易腾达科技股份有限公司 A kind of multifunction speech keyboard and application system
CN108805294A (en) * 2017-04-26 2018-11-13 联想新视界(天津)科技有限公司 A kind of processing method and processing device of tutorial message for target device
CN107342080A (en) * 2017-07-04 2017-11-10 厦门创客猫网络科技有限公司 The system and method that a kind of meeting scene is synchronously taken down in short-hand
CN107342080B (en) * 2017-07-04 2020-07-24 厦门创客猫网络科技有限公司 Conference site synchronous shorthand system and method
CN107731230A (en) * 2017-11-10 2018-02-23 北京联华博创科技有限公司 A kind of court's trial writing-record system and method
CN109949828A (en) * 2017-12-20 2019-06-28 北京君林科技股份有限公司 A kind of text method of calibration and device
CN108074570A (en) * 2017-12-26 2018-05-25 安徽声讯信息技术有限公司 Surface trimming, transmission, the audio recognition method preserved
WO2019214359A1 (en) * 2018-05-10 2019-11-14 腾讯科技(深圳)有限公司 Data processing method based on simultaneous interpretation, computer device, and storage medium
CN109639718A (en) * 2019-01-07 2019-04-16 苏州创腾软件有限公司 A kind of method and system recording experimental data in real time based on mobile terminal
CN109658759A (en) * 2019-02-26 2019-04-19 安康学院 A kind of novel electrician's electronic experimental reports and appraisal system
CN110349576A (en) * 2019-05-16 2019-10-18 国网上海市电力公司 Power system operation instruction executing method, apparatus and system based on speech recognition
CN110263313A (en) * 2019-06-19 2019-09-20 安徽声讯信息技术有限公司 A kind of man-machine coordination edit methods for meeting shorthand
CN110265026A (en) * 2019-06-19 2019-09-20 安徽声讯信息技术有限公司 A kind of meeting shorthand system and meeting stenography method
CN110265026B (en) * 2019-06-19 2021-07-27 安徽声讯信息技术有限公司 Conference shorthand system and conference shorthand method
CN110263313B (en) * 2019-06-19 2021-08-24 安徽声讯信息技术有限公司 Man-machine collaborative editing method for conference shorthand
CN110826301B (en) * 2019-09-19 2023-12-26 厦门快商通科技股份有限公司 Punctuation mark adding method, punctuation mark adding system, mobile terminal and storage medium
CN110826301A (en) * 2019-09-19 2020-02-21 厦门快商通科技股份有限公司 Punctuation mark adding method, system, mobile terminal and storage medium
CN110991279B (en) * 2019-11-20 2023-08-22 北京灵伴未来科技有限公司 Document Image Analysis and Recognition Method and System
CN110991279A (en) * 2019-11-20 2020-04-10 北京灵伴未来科技有限公司 Document image analysis and recognition method and system
CN116569225A (en) * 2020-08-24 2023-08-08 三菱电机楼宇解决方案株式会社 Document image recognition system
WO2022135254A1 (en) * 2020-12-22 2022-06-30 华为技术有限公司 Text editing method, electronic device and system
CN115329391A (en) * 2022-10-18 2022-11-11 成都卫士通信息产业股份有限公司 Protection method, device, equipment and medium for text database
CN117253485A (en) * 2023-11-20 2023-12-19 翌东寰球(深圳)数字科技有限公司 Data processing method, device, equipment and storage medium
CN117253485B (en) * 2023-11-20 2024-03-08 翌东寰球(深圳)数字科技有限公司 Data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN105159870B (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN105159870A (en) Processing system for precisely completing continuous natural speech textualization and method for precisely completing continuous natural speech textualization
CN105245917B (en) A kind of system and method for multi-media voice subtitle generation
US20190196666A1 (en) Systems and Methods Document Narration
US5875427A (en) Voice-generating/document making apparatus voice-generating/document making method and computer-readable medium for storing therein a program having a computer execute voice-generating/document making sequence
CN103336773B (en) System and method for audio and video speech processing and retrieval
US8346557B2 (en) Systems and methods document narration
CN107968959B (en) Knowledge point segmentation method for teaching video
CN107516509B (en) Voice database construction method and system for news broadcast voice synthesis
CN109285537B (en) Acoustic model establishing method, acoustic model establishing device, acoustic model synthesizing method, acoustic model synthesizing device, acoustic model synthesizing equipment and storage medium
CN107767871B (en) Text display method, terminal and server
CN105704538A (en) Method and system for generating audio and video subtitles
US20100324904A1 (en) Systems and methods for multiple language document narration
WO2018187234A1 (en) Hands-free annotations of audio text
CN105185377B (en) A kind of voice-based document generating method and device
CN110675853B (en) Emotion voice synthesis method and device based on deep learning
CN101529500A (en) Content summarizing system, method, and program
CN110781328A (en) Video generation method, system, device and storage medium based on voice recognition
CN110691258A (en) Program material manufacturing method and device, computer storage medium and electronic equipment
CN103632663A (en) HMM-based method of Mongolian speech synthesis and front-end processing
CN103885924A (en) Field-adaptive automatic open class subtitle generating system and field-adaptive automatic open class subtitle generating method
CN102136001B (en) Multi-media information fuzzy search method
CN1811912B (en) Minor sound base phonetic synthesis method
KR20210138311A (en) Apparatus for generating parallel corpus data between text language and sign language and method therefor
CN116129868A (en) Method and system for generating structured photo
CN101539909A (en) Method and device for translating Thai into Romanization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20210915

Address after: Room 1002-1003, Pacific International Building, 106 Zhichun Road, Haidian District, Beijing 100086

Patentee after: Beijing Zhongke Mosi Technology Co.,Ltd.

Address before: 102206 No. 7 Nong Road, Changping District, Beijing

Patentee before: Xu Xin