CN104751846B

CN104751846B - Method and device for speech-to-text conversion

Info

Publication number: CN104751846B
Application number: CN201510126575.2A
Authority: CN
Inventors: 王彦文
Original assignee: Nubia Technology Co Ltd
Current assignee: Nubia Technology Co Ltd
Priority date: 2015-03-20
Filing date: 2015-03-20
Publication date: 2019-03-01
Anticipated expiration: 2035-03-20
Also published as: CN104751846A

Abstract

The invention discloses a method for speech-to-text conversion, comprising: obtaining an audio file; converting the speech contained in the audio file into text in the order of the audio file's timeline to generate first text information; converting the recording marks in the audio file (marks placed during recording) into text marks; and inserting the text marks at the corresponding positions in the first text information to generate second text information. The invention also discloses a device for speech-to-text conversion. With this technical solution, the converted text carries marks, which makes it easier for people to view, edit, and otherwise work with the text.

Description

Method and device for speech-to-text conversion

Technical field

The present invention relates to the field of communication technology, and in particular to a method and device for speech-to-text conversion.

Background art

With the rapid development of the information age, the information input/output functions of electronic devices have become increasingly important. People can record audio with a mobile phone, a voice recorder pen, or another device with a recording function, which makes it easy to capture information. During recording, a marking function can also be used. For example, when attending a lecture, a person can record while listening and place a mark when important content is being recorded; a recording file is generated at the end, and when the person later plays back the lecture from that file, playback can start directly at the mark without listening to the whole recording. Likewise, during a meeting, a person can record while the discussion proceeds and mark important content as it is recorded; when the meeting is later reviewed from the recording file, playback can start directly at the mark without listening to the whole recording.

Speech recognition technology is used more and more widely in the prior art, and converting a voice file into a text file for display is already possible. However, when the prior art converts a marked voice file into a text file, it does not recognize the recording marks and simply converts the voice file into text. This makes the text file inconvenient to read and edit: a person who wants to see the content previously marked in the voice file (that is, the key points of the recording) cannot find it quickly and has to read slowly from the beginning of the text.

The above content is provided only to aid understanding of the technical solution of the present invention and does not constitute an admission that it is prior art.

Summary of the invention

The main purpose of the present invention is to provide a method and device for speech-to-text conversion that mark the converted text, making it easier for people to view, edit, and otherwise work with the text.

To achieve the above purpose, the present invention provides a method for speech-to-text conversion, the method comprising:

Obtaining an audio file;

Converting the speech contained in the audio file into text in the order of the audio file's timeline to generate first text information;

Converting the recording marks in the audio file into text marks;

Inserting the text marks at the corresponding positions in the first text information to generate second text information.

Preferably, the step of converting the recording marks in the audio file into text marks comprises:

Obtaining the recording marks in the audio file;

Looking up, in a preset mapping table of recording marks and text marks, the text mark corresponding to each obtained recording mark.

Preferably, after the step of inserting the text marks into the first text information to generate the second text information, the method further comprises:

Highlighting the text content between two identical and adjacent text marks in the second text information to generate third text information.

Preferably, the step of highlighting the text content between two identical and adjacent text marks in the second text information to generate third text information comprises:

Reading the second text information sequentially;

If a text mark is currently read, judging whether the text mark currently read is identical to the text mark read last time;

If the text mark currently read is identical to the text mark read last time, highlighting the text content between the text mark currently read and the text mark read last time to generate third text information.

Preferably, the step of highlighting the text content between the text mark currently read and the text mark read last time to generate third text information comprises:

Looking up, in a preset mapping table of text marks and highlighting modes, the highlighting mode corresponding to the text mark currently read;

Highlighting the text content between the text mark currently read and the text mark read last time according to the highlighting mode found, to generate third text information.

In addition, to achieve the above purpose, the present invention also provides a device for speech-to-text conversion, comprising:

An acquisition module, configured to obtain an audio file;

A first generation module, configured to convert the speech contained in the audio file into text in the order of the audio file's timeline to generate first text information;

A first conversion module, configured to convert the recording marks in the audio file into text marks;

A second generation module, configured to insert the text marks at the corresponding positions in the first text information to generate second text information.

Preferably, the first conversion module comprises:

A first acquisition unit, configured to obtain the recording marks in the audio file;

A first lookup unit, configured to look up, in a preset mapping table of recording marks and text marks, the text mark corresponding to each obtained recording mark.

Preferably, the device further comprises:

A third generation module, configured to highlight the text content between two identical and adjacent text marks in the second text information to generate third text information.

Preferably, the third generation module comprises:

A reading unit, configured to read the second text information sequentially;

A judging unit, configured to judge, when the reading unit currently reads a text mark, whether the text mark currently read is identical to the text mark read last time;

A highlighting unit, configured to highlight, when the text mark currently read is identical to the text mark read last time, the text content between the text mark currently read and the text mark read last time to generate third text information.

Preferably, the highlighting unit comprises:

A second lookup unit, configured to look up, when the text mark currently read is identical to the text mark read last time, the highlighting mode corresponding to the text mark currently read in a preset mapping table of text marks and highlighting modes;

A highlighting subunit, configured to highlight the text content between the text mark currently read and the text mark read last time according to the highlighting mode found by the second lookup unit, to generate third text information.

The present invention obtains an audio file; converts the speech contained in the audio file into text in the order of the audio file's timeline to generate first text information; converts the recording marks in the audio file into text marks; and inserts the text marks at the corresponding positions in the first text information to generate second text information. Because the recording marks in the audio file are converted into text marks and inserted at the corresponding positions when the audio file is converted into text, people can more easily view, edit, and otherwise work with the converted text.

Brief description of the drawings

Fig. 1 is a flow diagram of the first embodiment of the speech-to-text conversion method of the present invention;

Fig. 2 is a detailed flow diagram of step S30 in Fig. 1;

Fig. 3 is a flow diagram of the second embodiment of the speech-to-text conversion method of the present invention;

Fig. 4 is a detailed flow diagram of step S50 in Fig. 3;

Fig. 5 is a detailed flow diagram of step S53 in Fig. 4;

Fig. 6 is a functional block diagram of the first embodiment of the speech-to-text conversion device of the present invention;

Fig. 7 is a functional block diagram of the second embodiment of the speech-to-text conversion device of the present invention;

Fig. 8 is a detailed structural diagram of the third generation module in Fig. 7.

The realization of the purpose, the functional characteristics, and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.

Detailed description of the embodiments

It should be understood that the specific embodiments described here merely illustrate the present invention and are not intended to limit it.

Referring to Fig. 1, Fig. 1 is a flow diagram of the first embodiment of the speech-to-text conversion method of the present invention.

The present invention provides a method for speech-to-text conversion, comprising:

S10: Obtain an audio file.

In step S10, the audio file can be obtained by wired or wireless means; for example, it can be downloaded from the Internet, such as a lecture audio file downloaded from the Internet. The audio file contains recording marks.

S20: Convert the speech contained in the audio file into text in the order of the audio file's timeline to generate first text information.

In step S20, speech is converted into text by a speech-to-text (Speech To Text, STT) function or algorithm: the speech is extracted successively in the order of the audio file's timeline, the extracted speech is converted into text, and the text produced by each conversion is combined into the first text information.
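The timeline ordering in step S20 can be illustrated with a minimal Python sketch. It assumes the STT engine has already returned recognized pieces of speech together with their start times; the `Segment` type and the `build_first_text` function are hypothetical names introduced only for this illustration, and the source does not prescribe any particular STT interface.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized stretch of speech (hypothetical structure)."""
    start: float  # seconds from the beginning of the audio file
    text: str     # text recognized for this stretch


def build_first_text(segments):
    """Combine recognized segments, in timeline order, into the first text information."""
    ordered = sorted(segments, key=lambda s: s.start)
    return " ".join(s.text for s in ordered)


# Segments may arrive out of order; the timeline decides the final order.
segments = [Segment(4.0, "second sentence."), Segment(0.0, "First sentence,")]
first_text = build_first_text(segments)
```

Sorting on the start time keeps the result independent of the order in which the recognizer happens to deliver its output.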

S30: Convert the recording marks in the audio file into text marks.

In step S30, the recording marks in the audio file are converted into text marks. Text marks can take many forms, such as icon marks of various colors or shapes.

S40: Insert the text marks at the corresponding positions in the first text information to generate second text information.

In step S40, each text mark is inserted at the corresponding position in the first text information according to the position, in the audio file, of the recording mark to which it corresponds, generating the second text information. The second text information thus contains both the text converted from speech and the text marks converted from recording marks.
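One way to picture the insertion in step S40 is to merge the recognized text and the text marks on a common time axis. The sketch below is only an illustration under stated assumptions: each piece of text is a `(start_seconds, text)` pair, each mark is a `(time_seconds, text_mark)` pair, and a mark that coincides with the start of a piece of text is placed in front of it. The function name `build_second_text` is hypothetical; the source does not specify how positions are represented.

```python
def build_second_text(segments, marks):
    """Merge timed text and timed text marks into the second text information.

    segments: list of (start_seconds, text) pairs
    marks:    list of (time_seconds, text_mark) pairs
    """
    # Kind 0 (mark) sorts before kind 1 (text) when the times are equal,
    # so a mark placed exactly at the start of a sentence precedes it.
    events = [(t, 1, text) for t, text in segments]
    events += [(t, 0, mark) for t, mark in marks]
    events.sort(key=lambda e: (e[0], e[1]))
    return " ".join(e[2] for e in events)


segments = [(0.0, "Welcome."), (5.0, "Key point."), (9.0, "Other remarks.")]
marks = [(5.0, "★"), (9.0, "★")]
second_text = build_second_text(segments, marks)
```

In this example the two star marks end up enclosing "Key point.", which is the situation the later highlighting step relies on.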

In this embodiment of the present invention, during speech-to-text conversion, the speech contained in the audio file is converted into text to generate the first text information, the recording marks in the audio file are converted into text marks, and the converted text marks are then inserted at the corresponding positions in the first text information to generate the second text information. The generated second text information contains both the text converted from speech and the text marks converted from recording marks. The user can conveniently view, edit, and otherwise work with the second text information; for example, by looking at the text marks, the user can see at a glance where recording marks were previously made, without having to read through the second text information from the beginning.

Further, as shown in Fig. 2, step S30 comprises:

S31: Obtain the recording marks in the audio file.

S32: Look up, in a preset mapping table of recording marks and text marks, the text mark corresponding to each obtained recording mark.

The mapping table of recording marks and text marks can be preset according to actual needs, as shown in Table 1.

Table 1:

Recording mark	Text mark
Mark A	Five-pointed star
Mark B	Red circle
Mark C	Green triangle
……	……

If the recording mark obtained in step S31 is mark A, then in step S32 the preset mapping table of recording marks and text marks is consulted, and the text mark corresponding to mark A is found to be a five-pointed star.

The mapping table of recording marks and text marks can also be updated at any time according to actual needs, so that it better matches the user's habits.
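The table lookup of steps S31 and S32 amounts to a dictionary lookup. The following Python sketch uses the three rows of Table 1; the identifiers `RECORDING_TO_TEXT_MARK` and `to_text_mark` are hypothetical, and returning `None` for an unknown mark is an assumption, since the source does not say what happens when a recording mark is missing from the table.

```python
# Preset mapping table of recording marks and text marks (rows of Table 1).
RECORDING_TO_TEXT_MARK = {
    "A": "five-pointed star",
    "B": "red circle",
    "C": "green triangle",
}


def to_text_mark(recording_mark, table=RECORDING_TO_TEXT_MARK):
    """Return the text mark for a recording mark, or None if it is not in the table."""
    return table.get(recording_mark)


found = to_text_mark("A")

# The table can be updated at any time; here a copy is changed so that
# mark B maps to a different, purely illustrative text mark.
custom_table = dict(RECORDING_TO_TEXT_MARK, B="blue square")
updated = to_text_mark("B", custom_table)
```

Keeping the table as plain data is what makes the "update at any time" behavior described above straightforward.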

Referring to Fig. 3, Fig. 3 is a flow diagram of the second embodiment of the speech-to-text conversion method of the present invention.

Based on the first embodiment of the speech-to-text conversion method described above, after step S40 the method further comprises:

S50: Highlight the text content between two identical and adjacent text marks in the second text information to generate third text information.

In step S50, highlighting the text content between two identical and adjacent text marks means that the second text information can be edited automatically: the text corresponding to the speech between two identical and adjacent recording marks made in the audio file is highlighted automatically. The highlighting mode can be, for example, bold font or red font. When the user views or edits the third text information, the highlighted text content can be seen at a glance, which improves efficiency.

Further, as shown in Fig. 4, step S50 comprises:

S51: Read the second text information sequentially.

S52: If a text mark is currently read, judge whether the text mark currently read is identical to the text mark read last time; if so, execute step S53.

In step S52, if a text mark is currently read, it can be added to a list of text marks already read, and the text mark read last time is found from that list. It is then judged whether the text mark currently read is identical to the text mark read last time: if so, step S53 is executed; if not, reading of the second text information continues from the position of the text mark currently read.

S53: Highlight the text content between the text mark currently read and the text mark read last time to generate third text information.

In step S53, highlighting the text content between the text mark currently read and the text mark read last time means that the second text information can be edited automatically: the text corresponding to the speech between two identical and adjacent recording marks made in the audio file is highlighted automatically. The highlighting mode can be, for example, bold or red. When the user views or edits the third text information, the highlighted text content can be seen at a glance, which improves efficiency.
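Steps S51 to S53 can be sketched as a single pass over the second text information. The Python example below assumes the second text information is available as a list of `("mark", name)` and `("text", string)` tuples; this token representation and the name `highlight_spans` are hypothetical. The sketch follows the description literally: every mark read is compared with the mark read immediately before it, so three identical marks in a row would produce two spans. Whether a closing mark should also open a new span is not specified in the source.

```python
def highlight_spans(tokens):
    """Find the token ranges lying between two adjacent identical text marks.

    tokens: list of ("mark", name) or ("text", string) tuples in reading order.
    Returns a list of (mark_name, first_index, last_index) tuples.
    """
    spans = []
    read_marks = []  # list of text marks already read, as (index, name)
    for i, (kind, value) in enumerate(tokens):
        if kind != "mark":
            continue
        if read_marks:
            last_index, last_name = read_marks[-1]
            # Identical to the mark read last time, with text in between.
            if last_name == value and i - last_index > 1:
                spans.append((value, last_index + 1, i - 1))
        read_marks.append((i, value))
    return spans


tokens = [
    ("text", "Intro."),
    ("mark", "star"),
    ("text", "Key point."),
    ("mark", "star"),
    ("text", "Aside."),
    ("mark", "circle"),
    ("text", "More detail."),
    ("mark", "circle"),
]
spans = highlight_spans(tokens)
```

Here "Key point." (index 2) and "More detail." (index 6) are selected, while "Aside." sits between a star and a circle and is left alone because the two marks differ.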

Further, as shown in Fig. 5, step S53 comprises:

S531: Look up, in a preset mapping table of text marks and highlighting modes, the highlighting mode corresponding to the text mark currently read.

The mapping table of text marks and highlighting modes can be preset according to actual needs, as shown in Table 2.

Table 2:

Text mark	Highlighting mode
Five-pointed star	Bold font
Red circle	Red font
Green triangle	Green font
……	……

If the text mark currently read is a red circle, the preset mapping table of text marks and highlighting modes is consulted, and the highlighting mode corresponding to the red circle is found to be red font.

The mapping table of text marks and highlighting modes can also be updated at any time according to actual needs, so that it better matches the user's habits.

S532: Highlight the text content between the text mark currently read and the text mark read last time according to the highlighting mode found, to generate third text information.

In step S532, the text content between the text mark currently read and the text mark read last time is highlighted according to the highlighting mode found in step S531. The second text information can thus be edited automatically to generate the third text information. When the user views or edits the third text information, the highlighted text content can be seen at a glance, which improves efficiency.
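Steps S531 and S532 can be combined into one short Python sketch that looks up the highlighting mode from Table 2 and applies it. Rendering the result as HTML-like tags, writing marks as bracketed names, and the identifiers `HIGHLIGHT_MODE`, `WRAPPERS`, and `build_third_text` are all assumptions made for illustration; the source names the modes (bold font, red font, green font) but not an output format.

```python
# Preset mapping table of text marks and highlighting modes (rows of Table 2).
HIGHLIGHT_MODE = {
    "five-pointed star": "bold font",
    "red circle": "red font",
    "green triangle": "green font",
}

# Assumed rendering of each highlighting mode as HTML-like markup.
WRAPPERS = {
    "bold font": "<b>{}</b>",
    "red font": '<span style="color:red">{}</span>',
    "green font": '<span style="color:green">{}</span>',
}


def build_third_text(tokens):
    """tokens: list of ("mark", name) or ("text", string) tuples in reading order."""
    out = []
    last_mark = None      # text mark read last time
    last_mark_pos = None  # its position in `out`
    for kind, value in tokens:
        if kind == "mark":
            if last_mark == value and value in HIGHLIGHT_MODE:
                wrapper = WRAPPERS[HIGHLIGHT_MODE[value]]  # S531: look up the mode
                for j in range(last_mark_pos + 1, len(out)):
                    out[j] = wrapper.format(out[j])        # S532: apply it
            last_mark, last_mark_pos = value, len(out)
            out.append("[" + value + "]")
        else:
            out.append(value)
    return " ".join(out)


tokens = [
    ("text", "Intro."),
    ("mark", "red circle"),
    ("text", "Key point."),
    ("mark", "red circle"),
    ("text", "End."),
]
third_text = build_third_text(tokens)
```

Because only the text between the most recently read mark and the current mark is rewritten, text outside a matching pair stays unchanged.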

Referring to Fig. 6, Fig. 6 is a functional block diagram of the first embodiment of the speech-to-text conversion device of the present invention. The device comprises:

An acquisition module 10, configured to obtain an audio file;

A first generation module 20, configured to convert the speech contained in the audio file into text in the order of the audio file's timeline to generate first text information;

A first conversion module 30, configured to convert the recording marks in the audio file into text marks;

A second generation module 40, configured to insert the text marks at the corresponding positions in the first text information to generate second text information.

The acquisition module 10 can obtain the audio file by wired or wireless means; for example, the audio file can be downloaded from the Internet, such as a lecture audio file downloaded from the Internet. The audio file contains recording marks.

The first generation module 20 converts speech into text by a speech-to-text (Speech To Text, STT) function or algorithm: the speech is extracted successively in the order of the audio file's timeline, the extracted speech is converted into text, and the text produced by each conversion is combined into the first text information.

The first conversion module 30 converts the recording marks in the audio file into text marks. Text marks can take many forms, such as icon marks of various colors or shapes.

The second generation module 40 inserts each text mark at the corresponding position in the first text information according to the position, in the audio file, of the recording mark to which it corresponds, generating the second text information. The second text information thus contains both the text converted from speech and the text marks converted from recording marks.
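The cooperation of modules 20, 30, and 40 can be sketched as a small Python class. This is an illustration under assumptions rather than the device's actual structure: the class name `SpeechToTextDevice`, the injected `transcribe` and `read_marks` callables, and the dictionary used as a stand-in for an audio file are all hypothetical, and no real STT library is called.

```python
class SpeechToTextDevice:
    """Sketch of how modules 20, 30, and 40 could cooperate."""

    def __init__(self, transcribe, read_marks, mark_table):
        self.transcribe = transcribe  # stands in for first generation module 20
        self.read_marks = read_marks  # stands in for first acquisition unit 31
        self.mark_table = mark_table  # table consulted by unit 32

    def convert(self, audio):
        # Module 20: timed text, ordered along the audio timeline.
        segments = sorted(self.transcribe(audio))
        # Module 30: recording marks converted to text marks via the table.
        marks = [(t, self.mark_table[m]) for t, m in self.read_marks(audio)]
        # Module 40: insert each text mark at its position on the timeline.
        events = [(t, 1, text) for t, text in segments]
        events += [(t, 0, mark) for t, mark in marks]
        events.sort(key=lambda e: (e[0], e[1]))
        return " ".join(e[2] for e in events)


# A dictionary stands in for an audio file obtained by acquisition module 10.
audio = {
    "segments": [(0.0, "Hello."), (3.0, "Important.")],
    "marks": [(3.0, "A")],
}
device = SpeechToTextDevice(
    transcribe=lambda a: a["segments"],
    read_marks=lambda a: a["marks"],
    mark_table={"A": "★"},
)
second_text = device.convert(audio)
```

Passing the recognizer and the mark reader in as callables keeps the sketch independent of any specific audio format or recognition engine.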

In this embodiment of the present invention, during speech-to-text conversion, the first generation module 20 converts the speech contained in the audio file into text to generate the first text information, the first conversion module 30 converts the recording marks in the audio file into text marks, and the second generation module 40 then inserts the converted text marks at the corresponding positions in the first text information to generate the second text information. The generated second text information contains both the text converted from speech and the text marks converted from recording marks. The user can conveniently view, edit, and otherwise work with the second text information; for example, by looking at the text marks, the user can see at a glance where recording marks were previously made, without having to read through the second text information from the beginning.

Further, the first conversion module 30 comprises: a first acquisition unit 31, configured to obtain the recording marks in the audio file; and a first lookup unit 32, configured to look up, in a preset mapping table of recording marks and text marks, the text mark corresponding to each obtained recording mark.

The mapping table of recording marks and text marks can be preset according to actual needs, as shown in Table 1 above.

If the recording mark obtained by the first acquisition unit 31 is mark A, the first lookup unit 32 consults the preset mapping table of recording marks and text marks and finds that the text mark corresponding to mark A is a five-pointed star.

Referring to Fig. 7, Fig. 7 is a functional block diagram of the second embodiment of the speech-to-text conversion device of the present invention.

Based on the first embodiment of the speech-to-text conversion device of the present invention described above, the device further comprises:

A third generation module 50, configured to highlight the text content between two identical and adjacent text marks in the second text information to generate third text information.

The third generation module 50 highlights the text content between two identical and adjacent text marks, so that the second text information can be edited automatically: the text corresponding to the speech between two identical and adjacent recording marks made in the audio file is highlighted automatically. The highlighting mode can be, for example, bold font or red font. When the user views or edits the third text information, the highlighted text content can be seen at a glance, which improves efficiency.

Further, as shown in Fig. 8, the third generation module 50 comprises:

A reading unit 51, configured to read the second text information sequentially;

A judging unit 52, configured to judge, when the reading unit 51 currently reads a text mark, whether the text mark currently read is identical to the text mark read last time;

A highlighting unit 53, configured to highlight, when the text mark currently read is identical to the text mark read last time, the text content between the text mark currently read and the text mark read last time to generate third text information.

If the reading unit 51 currently reads a text mark, the judging unit 52 can add the text mark currently read to a list of text marks already read, find the text mark read last time from that list, and then judge whether the text mark currently read is identical to the text mark read last time. If they are not identical, the reading unit 51 continues reading the second text information from the position of the text mark currently read.

When the text mark currently read is identical to the text mark read last time, the highlighting unit 53 highlights the text content between the text mark currently read and the text mark read last time, so that the second text information can be edited automatically: the text corresponding to the speech between two identical and adjacent recording marks made in the audio file is highlighted automatically. The highlighting mode can be, for example, bold or red. When the user views or edits the third text information, the highlighted text content can be seen at a glance, which improves efficiency.

Further, this highlights unit 53 and includes:

Second searching unit, when the text mark for reading in this prior is identical as the text mark that the last time reads, According to preset text mark and mode mapping table is highlighted, searches this text mark currently read is corresponding and highlight Mode；

Subelement is highlighted, for will be between the text mark currently read and the last text mark read Word content is highlighted according to the mode that highlights that second searching unit is searched, to generate third text information.

Text label can be preset according to actual needs and highlights mode mapping table, as shown in above-mentioned table two.

If the text mark currently read is red circle, the second searching unit is in the preset text mark and dashes forward It is red font that display mode mapping table, which finds the corresponding mode that highlights of the red circle, out.

This highlights subelement and highlights mode to the text currently read according to what second searching unit was found Word content between this label and the last text mark read is highlighted, can automatically to the second text information into Edlin generates third text information.It, can be very clear when user the operation such as checks, edits to the third text information View highlighted content of text, improve efficiency.

The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A method for speech-to-text conversion, characterized in that the method comprises:

Obtaining an audio file;

Converting the speech contained in the audio file into text in the order of the audio file's timeline to generate first text information;

Inserting the text mark at the corresponding position in the first text information according to the position, in the audio file, of the recording mark corresponding to the text mark, to generate second text information, wherein the second text information contains both the text converted from speech and the text mark converted from the recording mark.

2. The method for speech-to-text conversion according to claim 1, characterized in that the step of converting the recording mark in the audio file into a text mark comprises:

Obtaining the recording mark in the audio file;

Looking up, in a preset mapping table of recording marks and text marks, the text mark corresponding to the obtained recording mark.

3. The method for speech-to-text conversion according to claim 2, characterized in that, after the step of inserting the text mark into the first text information to generate the second text information, the method further comprises:

Highlighting the text content between two identical and adjacent text marks in the second text information to generate third text information.

4. The method for speech-to-text conversion according to claim 3, characterized in that the step of highlighting the text content between two identical and adjacent text marks in the second text information to generate third text information comprises:

Reading the second text information sequentially;

If a text mark is currently read, judging whether the text mark currently read is identical to the text mark read last time;

If the text mark currently read is identical to the text mark read last time, highlighting the text content between the text mark currently read and the text mark read last time to generate third text information.

5. The method for speech-to-text conversion according to claim 4, characterized in that the step of highlighting the text content between the text mark currently read and the text mark read last time to generate third text information comprises:

Looking up, in a preset mapping table of text marks and highlighting modes, the highlighting mode corresponding to the text mark currently read;

Highlighting the text content between the text mark currently read and the text mark read last time according to the highlighting mode found, to generate third text information.

6. A device for speech-to-text conversion, characterized by comprising:

An acquisition module, configured to obtain an audio file;

A first generation module, configured to convert the speech contained in the audio file into text in the order of the audio file's timeline to generate first text information;

A second generation module, configured to insert the text mark at the corresponding position in the first text information according to the position, in the audio file, of the recording mark corresponding to the text mark, to generate second text information, wherein the second text information contains both the text converted from speech and the text mark converted from the recording mark.

7. The device for speech-to-text conversion according to claim 6, characterized in that the first conversion module comprises:

A first lookup unit, configured to look up, in a preset mapping table of recording marks and text marks, the text mark corresponding to the obtained recording mark.

8. The device for speech-to-text conversion according to claim 7, characterized in that the device further comprises:

A third generation module, configured to highlight the text content between two identical and adjacent text marks in the second text information to generate third text information.

9. The device for speech-to-text conversion according to claim 8, characterized in that the third generation module comprises:

A reading unit, configured to read the second text information sequentially;

A judging unit, configured to judge, when the reading unit currently reads a text mark, whether the text mark currently read is identical to the text mark read last time;

A highlighting unit, configured to highlight, when the text mark currently read is identical to the text mark read last time, the text content between the text mark currently read and the text mark read last time to generate third text information.

10. The device for speech-to-text conversion according to claim 9, characterized in that the highlighting unit comprises:

A second lookup unit, configured to look up, when the text mark currently read is identical to the text mark read last time, the highlighting mode corresponding to the text mark currently read in a preset mapping table of text marks and highlighting modes;

A highlighting subunit, configured to highlight the text content between the text mark currently read and the text mark read last time according to the highlighting mode found by the second lookup unit, to generate third text information.