CN111835529A - Voice processing method and device - Google Patents

Voice processing method and device Download PDF

Info

Publication number
CN111835529A
CN111835529A (application CN201910329441.9A)
Authority
CN
China
Prior art keywords
time
content
processed
speaker
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910329441.9A
Other languages
Chinese (zh)
Inventor
廖彬彬
计华国
唐启明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hytera Communications Corp Ltd
Original Assignee
Hytera Communications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hytera Communications Corp Ltd filed Critical Hytera Communications Corp Ltd
Priority to CN201910329441.9A
Publication of CN111835529A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/02Details
    • H04L12/16Arrangements for providing special services to substations
    • H04L12/18Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L12/1831Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a voice processing method and device. After obtaining voice content to be processed, the speaker identification information associated with it, and the bound time, the method converts the voice content into text content, associates the text content with the speaker identification information and the bound time, determines the position of the text content in a session record based on the bound time, and adds the text content at that position. The session record thus records the text content of multiple speakers in sequence along the axis of the bound time, so that the record can show the course of the session over different periods and the speaking time of each speaker. Because the text content is associated with the speaker identification information and the bound time, the session record can later be searched on at least one of these two, and the required text content retrieved from it.

Description

Voice processing method and device
Technical Field
The present invention relates to the field of data conversion, and more particularly to a voice processing method and apparatus.
Background
During a conference, a minute-taker is traditionally required to record the session as it happens, but manually recording the conference content has problems: the minute-taker's writing speed may not keep up with the pace of the conference, some speech may not be heard clearly, and so on, leaving the session record incomplete. Manual recording is also time-consuming and labor-intensive.
To replace manual recording, the voice content of a conference can currently be collected by terminals at the venue and converted into text content by a speech-to-text tool, automating the session record.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for processing speech, which are used to quickly locate text content in a session record.
The invention provides a voice processing method, which comprises the following steps:
obtaining voice content to be processed;
obtaining speaker identification information associated with the voice content to be processed and the time for binding the voice content to be processed;
converting the voice content to be processed into text content, and associating the text content with the speaker identification information and the binding time;
and determining the position of the text content in a session record based on the bound time, and adding the text content at that position in the session record, so that the session record records the text content of a plurality of speakers in sequence according to the bound time.
Preferably, the obtaining of the speaker identification information associated with the voice content to be processed and the time bound to the voice content to be processed includes:
if it is monitored that a specific control in a terminal corresponding to a speaker is triggered, taking identification information of the terminal as speaker identification information associated with the content to be processed;
and taking the time when a specific control in the terminal is triggered as the time for binding the voice content to be processed.
Preferably, the time when the specific control in the terminal is triggered is one of the time when the speaker starts speaking via the terminal and the time when the speaker finishes speaking via the terminal; the other of the two times is recorded in the session record together with the text content, and the speaker triggers the specific control both when starting and when finishing speaking via the terminal.
Preferably, the method further comprises: and setting a playing control for each text content in the session record, wherein the playing control set for any text content is bound with the to-be-processed voice content of the text content.
Preferably, the method further comprises:
obtaining a keyword for searching the session record;
obtaining text content corresponding to the keywords from the session record;
and displaying the text content corresponding to the keyword by taking the time associated with the text content corresponding to the keyword as an axis.
Preferably, the method further comprises: displaying, in the text content corresponding to the keyword, the keyword and/or words similar to the keyword in a special form.
The present invention also provides a speech processing apparatus, the apparatus comprising:
the first obtaining module is used for obtaining the voice content to be processed;
a second obtaining module, configured to obtain speaker identification information associated with the to-be-processed voice content and time for binding the to-be-processed voice content;
the conversion module is used for converting the voice content to be processed into text content and associating the text content with the speaker identification information and the binding time;
and the determining module is used for determining the position of the text content in the session record based on the bound time, and adding the text content at that position in the session record, so that the session record records the text content of a plurality of speakers in sequence according to the bound time.
Preferably, the second obtaining module includes:
the monitoring unit is used for taking the identification information of the terminal as the identification information of the speaker associated with the content to be processed if the situation that a specific control in the terminal corresponding to the speaker is triggered is monitored;
and the binding unit is used for taking the time when the specific control in the terminal is triggered as the time for binding the voice content to be processed.
Preferably, the time when the specific control in the terminal is triggered is one of the time when the speaker starts speaking via the terminal and the time when the speaker finishes speaking via the terminal; the other of the two times is recorded in the session record together with the text content, and the speaker triggers the specific control both when starting and when finishing speaking via the terminal.
Preferably, the apparatus further comprises:
and the setting module is used for setting a play control for each text content in the session record, where the play control set for any text content is bound to that text content's voice content to be processed.
Preferably, the apparatus further comprises:
a third obtaining module, configured to obtain a keyword for retrieving the session record;
a fourth obtaining module, configured to obtain, from the session record, text content corresponding to the keyword;
and the first display module is used for displaying the text content corresponding to the keyword by taking the time associated with the text content corresponding to the keyword as an axis.
Preferably, the apparatus further comprises:
and the second display module is used for displaying the keywords and/or words similar to the keywords in a special form in the text content corresponding to the keywords.
The invention also provides a processing device having a memory with one or more programs stored therein and a processor, the one or more programs when executed by the processor implementing the above method.
The present invention also provides a computer readable storage medium having stored therein one or more programs which, when executed on at least one processor, implement the above-described method.
According to the above technical solution, after the voice content to be processed, its associated speaker identification information, and the bound time are obtained, the voice content can be converted into text content, the text content associated with the speaker identification information and the bound time, the position of the text content in the session record determined from the bound time, and the text content added at that position. The session record thus records the text content of multiple speakers in sequence along the axis of the bound time, so that the record can show the course of the session over different periods and the speaking time of each speaker. Because the text content is associated with the speaker identification information and the bound time, the session record can later be searched on at least one of the two, and text content in the session record can be located quickly, so the required text content can be obtained from it.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for processing speech according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an application scenario corresponding to a speech processing method provided in an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a binding between a play control and a to-be-processed voice content according to an embodiment of the present invention;
FIG. 4 is a flow chart of another speech processing method provided by the embodiments of the present invention;
FIG. 5 is a diagram illustrating a keyword search according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of another speech processing apparatus according to an embodiment of the present invention.
Detailed Description
The invention provides a voice processing method and device that associate text content with speaker identification information and a bound time, and record the text content of multiple speakers in sequence along the axis of the bound time, so that the course of the session and its content in different time periods are clearly shown.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a speech processing method provided in an embodiment of the present invention. The execution device of the method may be a server or another device with a speech-to-text conversion function. The method includes the following steps:
S101, obtain the voice content to be processed. The voice content to be processed is the voice content that needs to be converted into text; it can be obtained during a conference of two or more speakers, or during a voice call. It can be collected by the terminal a speaker uses and then transmitted by that terminal to the execution device of the method (such as a server).
S102, obtaining speaker identification information associated with the voice content to be processed and the time for binding the voice content to be processed.
In this embodiment, the speaker identification information associated with the voice content to be processed indicates which speaker produced that voice content, so that each utterance corresponds to exactly one speaker. For example, in a conference with multiple speakers, the execution device may obtain multiple pieces of voice content to be processed, spoken not by one speaker but by several; if the speakers cannot be bound to their speech, it is unclear which speaker said what.
To distinguish the speakers, one form of speaker identification information in this embodiment is the identification information of the terminal that collects the speaker's voice content, because a terminal has a unique identifier: for example, the IMEI (International Mobile Equipment Identity) differs between terminals, so a terminal's IMEI can serve as the speaker identification information. The execution device performs real-name binding on the terminal, associating the terminal's identification information with the speaker, so that the terminal's identification information is obtained along with the voice content to be processed and each piece of voice content corresponds to exactly one speaker; no confusion arises about who said what. Besides terminal identification information, speaker identification can also be implemented from voice characteristics: the pitch, intensity, duration, and timbre of each speaker's voice can be stored in advance, and when a speaker talks, recognizing these characteristics confirms who spoke the words.
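The real-name binding described above can be sketched as a small lookup table from a terminal's IMEI to a speaker. This is an illustrative sketch, not the patent's implementation; the class, method names, and IMEI value are all made up.

```python
# A minimal sketch of real-name binding: the server keeps a table from a
# terminal's IMEI to the speaker's name, so the terminal identifier that
# arrives with a voice stream resolves to a speaker.
class SpeakerRegistry:
    def __init__(self):
        self._by_imei = {}

    def bind(self, imei, speaker_name):
        """Real-name binding: one terminal IMEI maps to one speaker."""
        self._by_imei[imei] = speaker_name

    def resolve(self, imei):
        """Return the speaker bound to this terminal, or the raw IMEI if
        no binding exists (so the record still identifies a source)."""
        return self._by_imei.get(imei, imei)

registry = SpeakerRegistry()
registry.bind("356938035643809", "Alice")
```

Falling back to the raw IMEI when no binding exists keeps every entry in the session record attributable to some terminal even before real-name binding is complete.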
Besides the speaker identification information, the time bound to the voice content to be processed is obtained. It indicates when the voice content occurred, and may be, for example, the time when the voice content to be processed starts and/or the time when it ends.
The following description uses a special terminal as an example. The terminal is special because it has a specific control that a speaker must trigger to obtain the right to speak. For example, it may be a terminal with a Push-To-Talk (PTT) button: the button is pressed when the speaker needs to speak and released when the speaker finishes. When the PTT button is pressed, the speaker starts speaking and the terminal starts collecting the voice content to be processed; when the button is released, the speaker has finished, and the terminal has collected one complete segment of voice content to be processed.
In this embodiment, the specific control is regarded as triggered when the PTT button is pressed or released, which means the terminal can collect the speaker's voice content to be processed; the terminal's identification information can therefore be used as the speaker identification information associated with that voice content, and the time when the specific control is triggered as the time bound to it.
For example, when the speaker is ready to start speaking, the specific control in the terminal is triggered (the PTT button is pressed) and the speaker is considered to have started speaking, so the start-of-speaking time can be taken as the trigger time. Alternatively, when the speaker finishes speaking, the control is triggered again (the PTT button is released) and the speaker is considered to have finished, so the end-of-speaking time can be taken as the trigger time.
If one of the start-of-speaking time and the end-of-speaking time is taken as the time when the specific control is triggered, the other time can be recorded at the same time, for example in the session record together with the subsequently converted text content. For instance, if the start-of-speaking time is taken as the trigger time, the end-of-speaking time is recorded in the session record with the text content; the condition is that the speaker triggers the specific control both when starting and when finishing speaking via the terminal.
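The press/release cycle above can be sketched as a speech segment that is opened on press (stamping the start time) and closed on release (stamping the end time), so that one time serves as the bound time and the other is kept alongside the text. All names here are illustrative, not from the patent.

```python
# Sketch of binding a PTT press/release to one speech segment.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechSegment:
    speaker_id: str                    # e.g. terminal identification info
    start_time: float                  # stamped when the PTT button is pressed
    end_time: Optional[float] = None   # stamped when the button is released
    text: str = ""                     # filled in after speech-to-text

def ptt_pressed(speaker_id, now):
    """Button pressed: the speaker starts speaking; open a segment."""
    return SpeechSegment(speaker_id=speaker_id, start_time=now)

def ptt_released(segment, now):
    """Button released: the speaker has finished; close the segment."""
    segment.end_time = now
    return segment

seg = ptt_pressed("IMEI-001", 100.0)
seg = ptt_released(seg, 107.5)
```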
Take a conference held by multiple speakers using terminals with PTT buttons as an example. As shown in fig. 2, the terminals are divided into multiple groups (two groups in fig. 2), and all terminals communicate with one server, which obtains the voice content to be processed sent by each terminal.
When a speaker in the conference starts speaking, the speaker presses the PTT button of the terminal in use. A speech-related device (not shown in fig. 2) in the group where the terminal is located can then accurately identify the current speaker and the start-of-speaking time from the PTT talk-burst signaling; when the speaker finishes, it can likewise identify the end-of-speaking time. The voice content to be processed, its associated speaker identification information, and its bound time can thus all be obtained through this device.
One point to explain: a group of terminals with PTT buttons is half-duplex. At any moment only one terminal in a group can hold the PTT right, and while any terminal holds it, pressing the PTT button on another terminal in the group cannot claim the right. Across multiple groups, several pieces of voice content to be processed may be sent to the server at the same time (as voice streams); because the server uses different media parameters (voice-stream receiving ports) for different groups, the receiving port identifies which group a stream comes from, and the PTT speaking right identifies which terminal's speaker in that group is currently talking.
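The media-parameter mapping can be sketched as a table from receive port to group, combined with a per-group record of the current PTT floor holder. Port numbers, group names, and class names below are made up for illustration; the patent does not specify them.

```python
# Sketch: the server receives each group's voice stream on a distinct
# port, so the receive port identifies the group, and the current PTT
# floor holder within that group identifies the speaking terminal.
class GroupRouter:
    def __init__(self, port_to_group):
        self._port_to_group = dict(port_to_group)
        self._floor_holder = {}  # group -> terminal currently holding PTT

    def grant_floor(self, group, terminal_id):
        # Half-duplex: at most one terminal per group holds the floor.
        self._floor_holder[group] = terminal_id

    def identify(self, recv_port):
        """Map a receiving port to (group, speaking terminal)."""
        group = self._port_to_group[recv_port]
        return group, self._floor_holder.get(group)

router = GroupRouter({5004: "group-A", 5006: "group-B"})
router.grant_floor("group-A", "IMEI-001")
```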
A terminal with a PTT button can thus supply the speaker identification information and the bound time, and trigger collection of the voice content to be processed, whenever the PTT button is triggered. All of this information travels in the same voice stream, establishing the correspondence between the voice content to be processed and both the speaker identification information and the bound time, so it is clear who said what, and when.
And S103, converting the voice content to be processed into text content, and associating the text content with the speaker identification information and the binding time.
Converting the voice content to be processed into text content means expressing the speaker's speech in written form; mature speech-to-text technology exists for this and is not described further here. When the voice content is converted into text content, the text content is associated with the speaker identification information and the bound time, so that the text content corresponds both to a speaker and to a time (such as the start-of-speaking or end-of-speaking time); it can then be determined which speaker said what at which time.
And S104, determining the position of the text content in the conversation record based on the binding time, and adding the text content to the position in the conversation record so that the conversation record sequentially records the text contents of a plurality of speakers at the binding time.
That is, the session record records the text content of multiple speakers in sequence along the axis of time, so the whole course of the session can be understood at a glance. As shown in fig. 3, taking the start-of-speaking time as the bound time, each text content is recorded in order of its start-of-speaking time; the ordering (one representation of position) of the text content in the session record can therefore be determined from the bound time, and the text content added at that ordering so that the sequence of entries follows the sequence of their bound times. One caveat: each text content has both a start-of-speaking time and an end-of-speaking time, so when determining positions the same kind of time must be used throughout, either the start time or the end time; the two must not be mixed.
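The chronological insertion of step S104 can be sketched with a binary search on the bound time, consistently using one time field (start-of-speaking here) for every entry, as cautioned above. The function and field names are illustrative.

```python
# Sketch of step S104: insert each text entry into the session record at
# the position given by its bound time. Using bisect keeps the record
# sorted without re-sorting the whole list on every insertion.
import bisect

def add_to_record(record, entry):
    """record: list of dicts kept sorted by 'start'; entry has 'start'."""
    times = [e["start"] for e in record]
    record.insert(bisect.bisect(times, entry["start"]), entry)

record = []
add_to_record(record, {"start": 12.0, "speaker": "B", "text": "second"})
add_to_record(record, {"start": 3.0, "speaker": "A", "text": "first"})
```

Even though the second entry arrives later, it lands first in the record because its bound time is earlier, which is exactly the property the session record relies on.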
According to the above technical solution, after the voice content to be processed, its associated speaker identification information, and the bound time are obtained, the voice content can be converted into text content, the text content associated with the speaker identification information and the bound time, the position of the text content in the session record determined from the bound time, and the text content added at that position. The session record thus records the text content of multiple speakers in sequence along the axis of the bound time, so that the record can show the course of the session over different periods and the speaking time of each speaker. Because the text content is associated with the speaker identification information and the bound time, the session record can later be searched on at least one of the two, and text content in the session record can be located quickly, so the required text content can be obtained from it.
In addition, in this embodiment, the speech processing method further includes: setting a play control for each text content in the session record, where the play control set for any text content is bound to that text content's voice content to be processed.
That is, after the voice content to be processed is converted into text content, the voice content is not discarded but stored together with the text content, and bound to the play control set at the corresponding text content. The voice content can be played back through the play control, so that the voice and its text are combined; a user can consult the voice content together with the text content, and by both listening and reading get a feel for the atmosphere of the conference at the time.
Fig. 3 shows one form of binding the play control to the voice content to be processed, explained as follows. In the session record of a given day, the speaker, the text content of the speech, and the time of the speech are recorded, and each text content has a play control bound to its voice content; in fig. 3, a play control is placed on the same line as each speaker's identification information. The play control can take various forms; in fig. 3 it is drawn as a triangle inside a circle. If the play control is detected to be triggered (clicked), the voice content bound to it is played, and a user consulting the session content can listen and read in combination, deepening the understanding of the session content.
With the play control, a user can conveniently consult the voice content to be processed together with the text content, experiencing the atmosphere of the conference by both listening and reading and deepening the understanding of the session content.
Fig. 4 is a flowchart of another voice processing method provided by an embodiment of the present invention; after the session record is obtained, its use may include the following steps:
s101, obtaining the voice content to be processed.
S102, obtaining speaker identification information associated with the voice content to be processed and the time for binding the voice content to be processed.
And S103, converting the voice content to be processed into text content, and associating the text content with the speaker identification information and the binding time.
And S104, determining the position of the text content in the conversation record based on the binding time, and adding the text content to the position in the conversation record so that the conversation record sequentially records the text contents of a plurality of speakers at the binding time.
S105: keywords for retrieving session records are obtained. Wherein the keywords may be manually entered by the user, recognized from the speech content obtained at the time of retrieval, or obtained by means of a gesture (e.g., a gesture representing at least one word), while the keywords may be manually entered by the user by means of an interface for entering the keywords, the interface having a text box for entering the keywords in the text box.
S106: and obtaining the text content corresponding to the keyword from the session record. The text content corresponding to the keyword may be: at least one of text content containing keywords, text content containing words similar to keywords, wherein the words similar to keywords may be, but are not limited to: synonyms of keywords, and the like.
S107: the text content corresponding to the keywords is displayed by taking the time associated with the text content corresponding to the keywords as an axis, so that the session records not only arrange the text contents in the order of time, but also sequence the searched text contents in the order of time during searching, and the occurrence time of the keywords can be clear based on the time during displaying.
Taking manual keyword entry as an example, the interface shown in fig. 5 contains a text box into which the user enters a keyword, for example "crime". The text contents corresponding to the keyword, such as those containing the keyword "crime", are then extracted from all text contents of the session record. If several text contents all contain the keyword "crime", all of them are displayed in the interface shown in fig. 5. One possible display order is chronological, according to the time associated with each text content.
Another possibility, though not the only one, is to rank the text contents by how often the keyword "crime" occurs in each of them: text contents in which the keyword occurs more often are placed before those in which it occurs less often, and so on. This helps locate the important text contents, because a word that is repeatedly raised and emphasized indicates that the speaker attaches great importance to what is being said; ranking by keyword frequency therefore helps the user consulting the session record retrieve the text contents they are looking for.
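The frequency-based ranking described above can be sketched in a few lines; counting non-overlapping occurrences with `str.count` is an illustrative simplification.

```python
def rank_by_frequency(texts, keyword):
    """Order text contents by how often the keyword occurs in them,
    most frequent first, as in the alternative ranking described above."""
    return sorted(texts, key=lambda t: t.count(keyword), reverse=True)
```

Since `sorted` is stable, text contents with equal counts keep their original (e.g. chronological) order.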
In addition, while presenting the text content corresponding to the keyword, the keyword and/or words similar to it may be presented in a special form that conspicuously marks their positions in the text content, reminding the user to focus on the keyword, the similar words, and the adjacent content, since the content at those positions is what the user needs to pay attention to.
In this embodiment, the reminding manner may be, but is not limited to, changing the display style of the keyword and/or similar words, for example changing at least one of the font color, bold weight, and italics. In fig. 5 the keyword "crime" is shown in bold italics, so text content matching the keyword can be spotted at a glance. Besides these styles, other effects, such as adding a jitter (shake) effect to the keyword and/or similar words, can also prompt the user consulting the session record; the specific setting can be chosen according to the actual application and is not described further here.
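A minimal sketch of the special-form display: every occurrence of the keyword is wrapped in bold-italic markup, echoing how "crime" appears in fig. 5. HTML tags are an illustrative choice; the patent does not fix a markup format.

```python
def highlight(text, keyword):
    """Mark every occurrence of the keyword in a special (bold-italic) form,
    so its position in the text content stands out to the user."""
    return text.replace(keyword, f"<b><i>{keyword}</i></b>")
```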
In this technical scheme, a session record is displayed with time as an axis, so that the start time of each piece of to-be-processed voice content (i.e., the time at which speaking began), the speaker, and the text content can all be seen clearly. A user consulting the session record can quickly locate the text content they wish to see by searching for keywords, and can also directly play back any to-be-processed voice content of interest.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present invention is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently with other steps in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Corresponding to the foregoing method embodiment, an embodiment of the present invention further provides a speech processing apparatus, whose structure is shown in fig. 6, and may include: a first obtaining module 10, a second obtaining module 11, a converting module 12 and a determining module 13.
A first obtaining module 10, configured to obtain the to-be-processed voice content. The to-be-processed voice content is voice content that needs to be converted into text, and can be obtained in a conference of two or more speakers or during a voice call. It can be collected by the terminal used by the speaker and then sent by that terminal to the first obtaining module 10.
A second obtaining module 11, configured to obtain the speaker identification information associated with the to-be-processed voice content and the time bound to the to-be-processed voice content. The speaker identification information indicates which speaker produced the to-be-processed voice content, so that each utterance corresponds to its speaker in a one-to-one relationship.
One form the speaker identification information can take in this embodiment is the identification information of the terminal that collects the speaker's to-be-processed voice content, because each terminal has a unique identifier; for example, the IMEI (International Mobile Equipment Identity) differs between terminals, so the terminal's IMEI can serve as the speaker identification information.
Another form of speaker identification information is based on voice characteristics: the pitch, intensity, duration, and timbre of each speaker's voice can be stored in advance, and when a speaker speaks, recognizing these characteristics is enough to confirm who said the words.
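The voice-characteristic approach can be illustrated as nearest-profile matching over stored feature vectors. This is a toy stand-in for real speaker recognition: the feature tuple (pitch, intensity, duration, timbre), the Euclidean distance, and all numbers are assumptions for illustration only.

```python
import math

def identify_speaker(profiles, features):
    """Return the stored speaker whose (pitch, intensity, duration, timbre)
    profile is closest to the observed features -- a simplified sketch of
    confirming the speaker from pre-stored voice characteristics."""
    def distance(profile):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile, features)))
    return min(profiles, key=lambda speaker: distance(profiles[speaker]))
```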
The above apparatus embodiment is explained taking a special terminal as an example. The terminal is special because it has a specific control: when a speaker needs to speak, the control must be triggered for the speaker to obtain the right to speak. For such a terminal, an optional structure of the second obtaining module 11 is as follows: the second obtaining module 11 includes a monitoring unit and a binding unit.
The monitoring unit is configured to, if it monitors that the specific control in the terminal corresponding to a speaker is triggered, take the identification information of that terminal as the speaker identification information associated with the content to be processed.
The binding unit is configured to take the time at which the specific control in the terminal is triggered as the time bound to the to-be-processed voice content, where that time is one of the time at which the speaker starts speaking and the time at which the speaker finishes speaking through the terminal; the other of the two times is recorded in the conversation record together with the text content.
For examples relating to the monitoring unit and the binding unit, refer to the corresponding method embodiment, which is not repeated here. When a terminal with a PTT (push-to-talk) button is triggered, the monitoring unit and the binding unit can obtain the speaker identification information and the bound time at the same moment the terminal is triggered to collect the to-be-processed voice content, so multiple types of information are obtained from the same voice stream. This establishes the correspondence between the to-be-processed voice content, the speaker identification information, and the bound time, making it clear what content each speaker spoke at what time.
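The monitoring and binding units can be sketched together: on a PTT press, the terminal's identifier (e.g. its IMEI) becomes the speaker identification information and the trigger time becomes the bound time. The `PttMonitor` class, its callback name, and the injected clock are illustrative assumptions.

```python
import time

class PttMonitor:
    """Sketch of the monitoring + binding units for a PTT terminal."""
    def __init__(self, clock=time.time):
        self.clock = clock  # injectable for testing; real code would use wall time

    def on_ptt_pressed(self, terminal_imei):
        # Monitoring unit: the triggering terminal's identifier serves as
        # the speaker identification information.
        speaker_id = terminal_imei
        # Binding unit: the trigger time is bound to the voice content.
        bound_time = self.clock()
        return speaker_id, bound_time
```

Both pieces of information come from the same trigger event, mirroring how the scheme extracts them from a single voice stream.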
The conversion module 12 is configured to convert the to-be-processed voice content into text content and associate the text content with the speaker identification information and the bound time. Converting voice content into text means expressing the speaker's speech in written form; mature speech-to-text technology exists for this and is not elaborated here. Once the to-be-processed voice content is converted, the resulting text content is associated with both the speaker identification information and the bound time (such as the time speaking started or finished), so it can be determined which speaker spoke what at which time.
The determining module 13 is configured to determine the position of the text content in the conversation record based on the bound time and add the text content at that position, so that the conversation record records the text contents of multiple speakers in order of their bound times. The conversation record uses time as an axis to record the text contents of multiple speakers in sequence, so the whole conversation process is clear at a glance. For the related description of the conversation record, refer to the corresponding method embodiment, which is not repeated here.
According to this technical scheme, after the to-be-processed voice content, its associated speaker identification information, and its bound time are obtained, the voice content can be converted into text content, which is associated with the speaker identification information and the bound time. The position of the text content in the conversation record is determined from the bound time and the text content is added there, so the conversation record records the text contents of multiple speakers sequentially along the time axis. The conversation record can thus show the progress of the conversation in different periods, which speakers participated, and when each speaker spoke. Moreover, because the text content is associated with the speaker identification information and the bound time, the conversation record can later be searched by either or both of these, quickly locating the desired text content.
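Since every text content carries a speaker identifier and a bound time, the record can be filtered by either or both. A minimal sketch, again assuming `(time, speaker, text)` tuples:

```python
def search_record(entries, speaker_id=None, start=None, end=None):
    """Filter entries by speaker identification information and/or a
    bound-time range, as the scheme above allows."""
    out = []
    for t, s, text in entries:
        if speaker_id is not None and s != speaker_id:
            continue  # wrong speaker
        if start is not None and t < start:
            continue  # before the requested range
        if end is not None and t > end:
            continue  # after the requested range
        out.append((t, s, text))
    return out
```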
In addition, this apparatus embodiment further includes a setting module, configured to set a playing control for each text content in the conversation record, where the playing control set for any text content is bound to the to-be-processed voice content of that text content.
In this technical scheme, the to-be-processed voice content is bound to the playing control placed at the corresponding text content and can be played back through that control. Voice content and text content are thus combined, making it convenient for the user to review the voice content together with the text and to experience, by listening and watching, the atmosphere of the original conference.
Fig. 7 illustrates another speech processing apparatus provided by an embodiment of the present invention. On the basis of fig. 6, the apparatus may further include: a third obtaining module 14, a fourth obtaining module 15 and a first display module 16.
A third obtaining module 14, configured to obtain a keyword for retrieving the conversation record. The keyword may be manually entered by the user, recognized from voice content obtained at retrieval time, or obtained by means of a gesture (e.g., a gesture representing at least one word). When entered manually, the user may use an interface that provides a text box for typing the keyword.
A fourth obtaining module 15, configured to obtain the text content corresponding to the keyword from the conversation record. The text content corresponding to the keyword may be at least one of: text content containing the keyword, and text content containing words similar to the keyword. Words similar to the keyword may be, but are not limited to, synonyms of the keyword.
The first display module 16 is configured to display the text content corresponding to the keyword with the time associated with that text content as an axis. In this way the conversation record not only arranges text contents in chronological order, but also sorts the retrieved text contents chronologically during retrieval, so that the time at which each keyword occurred is clear at display time.
In this technical scheme, a conversation record is displayed with time as an axis, so that the start time of each piece of to-be-processed voice content (i.e., the time at which speaking began), the speaker, and the text content can all be seen clearly. A user consulting the conversation record can quickly locate the text content they wish to see by searching for keywords, and can also directly play back any to-be-processed voice content of interest.
In addition, this apparatus embodiment further includes a second display module, configured to display, in the text content corresponding to the keyword, the keyword and/or words similar to it in a special form. The special form conspicuously marks their positions in the text content, reminding the user to focus on the keyword, the similar words, and the adjacent content, since that is the content requiring the user's attention. For the specific reminding manner, refer to the illustration in the related method embodiment, which is not repeated here.
In addition, an embodiment of the present invention further provides a processing device, where the processing device has a memory and a processor, the memory stores one or more programs, and the processor executes the one or more programs to implement the above-mentioned voice processing method.
An embodiment of the present invention further provides a computer-readable storage medium, in which one or more programs are stored, and when the one or more programs are executed on at least one processor, the method for processing speech is implemented.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method of speech processing, the method comprising:
obtaining voice content to be processed;
obtaining speaker identification information associated with the voice content to be processed and the time for binding the voice content to be processed;
converting the voice content to be processed into text content, and associating the text content with the speaker identification information and the binding time;
and determining the position of the text content in a conversation record based on the bound time, and adding the text content at the position in the conversation record, so that the conversation record sequentially records the text contents of a plurality of speakers at the bound time.
2. The method of claim 1, wherein obtaining speaker identification information associated with the pending voice content and a time at which the pending voice content is bound comprises:
if it is monitored that a specific control in a terminal corresponding to a speaker is triggered, taking identification information of the terminal as speaker identification information associated with the content to be processed;
and taking the time when a specific control in the terminal is triggered as the time for binding the voice content to be processed.
3. The method according to claim 2, wherein the time when the specific control in the terminal is triggered is one of the time when the speaker starts speaking through the terminal and the time when the speaker finishes speaking through the terminal, the other of the two times being recorded together with the text content in the conversation record, and the speaker triggers the specific control when starting to speak and when finishing speaking through the terminal.
4. The method of claim 1, further comprising: and setting a playing control for each text content in the session record, wherein the playing control set for any text content is bound with the to-be-processed voice content of the text content.
5. The method of claim 1, further comprising:
obtaining a keyword for searching the session record;
obtaining text content corresponding to the keywords from the session record;
and displaying the text content corresponding to the keyword by taking the time associated with the text content corresponding to the keyword as an axis.
6. The method of claim 5, further comprising: displaying, in the text content corresponding to the keywords, the keywords and/or words similar to the keywords in a special form.
7. A speech processing apparatus, characterized in that the apparatus comprises:
the first obtaining module is used for obtaining the voice content to be processed;
a second obtaining module, configured to obtain speaker identification information associated with the to-be-processed voice content and time for binding the to-be-processed voice content;
the conversion module is used for converting the voice content to be processed into text content and associating the text content with the speaker identification information and the binding time;
and the determining module is used for determining the position of the text content in the conversation record based on the binding time, and adding the text content at the position in the conversation record, so that the conversation record sequentially records the text contents of a plurality of speakers at the binding time.
8. The apparatus of claim 7, wherein the second obtaining module comprises:
the monitoring unit is used for taking the identification information of the terminal as the identification information of the speaker associated with the content to be processed if the situation that a specific control in the terminal corresponding to the speaker is triggered is monitored;
and the binding unit is used for taking the time when the specific control in the terminal is triggered as the time for binding the voice content to be processed.
9. A processing device having a memory and a processor, the memory having one or more programs stored therein which when executed by the processor implement the method of any of claims 1-6.
10. A computer readable storage medium having one or more programs stored therein which when executed on at least one processor implement the method of any of claims 1-6.
CN201910329441.9A 2019-04-23 2019-04-23 Voice processing method and device Pending CN111835529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910329441.9A CN111835529A (en) 2019-04-23 2019-04-23 Voice processing method and device


Publications (1)

Publication Number Publication Date
CN111835529A true CN111835529A (en) 2020-10-27

Family

ID=72911828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910329441.9A Pending CN111835529A (en) 2019-04-23 2019-04-23 Voice processing method and device

Country Status (1)

Country Link
CN (1) CN111835529A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009122839A (en) * 2007-11-13 2009-06-04 Sharp Corp Electronic conference support system
US20100158232A1 (en) * 2008-12-23 2010-06-24 Nortel Networks Limited Accessing recorded conference content
CN106057193A (en) * 2016-07-13 2016-10-26 深圳市沃特沃德股份有限公司 Conference record generation method based on telephone conference and device
CN107210036A (en) * 2015-02-03 2017-09-26 杜比实验室特许公司 Meeting word cloud
US20180027123A1 (en) * 2015-02-03 2018-01-25 Dolby Laboratories Licensing Corporation Conference searching and playback of search results
CN109560941A (en) * 2018-12-12 2019-04-02 深圳市沃特沃德股份有限公司 Minutes method, apparatus, intelligent terminal and storage medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115552407A (en) * 2021-03-02 2022-12-30 互动解决方案公司 Demonstration evaluation system
CN115552407B (en) * 2021-03-02 2024-02-02 互动解决方案公司 Demonstration evaluation system and computer-readable information recording medium
US11908474B2 (en) 2021-03-02 2024-02-20 Interactive Solutions Corp. Presentation evaluation system
CN113128221A (en) * 2021-05-08 2021-07-16 聚好看科技股份有限公司 Method for storing speaking content, display device and server


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201027)