CN114373464A - Text display method and device, electronic equipment and storage medium

Text display method and device, electronic equipment and storage medium

Info

Publication number
CN114373464A
Authority
CN
China
Prior art keywords
text, voice, conversation, session, interface
Prior art date
Legal status
Pending
Application number
CN202111669749.1A
Other languages
Chinese (zh)
Inventor
马茂斐
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111669749.1A
Publication of CN114373464A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a text display method and apparatus, an electronic device, and a storage medium. The text display method includes: displaying a session interface of a voice session; identifying speaker information of the voice session based on the interface content of the session interface; recognizing utterance speech of the voice session to obtain recognized text corresponding to the utterance speech; and displaying the recognized text in association with the speaker information. The method can display the text recognized from speech in a voice session in association with its speaker without requiring the voiceprint information of the session objects to be recorded in advance, improving the user experience.

Description

Text display method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of electronic devices, and in particular, to a text display method and apparatus, an electronic device, and a storage medium.
Background
With rapid advances in technology and living standards, electronic devices (such as smartphones and tablet computers) have become some of the most commonly used electronic products in people's lives. Because electronic devices have communication functions, people often use them for voice sessions (e.g., voice conferences, voice chats, video calls, etc.).
Disclosure of Invention
The application provides a text display method and apparatus, an electronic device, and a storage medium, which can conveniently display the text content recognized from speech in a voice session in association with speaker information.
In a first aspect, an embodiment of the present application provides a text display method, the method including: displaying a session interface of a voice session; identifying speaker information of the voice session based on the interface content of the session interface; recognizing utterance speech of the voice session to obtain recognized text corresponding to the utterance speech; and displaying the recognized text in association with the speaker information.
In a second aspect, an embodiment of the present application provides a text display apparatus, the apparatus including an interface display module, a user recognition module, a speech recognition module, and a recognition output module. The interface display module is configured to display a session interface of a voice session; the user recognition module is configured to identify speaker information of the voice session based on the interface content of the session interface; the speech recognition module is configured to recognize utterance speech of the voice session to obtain recognized text corresponding to the utterance speech; and the recognition output module is configured to display the recognized text in association with the speaker information.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the text presentation method provided in the first aspect above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code may be called by a processor to execute the text presentation method provided in the first aspect.
According to the above scheme, a session interface of a voice session is displayed; speaker information of the voice session is identified based on the interface content of the session interface; utterance speech of the voice session is recognized to obtain recognized text corresponding to the utterance speech; and the recognized text is displayed in association with the speaker information. Thus, when an electronic device is used for a voice session, the interface content of the session interface can be used to display the text corresponding to an utterance in association with the corresponding speaker, which avoids requiring the user to record voiceprint information of the session objects in advance and improves the user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram illustrating an application environment provided by an embodiment of the present application.
FIG. 2 shows a flow diagram of a text presentation method according to one embodiment of the present application.
Fig. 3 shows an interface schematic diagram provided in an embodiment of the present application.
FIG. 4 shows a flow diagram of a text presentation method according to another embodiment of the present application.
Fig. 5 shows a flowchart of step S220 in the text presentation method provided in the embodiment of the present application.
Fig. 6 shows another flowchart of step S220 in the text presentation method provided in the embodiment of the present application.
FIG. 7 shows a flow diagram of a text presentation method according to yet another embodiment of the present application.
FIG. 8 shows a flow diagram of a text presentation method according to yet another embodiment of the present application.
Fig. 9 shows another interface schematic diagram provided in the embodiment of the present application.
Fig. 10 shows a schematic view of another interface provided in the embodiment of the present application.
Fig. 11 shows a schematic view of another interface provided in an embodiment of the present application.
Fig. 12 shows yet another interface schematic provided by an embodiment of the present application.
Fig. 13 shows a schematic diagram of yet another interface provided in an embodiment of the present application.
FIG. 14 shows a flow diagram of a method of text presentation according to yet another embodiment of the present application.
FIG. 15 shows a block diagram of a text presentation device according to one embodiment of the present application.
Fig. 16 is a block diagram of an electronic device for executing a text presentation method according to an embodiment of the present application.
Fig. 17 is a storage unit according to an embodiment of the present application, configured to store or carry program code for implementing a text presentation method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In some scenarios, when a user conducts a voice session using an electronic device, it may be desirable to output the speech in the voice session as text. Moreover, because a voice session usually involves multiple session objects, a user viewing the text corresponding to an utterance also needs to know which session object that utterance belongs to.
In the related art, voiceprint information of the different session objects in a voice session is usually recorded in advance, so that the utterances of different session objects can be recognized during the session, converted into text, and output in association with their session objects, meeting the user's need to view the converted text of different session objects. However, this approach requires the voiceprint information of each session object to be bound to that object in advance, and the binding process requires cumbersome user operations; moreover, the session objects may differ from one voice session to the next, so voiceprint information bound for a previous session may not apply to the next one. This approach is therefore cumbersome and insufficiently flexible.
In view of the above problems, the inventor provides the text display method and apparatus, electronic device, and storage medium of the embodiments of the present application, so that when an electronic device is used for a voice session, the text corresponding to an utterance can be displayed in association with the corresponding speaker by using the interface content of the session interface, avoiding the cumbersome step of having the user record voiceprint information of the session objects in advance and improving the user experience. The specific text display method is explained in detail in the following embodiments.
The following first introduces a scenario related to an embodiment of the present application.
As shown in fig. 1, the scenario includes a plurality of electronic devices 100 (only two are shown in the figure). The electronic devices 100 can communicate with each other through a network to realize a multi-person voice session, such as a voice conference, voice chat, video conference, or video chat. When conducting a voice session, an electronic device 100 can display a session interface, identify the speaker based on the interface content of the session interface, recognize the utterance speech of the voice session to obtain the corresponding recognized text, and then display the recognized text in association with the speaker information. In this way, the recognized text and the speaker are displayed in association without recording the voiceprint of any session object, avoiding the cumbersome user operations of the related art and improving the user experience.
The text display method provided by the embodiment of the application is described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a text presentation method according to an embodiment of the present application. In a specific embodiment, the text presentation method is applied to the text presentation apparatus 400 shown in fig. 15 and the electronic device 100 (fig. 16) equipped with the text presentation apparatus 400. The following will describe a specific process of this embodiment by taking an electronic device as an example, and it is understood that the electronic device applied in this embodiment may be a smart phone, a tablet computer, a smart watch, an electronic book, and the like, which is not limited herein. As will be described in detail with respect to the flow shown in fig. 2, the text display method may specifically include the following steps:
step S110: a session interface for the voice session is displayed.
In the embodiment of the application, the electronic device can display a conversation interface of the voice conversation under the condition of the voice conversation. The voice conversation can be a conversation carried out by voice, such as voice call, voice conference, video conference, voice chat, video chat and the like; the conversation interface is displayed when the application program corresponding to the voice conversation executes the voice conversation.
In some implementations, controls for controlling the voice conversation, user identification of conversation objects, and the like can be included in the conversation interface. The control for controlling the voice session may include a control for closing the voice session, controlling a voice input volume of the voice session, controlling a voice output volume of the voice session, and the like, which is not limited herein; the user identifier may be an identifier used for characterizing the user identity, such as a head portrait of the user, a user name, and the like, and the specific user identifier may not be limited. Of course, the content included in the session interface is not limited to this, and for example, a control for managing a session object participating in a voice session may be further included, and for example, when the voice session is a remote video session, an area for displaying a video image of the remote video session may be further included.
Step S120: based on the interface content of the conversation interface, speaker information of the voice conversation is identified.
In the embodiment of the application, in the process of executing the voice conversation, the electronic device can identify the speaker information of the voice conversation based on the interface content of the conversation interface, so that the text after voice conversion is conveniently associated with the speaker information for display. It is understood that the session interface usually includes a state control, a video image of the session, and the like, which are associated with the user identification (e.g., the user avatar, the user name, and the like) of the session object and used for showing the speaking state, and by recognizing these contents, speakers at different times, i.e., the session object in the speaking state, can be obtained, and speaker information can be obtained. The speaker information may include object information of the conversation object in the speaking state, such as a user identifier that can be recognized based on interface content of the conversation interface, a character image in the video image, and the like.
In some embodiments, before the electronic device performs steps S120 to S140, it may determine whether a voice subtitle function is currently turned on, where the voice subtitle function transcribes audio so that the text corresponding to the audio can be displayed. When the voice subtitle function is on, the electronic device may execute steps S120 to S140 to implement the text display method provided in this embodiment, displaying the text corresponding to utterance speech in the voice session in association with the speaker information.
In a possible implementation manner, the electronic device may further recognize whether a scene currently needing character recognition is a scene of a voice conversation under the condition that the function of the voice subtitle is turned on, and if the scene is the scene of the voice conversation, step S120 to step S140 may be performed, so that a text corresponding to speech in the voice conversation is displayed in association with the speaker information.
Step S130: and recognizing the speech of the speech conversation to obtain a recognition text corresponding to the speech.
In this embodiment, while executing the voice session, the electronic device can recognize the utterance speech in the voice session to obtain the corresponding recognized text; that is, the utterance speech is converted into text, and the resulting text content is displayed in association with the speaker information so that the user can conveniently view it.
In some embodiments, the electronic device may acquire the utterance speech of the voice session in real time, that is, the audio stream of the voice session, and recognize and convert it into text content of a target language type as the recognized text corresponding to the utterance speech. The target language type may be Chinese, English, Korean, etc., and is not limited herein. Optionally, the electronic device may send the speech to a cloud server, which performs speech-to-text (STT) conversion and returns the recognized text content to the electronic device; optionally, the electronic device may instead convert the speech into text content through a built-in STT algorithm. Of course, the specific manner in which the electronic device recognizes the utterance speech of the voice session is not limited.
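As a minimal illustration of this step, the Python sketch below transcribes one audio chunk, preferring a cloud service and falling back to an on-device engine; `cloud_stt` and `local_stt` are hypothetical stand-ins rather than any specific vendor API, and the capture timestamp is kept for the later association step.

```python
import time
from dataclasses import dataclass

@dataclass
class RecognizedSegment:
    text: str         # recognized text of the utterance
    timestamp: float  # capture time of the audio chunk

def cloud_stt(chunk: bytes, language: str) -> str:
    # Stand-in for a request to a cloud speech-to-text service.
    raise ConnectionError("no cloud STT configured in this sketch")

def local_stt(chunk: bytes, language: str) -> str:
    # Stand-in for a built-in on-device STT algorithm.
    return ""

def recognize_chunk(audio_chunk: bytes, target_language: str = "zh") -> RecognizedSegment:
    """Convert one chunk of session audio into recognized text (step S130)."""
    captured_at = time.time()
    try:
        text = cloud_stt(audio_chunk, language=target_language)
    except ConnectionError:
        # Fall back to the built-in algorithm when the cloud is unreachable.
        text = local_stt(audio_chunk, language=target_language)
    return RecognizedSegment(text=text, timestamp=captured_at)
```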
In one possible embodiment, the target language type may be a preset language, for example, the electronic device may set a language type selected by a language setting operation as the target language type in response to the language setting operation input by the user; the target language type may also be a default language type, for example, if the target language type is not preset, the current system language of the electronic device may be default as the target language type.
In this embodiment, the electronic device may execute the step of recognizing the speaker information of the voice conversation based on the interface content of the conversation interface and execute the step of recognizing the utterance voice of the voice conversation to obtain the recognition text corresponding to the utterance voice, where the execution order between the step S120 and the step S130 is not limited. For example, the electronic device may first execute step S120 and then execute step S130, may first execute step S130 and then execute step S120, or may execute step S120 and step S130 in parallel.
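Since the order of steps S120 and S130 is left open, one option is to run them concurrently. The sketch below (Python, with placeholder callables standing in for the two steps) only illustrates that freedom; it is not a prescribed implementation.

```python
import threading
from typing import Any, Callable, Dict

def run_steps_in_parallel(identify_speaker_info: Callable[[], Any],
                          recognize_utterances: Callable[[], Any]) -> Dict[str, Any]:
    """Run step S120 (speaker identification) and step S130 (speech-to-text)
    concurrently; both arguments are placeholder callables."""
    results: Dict[str, Any] = {}

    def s120() -> None:
        results["speakers"] = identify_speaker_info()

    def s130() -> None:
        results["text"] = recognize_utterances()

    threads = [threading.Thread(target=s120), threading.Thread(target=s130)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # both results are available once the threads finish
    return results
```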
Step S140: and displaying the identification text after associating the identification text with the speaker information.
In this embodiment, when the electronic device executes a voice session and has identified the speaker information and the recognized text corresponding to the utterance speech, it can display the recognized text in association with the speaker information. When viewing the recognized text, the user can thus tell which session object the corresponding utterance belongs to, which in some scenarios allows the user to quickly produce a text record of the utterances of different session objects. When associating the recognized text with the speaker information for display, the electronic device may associate the recognized text with the object information of the session object in the speaking state; for example, the recognized text may be displayed in association with a user identifier recognized in the session interface.
In some implementations, the recognition text can include a plurality of pieces of text content, and the electronic device can present each piece of text content in association with object information of a corresponding conversation object. For example, referring to fig. 3, the electronic device 100 may present each piece of text content and the user name of the session object corresponding to each piece of text content in the designated area 103 of the session interface a 1.
According to the text display method above, when an electronic device is used for a voice session, the interface content of the session interface can be used to display the text corresponding to an utterance in association with the corresponding speaker, avoiding the need for the user to record voiceprint information of the session objects in advance and improving the user experience.
Referring to fig. 4, fig. 4 is a flowchart illustrating a text presentation method according to another embodiment of the present application. The text display method is applied to the electronic device, and will be described in detail with respect to the flow shown in fig. 4, where the text display method specifically includes the following steps:
step S210: a session interface for the voice session is displayed.
In the embodiment of the present application, step S210 may refer to the contents of other embodiments, which are not described herein again.
Step S220: and recognizing the speaking state of a conversation object of the voice conversation based on the interface image of the conversation interface.
In this embodiment, when identifying the speaker of the voice session based on the interface content of the session interface, the electronic device may recognize the speaking state of each session object based on an interface image of the session interface, so as to identify the session object in the speaking state and thereby obtain the speaker information. It can be understood that the session interface generally contains related content such as state controls showing the speaking state of session objects and a video image area containing captured images of the session objects; therefore, by capturing the interface image of the session interface and recognizing this content in the image, the speaking state of each session object in the voice session can be identified.
In some embodiments, the interface image of the session interface includes state identifiers for characterizing the speaking state of the session objects in the voice session, and a user viewing the session interface can determine which session object is currently speaking based on these state identifiers. Optionally, the session interface includes a user identifier (e.g., a user avatar, a user name, etc.) for each session object and a state identifier at a position adjacent to the user identifier. When the state identifier is in a first display state, the session object corresponding to the user identifier is in the speaking state; when the state identifier is in a second display state, the session object is not speaking. For example, referring again to fig. 3, the user identifier may be a user avatar, and a "horn"-shaped state identifier may be displayed adjacent to each avatar: the state identifier is shown in a "horn playing" display state (e.g., the state identifier corresponding to user 2 in fig. 3) when the corresponding session object is speaking, and in a "horn not playing" display state (e.g., the state identifiers corresponding to user 1 and user 3 in fig. 3) when it is not.
In this embodiment, referring to fig. 5, recognizing a speaking status of a conversation object of a voice conversation based on an interface image of a conversation interface may include:
step S221 a: identifying the display state of the state identifier corresponding to each session object based on the interface image of the session interface;
step S222 a: and if the display state of the state identifier of the first session object meets a first state condition, determining that the first session object is in a speaking state, wherein the first session object is any session object in the voice session.
The electronic device can identify, based on the interface image of the session interface, the region image where the state identifier corresponding to each session object is located, and recognize the display state of the state identifier from that region image. After the display state is recognized, it can be determined whether it satisfies a first state condition: if so, the session object corresponding to the state identifier is determined to be in the speaking state; if not, it is determined not to be. The first state condition may be a condition set based on the display state that the state identifier takes when the session object is speaking. In this way, whether a session object is speaking can be identified quickly and conveniently.
Optionally, after acquiring the region image where the state identifier is located, the electronic device may input the region image to a first pre-trained recognition model, where the first recognition model is pre-trained, so as to output a result of whether the first state condition is satisfied according to the input region image. The first recognition model may be a neural network, etc., and is not limited herein.
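As one concrete (assumed) realization of the first state condition, the sketch below compares the captured status-identifier region against reference icons for the two display states using OpenCV template matching. The icon file paths are illustrative placeholders, and the region extraction is assumed to have been done already; the embodiment equally allows a trained recognition model instead.

```python
import cv2
import numpy as np

# Reference icons for the two display states of the "horn" identifier;
# the file paths are illustrative placeholders.
SPEAKING_ICON = cv2.imread("horn_playing.png", cv2.IMREAD_GRAYSCALE)
IDLE_ICON = cv2.imread("horn_idle.png", cv2.IMREAD_GRAYSCALE)

def satisfies_first_condition(status_region_bgr: np.ndarray,
                              threshold: float = 0.8) -> bool:
    """Return True if the status-identifier region matches the 'speaking'
    display state better than the idle one (the first state condition)."""
    region = cv2.cvtColor(status_region_bgr, cv2.COLOR_BGR2GRAY)
    speaking_score = cv2.matchTemplate(region, SPEAKING_ICON,
                                       cv2.TM_CCOEFF_NORMED).max()
    idle_score = cv2.matchTemplate(region, IDLE_ICON,
                                   cv2.TM_CCOEFF_NORMED).max()
    return speaking_score >= threshold and speaking_score > idle_score
```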
In other embodiments, where the voice session is a remote video session (e.g., a video conference, a video chat, etc.), the interface image may include a video image region of the remote video session, and since the video image region generally includes a user image of a session object participating in the remote video session, the mouth state of each session object may be determined based on the mouth region of the user image to determine whether each session object is in a speaking state.
In this embodiment, referring to fig. 6, recognizing a speaking status of a conversation object of a voice conversation based on an interface image of a conversation interface may include:
step S221 b: identifying a mouth state of each session object in the video image area based on an interface image of the session interface;
step S222 b: and if the mouth state of a second session object meets a second state condition, determining that the second session object is in a speaking state, wherein the second session object is any session object in the voice session.
Based on the interface image of the session interface and preset face images of the session objects, the electronic device can identify the user face image corresponding to each session object in the video image area, obtaining the correspondence between user face images and session objects in the video image area. Then, from each user face image, it can identify the mouth region corresponding to each session object; and based on each mouth region, it can determine the mouth state of each session object.
After determining the mouth state of each conversation object in the remote video conversation, it may be determined whether the mouth state of the conversation object satisfies a second state condition, which may be a condition set based on the mouth state of the conversation object when the conversation object is in the speaking state, for example, the second state condition may be that the opening degree of the mouth is greater than a preset opening degree, and the like. If the second state condition is met, determining that the conversation object corresponding to the mouth state is in the speaking state; and if the second state condition is not satisfied, determining that the session object corresponding to the mouth state is not in the speaking state.
Optionally, after acquiring the mouth region of the conversation object, the electronic device may also input the mouth region to a second recognition model trained in advance, where the second recognition model is trained in advance to output a result of whether the second state condition is satisfied according to the input mouth region. The second recognition model may be a neural network, etc., and is not limited herein.
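To make the second state condition concrete, the sketch below computes a mouth opening degree from mouth contour points. The landmark detector itself is out of scope and assumed to be provided by the surrounding application, and the 0.35 threshold is an illustrative preset opening degree, not a value from the embodiment.

```python
import numpy as np

def mouth_openness(mouth_landmarks: np.ndarray) -> float:
    """Opening degree of the mouth: vertical lip gap divided by mouth width.

    `mouth_landmarks` is assumed to be an (N, 2) array of mouth contour
    points from any face-landmark detector.
    """
    top = mouth_landmarks[:, 1].min()
    bottom = mouth_landmarks[:, 1].max()
    left = mouth_landmarks[:, 0].min()
    right = mouth_landmarks[:, 0].max()
    width = max(right - left, 1e-6)  # guard against degenerate detections
    return (bottom - top) / width

def satisfies_second_condition(mouth_landmarks: np.ndarray,
                               preset_opening: float = 0.35) -> bool:
    # Second state condition: opening degree greater than a preset value.
    return mouth_openness(mouth_landmarks) > preset_opening
```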
In a possible implementation, if the voice session is a remote video session, the electronic device may combine the above two approaches to determine whether a session object is speaking. Optionally, considering that ambient sound may be picked up during a voice session and cause the display state of a user's state identifier to satisfy the first state condition, the electronic device may, upon determining from the interface image that the state identifier of a session object satisfies the first state condition, further determine the mouth state of that session object in the video image: if the mouth state also satisfies the second state condition, the session object is determined to be in the speaking state; if not, it is determined not to be.
Optionally, considering that the mouth state of the user meets the second state condition, it may also be the case that the user eats something, yawns, and the like, so that the electronic device may determine the display state of the state identifier of the session object based on the interface image of the session interface when determining that the mouth state of the session object meets the second state condition based on the video image region in the interface image; if the display state of the state identifier of the session object meets the first state condition, determining that the session object is in a speaking state; if the display state of the state identifier of the session object does not satisfy the first state condition, it may be determined that the session object is not in the speaking state.
In a possible implementation, the session interface may show the user who is currently providing voice input, that is, the user in the speaking state. The electronic device may therefore identify the area showing the user currently providing voice input and recognize the user name by Optical Character Recognition (OCR) to obtain the session object in the speaking state.
In some embodiments, since the conversation objects in the speaking state at different times may be different, if the conversation objects need to be accurately associated with the recognition texts corresponding to the speaking voices, the conversation objects in the speaking state at different times need to be accurately recognized. Therefore, the electronic device can acquire interface images of the conversation interface at different moments to determine the speaking states of the conversation objects at various moments in the above manner. Alternatively, the electronic device may intercept an interface image of the conversation interface and record a timestamp so as to obtain the speaking status of the conversation object associated with the timestamp.
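A minimal sketch of this sampling loop, assuming `session_active`, `capture_interface_image`, and `recognize_states` are supplied by the surrounding application:

```python
import time
from typing import Any, Callable, Dict, List, Tuple

def sample_speaking_states(
    session_active: Callable[[], bool],
    capture_interface_image: Callable[[], Any],
    recognize_states: Callable[[Any], Dict[str, bool]],
    interval_s: float = 0.5,
) -> List[Tuple[float, Dict[str, bool]]]:
    """Capture the session interface periodically and record, per timestamp,
    which session objects are in the speaking state (step S220)."""
    timeline: List[Tuple[float, Dict[str, bool]]] = []
    while session_active():
        ts = time.time()                   # record the timestamp...
        frame = capture_interface_image()  # ...alongside a screenshot
        timeline.append((ts, recognize_states(frame)))
        time.sleep(interval_s)
    return timeline
```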
Step S230: and acquiring the object information of the conversation object in the speaking state to obtain the speaker information.
In this embodiment, after recognizing the speaking state of the session objects based on the interface image of the session interface, the electronic device may determine the object information of the session object in the speaking state, thereby obtaining the speaker information so that it can be displayed in association with the recognized text corresponding to the utterance speech.
The object information may be a user identifier of the session object (e.g., a user avatar, a user name, etc.), the session object's corresponding user image in the video image area, and so on. For example, if the interface image includes state identifiers characterizing the speaking state of session objects, each state identifier is usually associated with a user identifier in the session interface, and the user identifier associated with a state identifier satisfying the first state condition (i.e., the state identifier of the session object in the speaking state) may be used as the object information of that session object. As another example, if the interface image includes a video image area of a remote video session and the session object in the speaking state is identified by recognizing the mouth state of each session object in that area, the user image (for example, the head area) whose mouth state satisfies the second state condition can be used as the object information of the session object in the speaking state.
In some embodiments, since the speaker information at different times may differ, when a session object in the speaking state is determined to be the speaker, the determined speaker may be recorded in association with a timestamp, so that the speaker can later be accurately associated with the recognized text corresponding to the utterance speech.
Step S240: and recognizing the speech of the speech conversation to obtain a recognition text corresponding to the speech.
In the embodiment of the present application, step S240 may refer to the contents of other embodiments, which are not described herein again.
Step S250: and displaying the identification text after associating the identification text with the speaker information.
In a possible implementation, if the session object in the speaking state is determined by recognizing the state identifiers in the interface image, the acquired object information of the session object in the speaking state is the user identifier corresponding to that session object, and the user identifier may be displayed in association with the recognized text.
In a possible implementation, if the session object in the speaking state is identified by recognizing the mouth state of each session object in the video image area of the interface image, the acquired object information of the session object in the speaking state is the user image corresponding to that session object, and the user image may be displayed in association with the recognized text.
According to the text display method provided by this embodiment, the session object in the speaking state is identified as the speaker from the interface image of the session interface, and the recognized text corresponding to the utterance speech is then displayed in association with the identified speaker. The user therefore does not need to record voiceprint information of the session objects before the voice session, which avoids that cumbersome step, improves the flexibility of converting session speech into text, and improves the user experience.
Referring to fig. 7, fig. 7 is a flowchart illustrating a text presentation method according to another embodiment of the present application. The text display method is applied to the electronic device, and will be described in detail with respect to the flow shown in fig. 7, where the text display method specifically includes the following steps:
step S310: a session interface for the voice session is displayed.
Step S320: based on the interface content of the conversation interface, the speaker information of the voice conversation is identified, and the speaker information comprises object information of conversation objects in speaking states at different moments.
Step S330: and recognizing the speech of the speech conversation to obtain a recognition text corresponding to the speech.
In the embodiment of the present application, steps S310 to S330 may refer to contents of other embodiments, which are not described herein again.
Step S340: and acquiring object information of a target session object corresponding to the timestamp in the speaker information based on the timestamp corresponding to the speaking voice, wherein the target session object is in a speaking state at the moment corresponding to the timestamp.
In this embodiment, after acquiring the recognized text corresponding to the utterance speech and the speaker information, the electronic device may acquire, based on the timestamp corresponding to the utterance speech, the object information of the target session object corresponding to that timestamp in the speaker information, where the target session object is in the speaking state at the time corresponding to the timestamp; that is, the speaker corresponding to the timestamp is the target session object. The recognized text of an utterance is thereby associated with the speaker identified at the same moment, achieving accurate association between recognized text and session object.
In some embodiments, the speaker information may include object information of a conversation object in a speaking state at a time corresponding to different timestamps, and the electronic device may determine, according to the timestamp corresponding to the speaking voice and the object information of the conversation object in the speaking state at the time corresponding to the different timestamps, object information of a target conversation object corresponding to the timestamp corresponding to the speaking voice in the speaker information.
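Assuming the speaker information is kept as a timestamp-sorted timeline (as sampled earlier), looking up the target session object for an utterance timestamp can be a simple binary search. This is an illustrative sketch, not a data structure prescribed by the embodiment.

```python
import bisect
from typing import Any, List, Optional, Tuple

def speaker_at(timeline: List[Tuple[float, Any]],
               utterance_ts: float) -> Optional[Any]:
    """Return the speaker info recorded at or just before the utterance
    timestamp; `timeline` is sorted by timestamp."""
    stamps = [ts for ts, _ in timeline]
    i = bisect.bisect_right(stamps, utterance_ts)
    if i == 0:
        return None  # nothing recorded yet at that moment
    return timeline[i - 1][1]
```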
In some embodiments, the same timestamp may be found to correspond to the object information of multiple session objects in the speaking state (i.e., multiple speakers). If the recognized text were simply displayed in association with all of them, the user could not tell from the displayed result which session object the text belongs to, that is, who actually said the content. Therefore, when multiple session objects in the speaking state correspond to the same timestamp, a preset screening rule can be used to determine, among them, the target session object to be associated with the recognized text, obtaining the object information of the target speaker.
In one possible implementation, different session objects are preset with different speaking priorities, where the probability that a session object speaks is positively correlated with its speaking priority: the higher the priority, the more likely the object is to be the speaker. For example, in a video conference scenario, session objects in different roles may have different speaking priorities, with more senior roles having higher priority. When the object information of multiple session objects corresponds to the same timestamp, the speaking priority of each may be acquired, the session object with the highest speaking priority determined from among them, and its object information used as the object information of the target session object to be associated with the recognized text.
In another possible implementation, since voice input is continuous while a user speaks, the target session object to be associated with the recognized text may be determined from the multiple speaking session objects based on speaking duration. When the object information of multiple session objects corresponds to the same timestamp, the electronic device can acquire the text length of the recognized text and the continuous duration of the speaking state of each of those session objects within the time period corresponding to the timestamp; determine the utterance duration corresponding to that text length according to a preset correspondence between text length and utterance duration; and match the utterance duration against the continuous duration of each session object. The session object whose continuous duration matches the utterance duration is taken as the target session object, and its object information is used for the subsequent association with the recognized text.
It should be noted that, for the determined object information of the multiple session objects, a manner of determining object information of a target session object to be subsequently associated with the recognition text may not be limited. For example, the two embodiments may also be combined, and the object information of the target session object to be subsequently associated with the recognition text is determined from the determined object information of the plurality of session objects.
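A sketch combining the two screening rules above: the candidate whose continuous speaking duration best matches the duration implied by the text length wins, with speaking priority as the tie-breaker. The candidate fields and the assumed speaking rate (`chars_per_second`, standing in for the preset text-length/duration correspondence) are illustrative.

```python
from typing import Any, Dict, List

def pick_target_speaker(
    candidates: List[Dict[str, Any]],
    recognized_text: str,
    chars_per_second: float = 4.0,
) -> Any:
    """Disambiguate when several session objects were speaking at the same
    timestamp.

    Each candidate dict is assumed to hold 'info' (object information),
    'priority' (speaking priority), and 'duration_s' (continuous speaking
    time within the timestamp's period).
    """
    expected_s = len(recognized_text) / chars_per_second
    best = min(
        candidates,
        # Smallest duration mismatch first; higher priority breaks ties.
        key=lambda c: (abs(c["duration_s"] - expected_s), -c["priority"]),
    )
    return best["info"]
```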
Step S350: and displaying the identification text after associating the identification text with the object information of the target session object.
In this embodiment, after acquiring the recognized text and the object information of the target session object corresponding to the timestamp of the utterance speech, the electronic device may display the recognized text in association with the object information of the target session object. The text content converted from an utterance can thus be accurately displayed in association with its speaker, making it easy for the user to know which speaker each piece of recognized text belongs to.
In some embodiments, when displaying the recognized text in association with the target session object, the electronic device may display it in association with the user identifier corresponding to the target session object. Optionally, the user identifier may be a user avatar, and the electronic device may present the recognized text together with the avatar as one entry in the voice session, for example with the avatar shown before the text. Optionally, the user identifier may be a user name: for example, if the user name is "A" and the recognized text is "Be sure to complete task B", the entry may be presented as "A: Be sure to complete task B". Of course, the specific way in which the electronic device displays the recognized text in association with the user identifier of the target session object is not limited.
According to the text display method above, when an electronic device is used for a voice session, speaker information is identified from the interface content of the session interface, the object information of the target session object matching the timestamp of the utterance speech is determined from the speaker information, and the recognized text of the utterance is displayed in association with that target session object. The text corresponding to an utterance can thus be accurately displayed together with the corresponding speaker information, making it easy for the user to know who said what, avoiding the cumbersome step of recording voiceprint information of the session objects in advance, and improving the user experience.
Referring to fig. 8, fig. 8 is a flowchart illustrating a text presentation method according to still another embodiment of the present application. The text display method is applied to the electronic device, and will be described in detail with respect to the flow shown in fig. 8, where the text display method specifically includes the following steps:
step S410: displaying a conversation interface of the voice conversation, wherein the conversation interface comprises a text display control.
In this embodiment, the session interface includes a text display control, which is used to display, during the voice session, the recognized text corresponding to the session speech recognized in real time together with the speaker.
Step S420: based on the interface content of the conversation interface, speaker information of the voice conversation is identified.
Step S430: and recognizing the speech of the speech conversation to obtain a recognition text corresponding to the speech.
Step S440: and displaying the recognition text corresponding to the speaking voice at the current moment in the text display control after associating the recognition text with the object information of the conversation object in the speaking state at the current moment.
In this embodiment, the electronic device can identify the speaker information based on the interface content of the session interface in real time and recognize the utterance speech of the voice session in real time, obtaining the recognized text corresponding to the utterance speech at the current moment and the object information of the session object speaking at the current moment. Once both are acquired, the recognized text can be displayed in the text display control in association with the object information of the session object speaking at the current moment.
Illustratively, referring to fig. 9, a text presentation control Q1 may be included in the conversation interface a1, and the text presentation control Q1 is used for presenting the recognition text corresponding to the conversation voice recognized in real time and the user name of the conversation object in the speaking state at the current time, so that the user can know the recognized text content and the speaker through the text presentation control Q1 when viewing the conversation interface a 1.
Step S450: and responding to a first preset operation aiming at the text display control, and displaying a voice recognition record in the voice conversation process in the conversation interface, wherein the voice recognition record comprises recognition texts corresponding to speaking voices at different moments in the voice conversation and object information of a conversation object in a speaking state corresponding to the different moments.
In this embodiment of the present application, the text presentation control may be further configured to trigger presentation of a voice recognition record in a voice conversation process. The electronic equipment can detect operation in a session interface, and when the operation for the text display control is detected, whether the operation is a first preset operation can be determined; if the operation is a first preset operation, the voice recognition record in the voice conversation process can be displayed in the conversation interface in response to the first preset operation. The voice recognition record may include recognition texts corresponding to speaking voices at different times in the voice conversation, and object information of conversation objects in speaking states corresponding to different times, that is, text contents recognized in history and object information of speakers corresponding to the text contents.
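One way the entries of such a voice recognition record could be modeled is sketched below; this is an assumption for illustration, as the embodiment does not prescribe a storage format.

```python
from dataclasses import dataclass

@dataclass
class RecognitionRecordEntry:
    """One item of the voice recognition record shown in the session interface."""
    timestamp: float      # time of the utterance
    speaker: str          # object information, e.g. a user name or avatar id
    text: str             # recognized text content
    edited: bool = False  # set once the user modifies this entry (see below)
```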
In some embodiments, the electronic device detects an operation in the session interface, and may detect a press operation, a click operation, a slide operation, a drag operation, and the like in the session interface. The first preset operation may be a click operation for a text display control; the first preset operation may also be a sliding operation for the text display control, where when the first preset operation is a sliding operation, a sliding track of the first preset operation meets a preset sliding track. Of course, the specific first preset operation may not be limited in the embodiment of the present application.
In some embodiments, in response to the first preset operation on the text display control, when displaying the voice recognition record in the session interface, the electronic device may expand the text display control into a recognition detail panel, for example displayed in the middle area of the session interface, while hiding the text display control itself; the voice recognition record of the voice session is then displayed in the recognition detail panel. The text display control can thus serve as a floating ball that shows the text content recognized in real time together with the corresponding speaker's object information, and can also trigger the display of all content recognized during the session.
In some embodiments, after displaying the voice recognition record in the session interface in response to the first preset operation on the text display control, the electronic device may continue to detect operations in the session interface, so that when a second preset operation on the voice recognition record is detected, the record is controlled to be in an editable state in response, and the record is then modified in response to an editing operation on it.
As one possible implementation, the voice recognition record may include multiple pieces of recognized content, each being the text content recognized from the utterance speech at a given time together with the object information of the corresponding speaker. When displaying these pieces of recognized content, the electronic device can show the speaker's object information, the text content, and the time of each piece. When a second preset operation on a piece of recognized content is detected, that piece can be controlled to be in an editable state in response, then modified according to the user's editing operation, and the modified content saved.
Optionally, the second preset operation may be a pressing operation that satisfies a preset pressing condition, for example: the pressing duration is longer than a preset duration, or the pressing force is greater than a preset force; the specific condition is not limited. The user can thus modify a recognition result in the voice recognition record by pressing on it, meeting the user's need for an accurate text record.
In a possible implementation, after controlling the voice recognition record to be editable in response to the second preset operation and modifying it in response to the editing operation, the electronic device may, if a modification uploading condition is satisfied, upload the modification result and the modified content to the server, so that the server can optimize the speech recognition and display functions for voice sessions of the electronic device and then deliver the optimized system version to it. The uploading condition may be, for example, that the user has agreed to a user experience program, and is not limited herein.
In some embodiments, after presenting the voice recognition record in the conversation interface in response to the first preset operation for the text presentation control, the electronic device may continue to detect operations in the conversation interface and, if a file export operation for the voice recognition record is detected, generate a record file of the voice recognition record in response to that operation. Optionally, while the voice recognition record is presented in the conversation interface, a file export control may be displayed there, and the file export operation may be an operation on that control, such as a click or press operation, which is not limited here.
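The record file format is not specified by the embodiment; the sketch below assumes a plain-text file with one line per recognition entry, and the file name and line format are illustrative choices.

```kotlin
import java.io.File

data class ExportEntry(val time: String, val speaker: String, val text: String)

// Generate the record file of the voice recognition record as plain text.
fun exportRecord(entries: List<ExportEntry>, path: String = "voice_recognition_record.txt") {
    val body = entries.joinToString("\n") { "[${it.time}] ${it.speaker}: ${it.text}" }
    File(path).writeText(body)
}
```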
In some embodiments, when displaying the recognition text in association with the speaker information, the electronic device may further translate the recognition text into a translation text in a preset language, and then display the recognition text and the translation text in association with the speaker information. The preset language may be Chinese, English, Korean, and so on, and is not limited here.
Alternatively, the preset language may be configured by the user; for example, the electronic device may, in response to a language setting operation input by the user, set the language type selected by that operation as the preset language. The preset language may also be a default language type; for example, if no preset language has been configured, the current system language of the electronic device may be used as the default.
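This fallback can be stated compactly; in the sketch below, a user-selected language takes precedence and the current system language is used otherwise, with java.util.Locale standing in for the language type as an assumption of the example.

```kotlin
import java.util.Locale

// Resolve the preset language: the language set via the language setting
// operation wins; otherwise default to the current system language.
fun resolvePresetLanguage(userSelected: Locale?): Locale =
    userSelected ?: Locale.getDefault()
```

For instance, resolvePresetLanguage(null) yields the device's system language, matching the default behaviour described above.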
In some embodiments, the electronic device may further perform a voice broadcast using Text-to-Speech (TTS) technology in response to a broadcast operation for the recognition text or the translation text, so as to assist users with language impairments in a voice conversation.
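A hedged sketch of the broadcast choice is shown below; the TtsEngine interface is hypothetical and would wrap the platform's actual text-to-speech service on a real device.

```kotlin
// Hypothetical abstraction over the device's text-to-speech service.
interface TtsEngine {
    fun speak(text: String)
}

// Broadcast the translation when one is available and requested,
// otherwise broadcast the recognition text itself.
fun announce(recognized: String, translated: String?, preferTranslation: Boolean, tts: TtsEngine) {
    tts.speak(if (preferTranslation && translated != null) translated else recognized)
}
```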
Of course, the above embodiments may also be implemented in combination. When displaying the voice recognition record, the electronic device can expand the text presentation control into a recognition detail panel, for example displayed in the middle area of the conversation interface, and cancel the display of the text presentation control; the voice recognition record in the course of the voice conversation is then displayed in the recognition detail panel. The recognition detail panel can be displayed over the conversation interface as a floating window with a dark background, allowing the user, while viewing the voice recognition record, to also use functions such as modifying recognition results and exporting files.
Illustratively, referring to fig. 10, the recognition detail panel may include a history button H, a function setting button S, a scene setting button B1, an input panel TV, a record display area B2, and a function control button P displayed in the record display area B2. The record display area B2 is used to display the voice recognition record. The history button H is used to trigger presentation of the detailed content of the voice recognition record: referring to fig. 11, after an operation on the history button H is detected, a history page A2 may be displayed to present the voice recognition record in detail. The presented voice recognition record may include user identifiers, times, text content, and the like, and the recognition content of speaking voices at different times can be browsed in response to the user's up and down sliding operations on the history page A2. The history page A2 may further include a file export button for exporting the voice recognition record to a record file.
The function setting button S may be used to trigger display of a function setting page: referring to fig. 12, after an operation on the function setting button S is detected, a function setting page A3 may be displayed, on which the user can set the transcription language (i.e., the language in which speech is recognized as text), the translation language, the voice type for voice broadcast, and so on. The scene setting button B1 can be used to set a recognition scene, such as a conference scene or a chat scene. The input panel TV can be used to add remarks to recognition results during the voice conversation, and includes a confirm button E for confirming the input remark content. The function control button P is used to pause and resume speech recognition.
In addition, the electronic device may control the display state of the recognition detail panel in response to operations on the panel; for example, it may move the panel in response to a drag operation on it, or control the display size of the panel in response to a long press operation on it.
In some embodiments, when the text presentation control is displayed in the conversation interface, the electronic device may further respond to a sliding operation on the control and adjust the control's position in the conversation interface according to the sliding operation. For example, the text presentation control may be displayed on the left side of the conversation interface, and in response to an upward or downward sliding operation on it, the electronic device may move the control up or down to follow the operation.
In some embodiments, when the text presentation control is displayed in the conversation interface, the electronic device may also move the control to follow a sliding operation and, when the control reaches a specified position, trigger hiding of the control, that is, cancel its display. For example, referring to fig. 13, the text presentation control may be displayed on the left side of the conversation interface; when a sliding operation on the control is detected, prompt information may be shown in the bottom area of the interface, prompting the user to drag the control to a target position in that area to hide it. When the control is dragged to the target position, its display can be cancelled.
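For illustration, the drag-to-hide behaviour can be sketched as below; the coordinate types and the bottom target area are assumptions of the example rather than part of the embodiment.

```kotlin
data class Point(val x: Float, val y: Float)

data class Rect(val left: Float, val top: Float, val right: Float, val bottom: Float) {
    fun contains(p: Point) = p.x in left..right && p.y in top..bottom
}

class TextPresentationControl(var position: Point, var visible: Boolean = true)

// Move the control with the sliding operation; cancel its display when it is
// dropped on the target position in the bottom area.
fun onDragEnd(control: TextPresentationControl, dropPoint: Point, hideTarget: Rect) {
    control.position = dropPoint
    if (hideTarget.contains(dropPoint)) control.visible = false
}
```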
With the text display method provided by the embodiments of the present application, when the electronic device carries out a voice conversation, the interface content of the conversation interface can be used to display the text corresponding to an utterance in association with the object information of the conversation object who produced it, which avoids requiring the user to enter voiceprint information for each conversation object in advance and improves the user experience. Moreover, when the text corresponding to an utterance is displayed in association with the speaker information, the recognition content is presented through the text presentation control in the conversation interface, and the user can operate this control to view the recognition record of the voice conversation, improving both the display effect and the operating experience.
The text display method according to the foregoing embodiment will be described with reference to fig. 14.
As shown in fig. 14, with the voice subtitle function enabled, if the electronic device determines that the current scene is a voice conversation scene, it may obtain the recognition language for speech recognition and the translation language for the recognition text, capture the system audio in real time and convert it into an audio stream, capture a screenshot of the conversation interface, and record a timestamp. It then performs Speech-To-Text (STT) on the system audio according to the recognition language to obtain the recognition text, and identifies the speaker information from the captured screenshot. Next, the recognition text is rendered together with the speaker information and the timestamp, and the result is displayed. The user can correct the displayed result, and, if the user has agreed to the user experience plan, the corrected text can be uploaded to the server for STT optimization. Finally, the recognition text can be translated based on the corrected result and displayed on a page together with the translation, making it convenient for the user to review both the recognition text and the translation result; the translation result also supports TTS broadcast.
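The fig. 14 flow can be summarized as the following orchestration sketch; every function is a stub standing in for a component described above (audio capture, screenshot, STT, speaker identification, translation), and none of them is a real platform API.

```kotlin
data class Frame(val pixels: ByteArray)

// Stubs for the components of the fig. 14 flow; real implementations would
// capture system audio, take a screenshot, and run STT / translation.
fun captureSystemAudio(): ByteArray = ByteArray(0)
fun captureScreen(): Frame = Frame(ByteArray(0))
fun speechToText(audio: ByteArray, language: String): String = ""
fun identifySpeaker(frame: Frame): String = ""
fun translate(text: String, target: String): String = text

fun runPipeline(recognitionLanguage: String, translationLanguage: String) {
    val audio = captureSystemAudio()
    val frame = captureScreen()
    val timestamp = System.currentTimeMillis()   // recorded with the screenshot

    val recognized = speechToText(audio, recognitionLanguage)
    val speaker = identifySpeaker(frame)
    val translated = translate(recognized, translationLanguage)

    // Render the recognition text together with the speaker and timestamp.
    println("[$timestamp] $speaker: $recognized / $translated")
}
```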
Referring to fig. 15, a block diagram of a text display apparatus 400 according to an embodiment of the present application is shown. The text display apparatus 400 is applied to the electronic device described above and includes: an interface display module 410, a user recognition module 420, a voice recognition module 430, and a recognition output module 440. The interface display module 410 is configured to display a session interface of a voice session; the user recognition module 420 is configured to identify speaker information of the voice session based on the interface content of the session interface; the voice recognition module 430 is configured to recognize the speaking voice of the voice session and obtain the recognition text corresponding to the speaking voice; and the recognition output module 440 is configured to display the recognition text in association with the speaker information.
In some embodiments, the user recognition module 420 may be specifically configured to: recognize the speaking state of a session object of the voice session based on the interface image of the session interface; and acquire the object information of the session object in the speaking state to obtain the speaker information.
As a possible implementation, the interface image includes a state identifier for characterizing the speaking state of a session object in the voice session. The user recognition module 420 recognizing the speaking state of a session object based on the interface image of the session interface may include: identifying the display state of the state identifier corresponding to each session object based on the interface image; and, if the display state of the state identifier of a first session object satisfies a first state condition, determining that the first session object is in a speaking state, where the first session object is any session object in the voice session.
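As an illustrative reading of the first state condition, the sketch below treats a highlighted or animating state identifier as indicating a speaking session object; the enumerated states are assumptions, since the condition itself is left open by the embodiment.

```kotlin
// Assumed display states of a session object's state identifier.
enum class IdentifierState { HIGHLIGHTED, ANIMATING, IDLE }

// First state condition (assumed): the identifier is highlighted or animating.
fun isSpeaking(state: IdentifierState): Boolean =
    state == IdentifierState.HIGHLIGHTED || state == IdentifierState.ANIMATING
```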
As a possible implementation, the voice session is a remote video session, and the interface image includes a video image area of the remote video session. The user recognition module 420 recognizing the speaking state of a session object based on the interface image of the session interface may include: identifying the mouth state of each session object in the video image area based on the interface image; and, if the mouth state of a second session object satisfies a second state condition, determining that the second session object is in a speaking state, where the second session object is any session object in the voice session.
In some embodiments, the speaker information includes object information of the session objects in a speaking state at different times. The recognition output module 440 may be specifically configured to: acquire, based on the timestamp corresponding to an utterance, the object information of the target session object corresponding to that timestamp in the speaker information, where the target session object is in a speaking state at the moment corresponding to the timestamp; and display the recognition text in association with the object information of the target session object.
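The timestamp match can be sketched as an interval lookup; representing speaking periods as intervals is an assumption of the example, not a detail given by the embodiment.

```kotlin
// A period during which one session object was in a speaking state.
data class SpeakingInterval(val speaker: String, val startMs: Long, val endMs: Long)

// Return the object information of the target session object that was
// speaking at the moment corresponding to the utterance timestamp, if any.
fun speakerAt(timestampMs: Long, intervals: List<SpeakingInterval>): String? =
    intervals.firstOrNull { timestampMs in it.startMs..it.endMs }?.speaker
```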
As a possible implementation, the recognition output module 440 displaying the recognition text in association with the target session object may include: displaying the recognition text after associating it with the user identifier corresponding to the target session object.
In some embodiments, the conversation interface includes a text presentation control. The recognition output module 440 may be specifically configured to: display, in the text presentation control, the recognition text corresponding to the speaking voice at the current moment after associating it with the object information of the conversation object in a speaking state at the current moment.
As a possible implementation, the text display apparatus 400 may further include a first response module. The first response module is configured to, after the recognition text corresponding to the speaking voice at the current moment is displayed in the text presentation control in association with the speaker at the current moment, display the voice recognition record of the voice conversation in the conversation interface in response to a first preset operation for the text presentation control, where the voice recognition record includes the recognition texts corresponding to speaking voices at different moments in the voice conversation and the object information of the conversation objects in a speaking state at those moments.
Optionally, the text display apparatus 400 may further include a second response module. The second response module is configured to, after the voice recognition record is displayed in the conversation interface in response to the first preset operation for the text presentation control, control the voice recognition record to enter an editable state in response to a second preset operation for the record, and modify the record in response to an editing operation on it.
Optionally, the text display apparatus 400 may further include a third response module. The third response module is configured to, after the voice recognition record is displayed in the conversation interface in response to the first preset operation for the text presentation control, generate a record file of the voice recognition record in response to a file export operation for the record.
In some embodiments, the text display apparatus 400 may further include a text translation module. The text translation module is configured to translate the recognition text into a translation text in a preset language before the recognition text is displayed in association with the speaker information. In this case, the recognition output module 440 may be configured to display the recognition text and the translation text in association with the speaker information.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical, or of another type.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
In summary, according to the solution provided by the present application, a conversation interface of a voice conversation is displayed; speaker information of the voice conversation is identified based on the interface content of the conversation interface; the speaking voices of the voice conversation are recognized to obtain the corresponding recognition text; and the recognition text is then displayed in association with the speaker information. In this way, when the electronic device carries out a voice conversation, the interface content of the conversation interface is used to display the text corresponding to an utterance in association with the corresponding speaker, which avoids requiring the user to enter voiceprint information for conversation objects in advance and improves the user experience.
Referring to fig. 16, a block diagram of an electronic device according to an embodiment of the present application is shown. The electronic device 100 may be a smart phone, a tablet computer, a smart watch, an electronic book reader, or another electronic device capable of running applications. The electronic device 100 in the present application may include one or more of the following components: a processor 110, a memory 120, and one or more applications, where the one or more applications may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more applications being configured to perform the methods described in the foregoing method embodiments.
The processor 110 may include one or more processing cores. The processor 110 connects the various parts within the electronic device 100 using various interfaces and lines, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and by calling data stored in the memory 120. Optionally, the processor 110 may be implemented in hardware in at least one of the forms of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, applications, and so on; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 120 may be used to store instructions, programs, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the method embodiments described above, and the like. The data storage area may store data created by the electronic device 100 during use (such as a phone book, audio and video data, or chat log data), and the like.
Referring to fig. 17, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 800 stores program code that can be invoked by a processor to execute the methods described in the above method embodiments.
The computer-readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 800 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 800 has storage space for program code 810 that performs any of the method steps of the methods described above. The program code can be read from or written into one or more computer program products, and the program code 810 may be compressed, for example, in a suitable form.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications and replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A text presentation method, the method comprising:
displaying a session interface of the voice session;
based on the interface content of the session interface, identifying speaker information of the voice session;
recognizing speech of the speech conversation to obtain a recognition text corresponding to the speech;
and displaying the recognition text after associating the recognition text with the speaker information.
2. The method of claim 1, wherein identifying speaker information for the voice conversation based on interface content of the conversation interface comprises:
recognizing the speaking state of a conversation object of the voice conversation based on the interface image of the conversation interface;
and acquiring the object information of the conversation object in the speaking state to obtain the speaker information.
3. The method of claim 2, wherein the interface image comprises a state identifier for characterizing a speaking state of a conversation object in the voice conversation, and wherein the recognizing the speaking state of the conversation object in the voice conversation based on the interface image of the conversation interface comprises:
identifying the display state of the state identifier corresponding to each session object based on the interface image of the session interface;
and if the display state of the state identifier of the first session object meets a first state condition, determining that the first session object is in a speaking state, wherein the first session object is any session object in the voice session.
4. The method of claim 2, wherein the voice session is a remote video session, the interface image comprises a video image area of the remote video session, and the identifying the speaking status of the session object of the voice session based on the interface image of the session interface comprises:
identifying a mouth state of each session object in the video image area based on an interface image of the session interface;
and if the mouth state of a second session object meets a second state condition, determining that the second session object is in a speaking state, wherein the second session object is any session object in the voice session.
5. The method according to claim 1, wherein the speaker information includes object information of a conversation object in a speaking state at different times, and the associating the recognition text with the speaker information and displaying the recognition text includes:
acquiring object information of a target session object corresponding to the timestamp in the speaker information based on the timestamp corresponding to the speaking voice, wherein the target session object is in a speaking state at the moment corresponding to the timestamp;
and displaying the recognition text after associating the recognition text with the object information of the target session object.
6. The method of claim 5, wherein associating the recognition text with the target session object for presentation comprises:
and displaying the recognition text after associating the recognition text with the user identifier corresponding to the target session object.
7. The method according to any one of claims 1-6, wherein the conversation interface includes a text presentation control, and wherein associating the recognized text with the speaker information for presentation comprises:
and displaying, in the text presentation control, the recognition text corresponding to the speaking voice at the current moment after associating the recognition text with the object information of the conversation object in the speaking state at the current moment.
8. The method of claim 7, wherein after the recognition text corresponding to the speaking voice at the current moment is associated with the speaker at the current moment and displayed in the text presentation control, the method further comprises:
in response to a first preset operation for the text presentation control, displaying a voice recognition record in the course of the voice conversation in the conversation interface, wherein the voice recognition record comprises recognition texts corresponding to speaking voices at different moments in the voice conversation and object information of the conversation objects in speaking states at the corresponding moments.
9. The method of claim 8, wherein after the presenting of a voice recognition record in the course of the voice conversation in the conversation interface in response to the first preset operation for the text presentation control, the method further comprises:
controlling the voice recognition record to be in an editable state in response to a second preset operation for the voice recognition record;
modifying the voice recognition record in response to an editing operation for the voice recognition record.
10. The method of claim 8, wherein after the presenting of a voice recognition record in the course of the voice conversation in the conversation interface in response to the first preset operation for the text presentation control, the method further comprises:
generating a record file of the voice recognition record in response to a file export operation for the voice recognition record.
11. The method of any of claims 1-6, wherein prior to said associating said recognized text with said speaker information for presentation, said method further comprises:
translating the recognition text into a translation text of a preset language;
the displaying after associating the recognition text with the speaker information includes:
and displaying the recognition text and the translation text after associating the recognition text and the translation text with the speaker information.
12. A text presentation device, the device comprising: an interface display module, a user recognition module, a voice recognition module, and a recognition output module, wherein,
the interface display module is used for displaying a conversation interface of the voice conversation;
the user recognition module is used for identifying the speaker information of the voice conversation based on the interface content of the conversation interface;
the voice recognition module is used for recognizing the speaking voice of the voice conversation to obtain a recognition text corresponding to the speaking voice;
and the recognition output module is used for displaying the recognition text after associating the recognition text with the speaker information.
13. An electronic device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-11.
14. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 11.
CN202111669749.1A 2021-12-31 2021-12-31 Text display method and device, electronic equipment and storage medium Pending CN114373464A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111669749.1A CN114373464A (en) 2021-12-31 2021-12-31 Text display method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111669749.1A CN114373464A (en) 2021-12-31 2021-12-31 Text display method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114373464A (en) 2022-04-19

Family

ID=81142982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111669749.1A Pending CN114373464A (en) 2021-12-31 2021-12-31 Text display method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114373464A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115623132A (en) * 2022-11-18 2023-01-17 北京中电慧声科技有限公司 Intelligent conference system
CN115623132B (en) * 2022-11-18 2023-04-04 北京中电慧声科技有限公司 Intelligent conference system

Similar Documents

Publication Publication Date Title
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
CN108363706B (en) Method and device for man-machine dialogue interaction
CN106024009B (en) Audio processing method and device
EP2210214B1 (en) Automatic identifying
CN110517689B (en) Voice data processing method, device and storage medium
CN108847214B (en) Voice processing method, client, device, terminal, server and storage medium
CN107403011B (en) Virtual reality environment language learning implementation method and automatic recording control method
CN107767864B (en) Method and device for sharing information based on voice and mobile terminal
CN111126009A (en) Form filling method and device, terminal equipment and storage medium
CN111833876A (en) Conference speech control method, system, electronic device and storage medium
US11281707B2 (en) System, summarization apparatus, summarization system, and method of controlling summarization apparatus, for acquiring summary information
CN107945806B (en) User identification method and device based on sound characteristics
JP7279494B2 (en) CONFERENCE SUPPORT DEVICE AND CONFERENCE SUPPORT SYSTEM
US20210105437A1 (en) Information processing device, information processing method, and storage medium
CN110991329A (en) Semantic analysis method and device, electronic equipment and storage medium
CN111836062A (en) Video playing method and device and computer readable storage medium
CN110943908A (en) Voice message sending method, electronic device and medium
CN113014857A (en) Control method and device for video conference display, electronic equipment and storage medium
CN113033245A (en) Function adjusting method and device, storage medium and electronic equipment
CN114373464A (en) Text display method and device, electronic equipment and storage medium
TWI769520B (en) Multi-language speech recognition and translation method and system
CN110992958B (en) Content recording method, content recording apparatus, electronic device, and storage medium
WO2021057957A1 (en) Video call method and apparatus, computer device and storage medium
CN111161710A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
CN114491087A (en) Text processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination