WO2024093443A1 - Information display method and apparatus based on voice interaction, and electronic device - Google Patents

Information display method and apparatus based on voice interaction, and electronic device

Info

Publication number: WO2024093443A1
Authority: WO (WIPO PCT)
Prior art keywords: interaction, segment, real-time, information
Application number: PCT/CN2023/113531
Other languages: English (en), French (fr)
Inventors: 李想, 杨文海
Original assignee: 北京字跳网络技术有限公司 (Beijing Zitiao Network Technology Co., Ltd.)
Application filed by 北京字跳网络技术有限公司
Publication of WO2024093443A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/103 - Formatting, i.e. changing of presentation of documents
    • G06F 40/106 - Display of layout of documents; Previewing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/103 - Formatting, i.e. changing of presentation of documents
    • G06F 40/117 - Tagging; Marking up; Designating a block; Setting of attributes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/258 - Heading extraction; Automatic titling; Numbering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/04 - Segmentation; Word boundary detection

Definitions

  • The present disclosure relates to the field of Internet technology, and in particular to an information display method and apparatus based on voice interaction, and an electronic device.
  • An embodiment of the present disclosure provides an information display method based on voice interaction, the method comprising: determining an interaction segment of a real-time voice interaction based on operation information on an interaction-related document for the real-time voice interaction; and displaying segment information of the determined interaction segment.
  • the present disclosure provides an information display method based on voice interaction.
  • the method comprises: performing speech recognition on a speech signal time period in the real-time speech interaction to obtain a speech recognition result; determining an interaction segment of the real-time speech interaction according to the speech recognition result; and displaying segmentation information of the determined interaction segment.
  • an embodiment of the present disclosure provides an information display device based on voice interaction, comprising: a recognition module, used to perform voice recognition on a voice signal time period in the real-time voice interaction to obtain a voice recognition result; a determination module, used to determine the interaction segment of the real-time voice interaction based on the voice recognition result; and a display module, used to display segment information of the determined interaction segment.
  • an embodiment of the present disclosure provides an information display device based on voice interaction, comprising: a determination unit for determining an interaction segment of the real-time voice interaction based on operation information of interaction-related documents for the real-time voice interaction; and a display unit for displaying segment information of the determined interaction segment.
  • an embodiment of the present disclosure provides an electronic device, comprising: one or more processors; a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the information display method based on voice interaction as described in the first aspect.
  • an embodiment of the present disclosure provides a computer-readable medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the information display method based on voice interaction as described in the first aspect.
  • FIG. 1 is a flowchart of an embodiment of an information display method based on voice interaction according to the present disclosure;
  • FIG. 2 is a flowchart of an optional implementation according to the present disclosure;
  • FIG. 3 is a flowchart of an optional implementation according to the present disclosure;
  • FIG. 4 is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
  • FIG. 5 is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
  • FIG. 6 is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
  • FIG. 7A is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
  • FIG. 7B is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
  • FIG. 7C is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
  • FIG. 8 is a flowchart of an embodiment of an information display method based on voice interaction according to the present disclosure;
  • FIG. 9 is a schematic structural diagram of an embodiment of an information display device based on voice interaction according to the present disclosure;
  • FIG. 10 is a schematic structural diagram of an embodiment of an information display device based on voice interaction according to the present disclosure;
  • FIG. 11 is an exemplary system architecture to which the information display method based on voice interaction according to an embodiment of the present disclosure can be applied;
  • FIG. 12 is a schematic diagram of a basic structure of an electronic device provided according to an embodiment of the present disclosure.
  • Figure 1 shows a flow chart of an embodiment of a method for displaying information based on voice interaction according to the present disclosure.
  • the method for displaying information based on voice interaction includes the following steps:
  • Step 101: Determine the interaction segment of the real-time voice interaction based on the operation information on the interaction-related document for the real-time voice interaction.
  • the execution subject (such as a server and/or a terminal device) of the information display method based on voice interaction can determine the interaction segmentation of the above-mentioned real-time voice interaction based on the operation information of the interaction-related documents in the real-time voice interaction.
  • For ease of description, the real-time voice interaction may be referred to simply as the interaction,
  • and a segment of the real-time voice interaction may be called an interaction segment.
  • the real-time voice interaction may be a voice interaction performed in real time using an electronic device, for example, may include online interaction using multimedia.
  • the multimedia may include but is not limited to at least one of audio and video.
  • the real-time voice interaction interface may be a related interface of the real-time voice interaction.
  • the application that enables real-time voice interaction can be any type of application, which is not limited here.
  • For example, the application can be an instant video interaction application, a communication application, a video playback application, or an email application.
  • The interaction segments of the real-time voice interaction can be bound to interaction time points, with the time period between two interaction time points serving as an interaction segment.
  • the interaction-related documents may include documents related to the interaction.
  • the interaction-related documents may include but are not limited to at least one of the following: a shared document bound to the interaction, and a document displayed during screen sharing.
  • A shared document may be bound to the interaction before the interaction starts, or during the interaction (i.e., a document shared during the meeting).
  • the operation information on the interaction-related document may indicate the operation on the interaction-related document.
  • The operations on the interaction-related documents may include but are not limited to at least one of the following: switching documents, opening documents, closing documents, browsing documents, selecting document titles, and annotating documents.
  • In some scenarios, the interaction segments can be determined according to the user's operation of switching between different interaction-related documents, with the time point of each switch between documents used as a demarcation point of the interaction segments.
  • The interaction segmentation can also be determined according to the user's operation of switching titles within an interaction-related document, with the time point of each title switch used as a demarcation point of the interaction segmentation.
  • Likewise, the interaction segmentation can be determined according to the user's operation of browsing the interaction-related documents, with the page-turning time points used as demarcation points of the interaction segmentation (see the sketch below).
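  • As a minimal illustrative sketch (the operation names, data shape, and function below are assumptions, not from the disclosure), the demarcation points described above can be derived directly from the timestamps of document operations:

```python
from dataclasses import dataclass

@dataclass
class DocOperation:
    timestamp: float  # seconds from the start of the real-time voice interaction
    kind: str         # e.g. "switch_document", "switch_title", "turn_page"

def segments_from_operations(ops, interaction_end,
                             boundary_kinds=("switch_document", "switch_title", "turn_page")):
    """Use each qualifying operation time as a demarcation point between interaction segments."""
    cuts = sorted(op.timestamp for op in ops if op.kind in boundary_kinds)
    bounds = [0.0] + cuts + [interaction_end]
    return [(bounds[i], bounds[i + 1])
            for i in range(len(bounds) - 1) if bounds[i + 1] > bounds[i]]

# Title switches at minutes 10 and 25 of a 40-minute interaction
ops = [DocOperation(600.0, "switch_title"), DocOperation(1500.0, "switch_title")]
print(segments_from_operations(ops, 2400.0))
# [(0.0, 600.0), (600.0, 1500.0), (1500.0, 2400.0)]
```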
  • Step 102: Display the segment information of the determined interaction segment.
  • In this embodiment, the above execution subject may display the segment information of the determined interaction segment.
  • The segment information may indicate the relevant conditions of the interaction segment.
  • The segment information may include but is not limited to at least one of the following: the segment time and the segment topic.
  • The display position of the segment information can be determined according to the actual application scenario and is not limited here.
  • For example, the segment information may be displayed in the interaction summary area.
  • The segment information may also include text converted from the interaction speech.
  • the solution of the present application can be implemented offline or in real time.
  • the segmentation of the recording of the real-time multimedia conference is essentially an offline process.
  • the information display method based on voice interaction can provide a new method for determining interaction segmentation by determining the interaction segmentation of real-time voice interaction based on the operation information on interaction-related documents, so that the determined interaction segmentation can refer to the document display process of the interaction-related documents.
  • Participating users often carry out the interaction along with the display process of the interaction-related documents. Therefore, by determining the interaction segmentation based on the operation information on the interaction-related documents and displaying the segmentation information, it is possible to determine interaction segments that fit the actual course of the real-time voice interaction, thereby improving the accuracy of the determined interaction segmentation and interaction information.
  • the embodiment of the present application implements document-based interaction video segmentation, which can effectively structure the interaction and help users find and locate the interaction content.
  • the above step 101 may include: determining the interaction segment of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction.
  • the sound signals of real-time voice interaction can be divided into different categories according to different classification criteria.
  • If classified by whether a voice is included, the sound signals may include speech signals and non-speech signals; if classified by sound intensity, the sound signals may include sound signals greater than a preset intensity threshold and sound signals not greater than the preset intensity threshold.
  • In some implementations, the portion with sound intensity greater than the preset intensity threshold may be detected first according to the preset intensity threshold, and the voice signal may then be detected within that portion (see the sketch below).
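  • A hedged sketch of this two-stage detection, assuming an RMS energy measure and a caller-supplied per-frame speech detector (both are illustrative choices, not specified by the disclosure):

```python
import numpy as np

def loud_frames(samples: np.ndarray, frame_len: int, intensity_threshold: float):
    """Indices of frames whose RMS energy exceeds the preset intensity threshold."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames.astype(np.float64) ** 2).mean(axis=1))
    return [i for i, energy in enumerate(rms) if energy > intensity_threshold]

def detect_speech_frames(samples, frame_len, intensity_threshold, is_speech_frame):
    """Stage 1: keep only frames above the intensity threshold.
    Stage 2: run the speech detector on those frames only."""
    return [i for i in loud_frames(samples, frame_len, intensity_threshold)
            if is_speech_frame(samples[i * frame_len: (i + 1) * frame_len])]
```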
  • the sound signal may be divided into a voice signal period and a period not including the voice signal.
  • A period of time in the sound signal that does not include a voice signal can be used as a demarcation point between interaction segments.
  • the voice signal period in the sound signal can be segmented according to the operation information for the interaction-related document. For example, the operation of switching the interaction-related document in the voice signal period can be used as a demarcation point of the voice signal period to segment the voice signal period.
  • Participating users may stop talking when moving on to content of a different topic.
  • The period of stopped talking may indicate a dividing point between interaction segments of the real-time voice interaction; therefore, the real-time voice interaction is segmented by combining the operation information on the interaction-related documents with the sound signal of the real-time voice interaction.
  • In this way, the operation information and the sound signal, two types of information that can characterize the dividing points of the interaction segmentation, can both be consulted to determine a more accurate interaction segmentation.
  • the above steps determine the interaction segmentation of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction, which may include the process shown in Figure 2.
  • the process shown in Figure 2 may include steps 201, 202 and 203.
  • Step 201: Perform speech recognition on a speech signal period in the real-time speech interaction to obtain a speech recognition result.
  • The sound signal of the real-time voice interaction may include a voice signal. Therefore, voice signal time periods can be determined from the real-time voice interaction according to whether a time period includes a voice signal that lasts for a preset duration. When determining the preset duration, an interruption duration threshold can be set: if the interruption between two voice signals is shorter than the interruption duration threshold, the two voice signals can be regarded as one continuous voice signal with no interruption between them (see the sketch below).
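  • A minimal sketch of the interruption-duration rule, assuming voice signal periods are represented as (start, end) pairs in seconds (an illustrative representation):

```python
def merge_voice_periods(periods, interruption_threshold=1.0):
    """Treat two voice periods as one continuous voice signal if the gap
    between them is shorter than the interruption duration threshold."""
    merged = []
    for start, end in sorted(periods):
        if merged and start - merged[-1][1] < interruption_threshold:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))  # gap too short: no interruption
        else:
            merged.append((start, end))
    return merged

# A 0.4 s pause keeps one period; a 5 s pause starts a new one
print(merge_voice_periods([(0.0, 3.0), (3.4, 8.0), (13.0, 20.0)]))
# [(0.0, 8.0), (13.0, 20.0)]
```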
  • voice recognition can be performed on the voice signal period in the real-time voice interaction to obtain a voice recognition result, which can include text information.
  • Step 202: Segment the real-time voice interaction according to the semantic division result of the speech recognition result to obtain candidate segments.
  • the speech recognition results can be semantically segmented to obtain the corresponding segments of the text information.
  • Each segmented text information can correspond to the time point of the real-time voice interaction.
  • In some implementations, the above step 202 may include: semantically dividing the speech recognition result into at least two segments; and determining the dividing points of the real-time voice interaction segmentation based on the time dividing points between two adjacent pieces of the speech recognition result, obtaining adjacent candidate segments of the real-time voice interaction.
  • As an example, the speech recognition result is semantically divided into two segments; the time point at which the speech recognition result is divided can then be used as the dividing point of the real-time voice interaction segmentation, yielding two adjacent candidate segments of the real-time voice interaction.
  • In this way, the multimedia can be segmented to obtain candidate segments (see the sketch below).
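  • A sketch of step 202 under stated assumptions: each recognized utterance carries its time span, and a semantic splitter (a stand-in callback here, not an API from the disclosure) groups utterances by topic; the shared time boundary between adjacent groups becomes the candidate-segment dividing point:

```python
def candidate_segments(utterances, semantic_split):
    """utterances: time-ordered list of (start, end, text).
    semantic_split: callback returning groups of utterance indices per topic,
    e.g. [[0, 1], [2, 3, 4]]."""
    groups = semantic_split([text for _, _, text in utterances])
    segments = []
    for group in groups:
        start = utterances[group[0]][0]   # start of the first utterance in the group
        end = utterances[group[-1]][1]    # end of the last utterance in the group
        segments.append((start, end))
    return segments
```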
  • Step 203: Adjust the candidate segments according to the operation information on the interaction-related documents to obtain the interaction segments.
  • At least one of the following adjustments may be performed according to the above operation information: merging two candidate segments into one interaction segment, adjusting the time points of an existing candidate segment, and dividing one candidate segment into at least two interaction segments (see the sketch after this item).
  • In this way, the speech recognition result can be semantically segmented to obtain candidate segments, and the candidate segments can then be adjusted according to the operation information on the interaction-related documents, thereby improving the accuracy of the interaction segmentation.
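  • As an illustrative sketch of one such adjustment (boundary moving; merging and splitting follow the same pattern), a candidate boundary can be snapped to a nearby document-operation time; the function name and tolerance value are assumptions:

```python
def snap_boundaries(candidates, op_times, tolerance=10.0):
    """Move a candidate segment's end to a document-operation time that lies
    within `tolerance` seconds of it, then restore contiguity."""
    adjusted = []
    for start, end in candidates:
        for t in op_times:
            if abs(end - t) <= tolerance:
                end = t  # boundary adjusted to the operation time
        adjusted.append((start, end))
    for i in range(1, len(adjusted)):  # keep adjacent segments contiguous
        adjusted[i] = (adjusted[i - 1][1], adjusted[i][1])
    return adjusted

print(snap_boundaries([(0.0, 95.0), (95.0, 200.0)], op_times=[100.0]))
# [(0.0, 100.0), (100.0, 200.0)]
```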
  • In some implementations, determining the interaction segmentation of the real-time voice interaction based on the operation information of the interaction-related document and the sound signal of the real-time voice interaction may include: if the duration of a time period in which the sound signal does not include a voice signal is greater than a preset first duration threshold, determining that time period as a first type of interaction segment.
  • the specific value of the preset first duration threshold can be set according to the actual application scenario, for example, it can be 30 seconds.
  • Otherwise, the time period can be merged into the previous or subsequent time period, or it can be split, with part merged into the previous time period and part merged into the subsequent time period.
  • In this way, the silent time periods in the interaction can be accurately found.
  • The sound signal in the interaction may include voice signals and non-voice signals; a time period consisting of non-voice signals, even though it contains sound, can be excluded from the voice-signal segmentation through the division of this implementation, thereby improving the accuracy of the interaction-based segmentation (see the sketch below).
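  • A minimal sketch of the first-type rule, assuming voice periods as (start, end) pairs and the example 30-second threshold from the text:

```python
def first_type_segments(voice_periods, interaction_end, first_threshold=30.0):
    """Every voiceless stretch longer than the first duration threshold
    becomes a first type of interaction segment."""
    silences, prev_end = [], 0.0
    for start, end in sorted(voice_periods):
        if start - prev_end > first_threshold:
            silences.append((prev_end, start))
        prev_end = max(prev_end, end)
    if interaction_end - prev_end > first_threshold:
        silences.append((prev_end, interaction_end))
    return silences

print(first_type_segments([(40.0, 100.0)], interaction_end=200.0))
# [(0.0, 40.0), (100.0, 200.0)]
```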
  • the above step 203 may include: determining the title switching time of the interaction-related document according to the presentation position information of the interaction-related document; and adjusting the start and end times of the candidate segments according to the title switching time.
  • the title switching time is used to indicate the time of switching different sub-sections of the interactively related documents.
  • the above-mentioned presentation location information may include document location information bound to time.
  • the presentation location information may be the location where the document is presented.
  • the above demonstration location information can be determined in multiple ways.
  • the presentation location information may be determined based on at least one of the following: a title switching operation, a document focus, or document subject information corresponding to a comment.
  • The title switching operation may include the user triggering different items in the title, and may also include the user triggering titles at any level in the interaction-related documents.
  • the title switching time of the interactively related document may indicate the switching time of different items in the title.
  • the title switching time between the first section and the second section may indicate the time when the user switches the first section to the second section of the document.
  • In other words, the time points of the candidate segments are adjusted according to the title switching time.
  • The document subject information corresponding to displayed comments can also be used to determine the title switching time: for example, if the display of comments in the first section changes to the display of comments in the second section, the change time can be determined as the title switching time (see the sketch below).
  • In this way, the display focus can be captured during the interaction to adjust the candidate segments and, combined with the sound signal, the segmentation accuracy can be comprehensively improved from both the acoustic and the visual side.
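  • A hedged example of the comment-based signal above, assuming a time-ordered stream of (time, section) pairs for the displayed comments (the data shape is an assumption):

```python
def title_switch_times(comment_events):
    """Whenever the section of the displayed comment changes, the change
    time is taken as a title switching time."""
    switches, current = [], None
    for t, section in comment_events:
        if current is not None and section != current:
            switches.append(t)
        current = section
    return switches

print(title_switch_times([(0, "1"), (40, "1"), (95, "1.1"), (200, "2")]))
# [95, 200]
```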
  • FIG. 3 shows an optional implementation of the above step 102.
  • The process shown in FIG. 3 may include step 1021 and step 1022.
  • Step 1021: Construct a hierarchical relationship of the interaction segments based on the voice signal and/or document switching operation in the real-time voice interaction.
  • Step 1022: Display the segment information with the hierarchical relationship.
  • the document switching operation can be used to switch the interaction-related documents of the real-time voice interaction.
  • For example, if there are two interaction-related documents in the real-time voice interaction, denoted as the first document and the second document, the document switching operation can switch from the first document to the second document.
  • the interaction segmentation can be presented at different levels to reflect the relationship between the segments of the interaction segmentation.
  • For example, suppose three interaction segments are obtained, denoted as the first segment, the second segment, and the third segment; the hierarchy of the interaction segments is constructed, and the interaction segments are determined to span two levels: the first segment and the third segment belong to the first level, and the second segment belongs to the second level under the first segment. Accordingly, the segment information of the first and third segments is used as interaction first-level titles, the segment information of the second segment is used as an interaction second-level title, and this interaction second-level title sits below the interaction first-level title corresponding to the first segment.
  • Displaying the segment information with the hierarchical relationship may include displaying the relationship between the interaction segments in various forms; for example, the segment information of the first-level first and third segments is displayed at the top level, and the segment information of the second segment is displayed indented under the segment information of its first-level parent, as the sketch below illustrates.
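  • A toy rendering of the indented display just described, with an assumed (level, title) data shape:

```python
def render_outline(items, indent="    "):
    """items: list of (level, title); level 1 is an interaction first-level title."""
    return "\n".join(indent * (level - 1) + title for level, title in items)

print(render_outline([(1, "First segment"), (2, "Second segment"), (1, "Third segment")]))
# First segment
#     Second segment
# Third segment
```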
  • step 1021 may include: in response to no document switching operation being detected in the real-time voice interaction, determining an interaction first-level title of the real-time voice interaction based on the document first-level title of the interaction-related document.
  • the document directory can include multi-level titles, and the first-level title in the document directory can be called the document first-level title.
  • the interaction may include multiple levels of segments.
  • the first level segment of the interaction may include the second level segment of the interaction
  • the second level segment of the interaction may include the third level segment of the interaction.
  • the interaction directory may include multiple levels of interaction titles.
  • the first level title in the interaction directory may be called the first level interaction title, indicating the first level interaction segment.
  • the titles at all levels in the interaction directory may indicate the interaction segment.
  • the titles at all levels in the interaction may be the segmentation topics of the interaction segment, and the hierarchical relationship of the titles at all levels of the interaction is consistent with the hierarchical relationship of the interaction segments.
  • FIG. 4 shows a scenario in which the first-level document title of the interaction-related document is used as the first-level interaction title.
  • the play area 401 can play the interactive video of the real-time voice interaction.
  • The interaction first-level title 402 can be a document first-level title of the A document.
  • The interaction second-level title 403 can be a document second-level title in the A document, and the interaction second-level title 403 is the next-level title under the interaction first-level title 402.
  • For example, Section 1.1 belongs to the first chapter.
  • Similarly, another interaction first-level title in FIG. 4 can be another document first-level title in the A document.
  • The interaction segments and interaction titles of the real-time voice interaction are determined based on the document titles of the interaction-related documents, so that the interaction segments and interaction titles can be derived from the interaction-related documents, enabling users in online interactions to quickly and accurately determine the interaction progress.
  • In some implementations, the document identifier can be used as the interaction first-level title, and the document first-level title of the interaction-related document can be used as the second-level title of the interaction segment.
  • Specifically, step 1021 includes: in response to detecting a document switching operation in the real-time voice interaction, determining the first-level interaction title of the real-time voice interaction based on the document identifier of the interaction-related document; and determining the N-level interaction titles of the real-time voice interaction based on the document titles of the interaction-related document, where N ≥ 2.
  • FIG. 5 shows a scenario in which the document identifier of the interaction-related document is used as the interaction first-level title.
  • the play area 501 can play the interactive video of the real-time voice interaction.
  • the interactive first-level title 502 can be the document identifier of the A document.
  • the interactive second-level title 503 can be the document first-level title in the A document, and the interactive second-level title 503 is the sub-level title of the interactive first-level title 502, and the interactive third-level title 504 is the sub-level title of the interactive second-level title 503.
  • Section 1.1 belongs to the first chapter.
  • the interactive first-level title 505 can be the document identifier of the B document.
  • The document identifier is used as the interaction first-level title, so the interaction time period of the real-time voice interaction can be divided with the document as the main node; the interaction N-level titles (N ≥ 2) of the real-time voice interaction are determined according to the document titles, so that for interaction segments corresponding to the same interaction-related document, the hierarchical relationship can be quickly determined with the help of the document's own segmentation, as sketched below.
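  • A hedged sketch of the two titling branches, assuming documents are given as (identifier, headings) pairs; the function and data shapes are illustrative:

```python
def interaction_titles(doc_switch_detected, documents):
    """documents: list of (doc_id, [(heading_level, heading_text), ...])."""
    outline = []
    for doc_id, headings in documents:
        if doc_switch_detected:
            outline.append((1, doc_id))  # document identifier as interaction first-level title
            outline += [(level + 1, text) for level, text in headings]  # headings shift to N >= 2
        else:
            outline += [(level, text) for level, text in headings]  # document headings lead
    return outline

print(interaction_titles(True, [("A document", [(1, "Chapter 1"), (2, "Section 1.1")])]))
# [(1, 'A document'), (2, 'Chapter 1'), (3, 'Section 1.1')]
```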
  • In some implementations, step 1022 may include: displaying the segment information of the second type of interaction segment as an interaction first-level title.
  • The real-time voice interaction during the second type of interaction segment does not display interaction-related documents.
  • In that case, the interaction segment is determined as a second type of interaction segment.
  • The interaction first-level title 506 may indicate the second type of interaction segment. As shown in FIG. 5, the interaction first-level title 506 may carry the word "discussion", indicating that this interaction segment is a discussion between participants.
  • The second type of interaction segment and the document identifiers of the interaction-related documents are at the same level.
  • Using the interaction segments that do not display interaction-related documents as interaction first-level titles, so that user discussion segments are listed side by side with the interaction periods of document sharing, can make the hierarchical relationship of the interaction segments more accurate and reasonable; when the interaction process needs to be reviewed, the interaction structure can be determined quickly.
  • In some implementations, step 1021 may include: for a target interaction period in the real-time voice interaction, determining whether the target interaction period includes a third type of interaction segment based on the voice signal in the target interaction period; and, if the target interaction period includes a third type of interaction segment, adding a preset indication mark for the interaction segments after the third type of interaction segment in the target interaction period.
  • Specifically, the segment information level is lowered for the interaction segments after the third type of interaction segment in the target interaction period, and the preset indication mark is displayed in front of the lowered segment information.
  • the interaction segments in the target interaction period correspond to the same interaction-related document.
  • the segment topics of the interaction segments in the target interaction period belong to the same interaction-related document.
  • the target interaction period includes three interaction segments, the first interaction segment is communication between users, the second interaction segment is a third type of interaction segment, and the third interaction segment is a discussion with document A as the object.
  • The first interaction segment has at least one of the following characteristics: it is at the beginning of the recording of the real-time voice interaction, multiple people frequently speak in alternation, the recognized speech consists of short sentences (≤ 30 words per sentence), and the duration is longer than 1 minute; the segment is then marked at the beginning of the video, and its segment title is "introduction".
  • The second interaction segment has at least one of the following characteristics: after the first interaction segment, there is a silence segment of more than 5 minutes, in which case it can be considered the document reading stage (i.e., the third type of interaction segment). After the document reading stage, the interaction is segmented according to the document titles, and a first-level title "Comment" can be added before the document title.
  • That is, the determination conditions of the third type of interaction segment include the voice silence duration being greater than a third duration threshold (e.g., 5 minutes). The heuristics above are sketched below.
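  • An illustrative encoding of these heuristics; the thresholds (30 words per sentence, 1 minute, 5 minutes) come from the text above, while the feature names are assumptions:

```python
def is_introduction(segment):
    """Opening stretch: several alternating speakers, short recognized
    sentences, and more than one minute long."""
    return (segment["position"] == "start"
            and segment["speaker_count"] >= 2
            and segment["avg_words_per_sentence"] <= 30
            and segment["duration_s"] > 60)

def is_reading_stage(silence_duration_s, third_threshold_s=300):
    """Third type of interaction segment: silence longer than the third
    duration threshold (e.g. 5 minutes)."""
    return silence_duration_s > third_threshold_s

opening = {"position": "start", "speaker_count": 4,
           "avg_words_per_sentence": 12, "duration_s": 90}
print(is_introduction(opening), is_reading_stage(360))  # True True
```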
  • FIG. 6 shows an exemplary scenario in which the target interaction period includes a third type of interaction segment.
  • the play area 601 can play the interactive video of the real-time voice interaction.
  • The preset indication mark 602 (e.g., marked with the word "comment") is displayed as an interaction first-level title before the second-level interaction title 603.
  • the second-level interactive title 603 is the first-level document title of document A (i.e., the first chapter).
  • the third-level interactive title 604 is the next-level title of the second-level interactive title 603.
  • section 1.1 belongs to the first chapter.
  • Similarly, the preset indication mark 605 is displayed as an interaction first-level title and is displayed before the second-level interaction title 606.
  • The second-level interaction title 606 is a document first-level title of document A (i.e., the second chapter).
  • the segment information may include segment topics.
  • the above step 1022 may include: determining the segmentation topics of the interaction segment according to the document content of the interaction-related document in the real-time voice interaction; and displaying the segmentation topics.
  • FIG. 7A shows a related scenario for displaying segment information.
  • the play area 701 can play the interactive video of the real-time voice interaction.
  • For example, the document content of the interaction-related document may include the selection of lunch and dinner.
  • the real-time voice interaction segment may include two segments, the first segment corresponds to the title in the document or the summary of the document content (i.e., what to eat for lunch), i.e., the segment title 702, and the sub-title 703 of the segment title 702 (indicated with noodles); the second segment corresponds to another title in the document or the summary of the document content (i.e., what to eat for dinner), i.e., the segment title 704.
  • the segmented topics of the determined interaction segments can refer to the interaction-related documents. It can be understood that in real-time voice interaction with interaction-related documents, the characteristics of the interaction-related documents being fitted to the interaction can be fully utilized to accurately determine the segmented topics of the interaction segments, so that users can quickly understand the interaction process according to the segmented topics and improve the efficiency of obtaining interaction-related information.
  • displaying the segmented topics may include: displaying the segmented topics having a hierarchical relationship in the interaction minutes.
  • FIG. 7A shows an interaction minutes display area 705 .
  • segmented topics can be displayed, and there is a hierarchical relationship between the segmented topics.
  • the method further includes: in response to a trigger operation on the displayed segmented topic, jumping the recorded interaction video to the triggered interaction segment, and playing the triggered interaction segment.
  • the play area 701 may play the interactive segment indicated by the segment title 704 .
  • the display of the segmentation information of the determined interaction segment includes at least one of the following but is not limited to: during the real-time voice interaction process, displaying the segmentation information of the determined interaction segment; during the real-time voice interaction process and/or after the real-time voice interaction ends, displaying the segmentation information of the determined interaction segment in the voice recognition result corresponding to the real-time voice interaction.
  • displaying the interaction segmentation information can facilitate the users in the interaction to view the previous interaction structure in time and facilitate the users in the interaction to recall the interaction content that has been communicated.
  • displaying the segmentation information of the interaction segment in the speech recognition results can enable the user to intuitively obtain the interaction structure when recalling the content with the help of the speech recognition results, and can enable the user to further understand the speech recognition results with the help of the interaction structure, thereby helping the user to quickly obtain the interaction content.
  • the display of the segmentation information of the determined interaction segment may include at least one of the following but is not limited to: displaying the segmentation information corresponding to the time point on the timeline corresponding to the real-time voice interaction; displaying the segmentation information of the interaction segment in association with the document content information; and displaying the segmentation information of the interaction segment in association with the document structure.
  • FIG. 7B shows the timeline 706 corresponding to the real-time voice interaction in the interactive video.
  • For example, the document content of the interaction-related document may include the selection of lunch and dinner.
  • the real-time voice interaction segment may include two segments, namely, what to eat at noon starting from the 30th minute of the interaction and what to eat at night starting from the 60th minute.
  • On the timeline 706, what to eat at noon corresponding to the 30th minute may be displayed, and what to eat at night corresponding to the 60th minute may be displayed.
  • the document content information can be used to indicate the document content.
  • the document content information can include the document body and the document title.
  • the segment time corresponding to each body part can be displayed in the document body.
  • the document structure can be used to indicate the structure of a document.
  • The document structure can be the structure of the interaction-related documents.
  • the document structure display area 707 of FIG. 7C can display the document structure, and the document structure includes what to eat at noon in the first part of the document and what to eat at night in the second part of the document.
  • The segment time of the interaction segment (i.e., 00:00-30:00) can be displayed in association with the first part of the document structure, and the segment time of the interaction segment (i.e., 30:01-60:00) can be displayed in association with the second part.
  • Figure 8 shows a flow chart of an embodiment of a method for displaying information based on voice interaction according to the present disclosure.
  • the method for displaying information based on voice interaction includes the following steps:
  • Step 801: Perform speech recognition on a speech signal period in the real-time speech interaction to obtain a speech recognition result.
  • Step 802: Determine the interaction segment of the real-time voice interaction according to the speech recognition result.
  • Step 803: Display the segment information of the determined interaction segment.
  • In this embodiment, the interaction segmentation of the real-time voice interaction can be determined according to the speech recognition result; the speech recognition result can thus be used to indicate the difference in content between different interaction segments, from which the interaction segmentation is determined.
  • the accuracy of the interaction segmentation can be improved.
  • In some implementations, determining the interaction segment of the real-time voice interaction according to the voice recognition result includes: performing voice recognition on the voice signal time period in the real-time voice interaction to obtain the voice recognition result; segmenting the real-time voice interaction according to the semantic division result of the voice recognition result to obtain candidate segments; and determining the interaction segment according to the candidate segment times.
  • the real-time voice interaction is segmented according to the semantic division results of the voice recognition results to obtain candidate segments, including: semantically dividing the voice recognition results to divide the voice recognition results into at least two segments; determining the demarcation point of the real-time voice interaction segmentation according to the time demarcation point between two adjacent voice recognition results, to obtain two adjacent candidate segments of the real-time voice interaction.
  • the technical features of the embodiment corresponding to Figure 8 can be combined with any technical features or technical solutions in other embodiments of the present application.
  • Referring to Figure 9, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an information display device based on voice interaction; the device embodiment corresponds to the method embodiment shown in Figure 1, and the device can be applied to various electronic devices.
  • the information display device based on voice interaction of this embodiment includes: a determination unit 901 and a display unit 902.
  • the determination unit is used to determine the interaction segment of the real-time voice interaction based on the operation information of the interaction-related document for the real-time voice interaction; the display unit is used to display the segment information of the determined interaction segment.
  • The specific processing of the determination unit 901 and the display unit 902 of the voice interaction-based information display device, and the technical effects they bring about, can be found in the relevant descriptions of step 101 and step 102 in the embodiment corresponding to Figure 1, respectively, and will not be repeated here.
  • determining the interaction segmentation of the real-time voice interaction based on the operation information of the interaction-related documents for the real-time voice interaction includes: determining the interaction segmentation of the real-time voice interaction according to the operation information of the interaction-related documents and the sound signal of the real-time voice interaction.
  • the sound signal of the real-time voice interaction includes a voice signal; and determining the interaction segmentation of the real-time voice interaction based on the operation information for the interaction-related document and the sound signal of the real-time voice interaction includes: performing voice recognition on the voice signal time period in the real-time voice interaction to obtain a voice recognition result; segmenting the real-time voice interaction based on the semantic division result of the voice recognition result to obtain candidate segments; adjusting the candidate segmentation time based on the operation information for the interaction-related document to obtain the interaction segmentation.
  • segmenting the real-time voice interaction according to the semantic division results of the voice recognition results to obtain candidate segments includes: semantically dividing the voice recognition results to divide the voice recognition results into at least two segments; determining the dividing point of the real-time voice interaction segmentation according to the time dividing point between two adjacent voice recognition results to obtain two adjacent candidate segments of the real-time voice interaction.
  • the method of adjusting the candidate segmentation time according to the operation information on the interaction-related document to obtain the interaction segmentation includes: determining the title switching time of the interaction-related document according to the presentation position information of the interaction-related document; adjusting the start and end times of the candidate segmentation according to the title switching time, and the title switching time is used to indicate the time for switching different sub-parts of the interaction-related document.
  • the presentation location information is determined based on at least one of the following: a title switching operation, document subject information corresponding to a document focus, and document subject information corresponding to a currently displayed comment.
  • adjusting the candidate segmentation time according to the operation information on the interaction-related document to obtain the interaction segmentation includes: merging the two candidate segments in response to the time interval between the start time points of two candidate segments being less than a preset first duration threshold.
  • determining the interaction segmentation of the real-time voice interaction based on the operation information of the interaction-related document and the sound signal of the real-time voice interaction includes: if the duration of a time period in which the sound signal does not include a voice signal is greater than a preset first duration threshold, then determining the time period as a first type of interaction segment.
  • the displaying of the segmentation information of the determined interaction segmentation includes: constructing a hierarchical relationship of the interaction segmentation based on the voice signal and/or document switching operation in the real-time voice interaction; and displaying the segmentation information having the hierarchical relationship.
  • the hierarchical relationship of the interaction segments is constructed based on the voice signal and/or document switching operation in the real-time voice interaction, including: in response to no document switching operation being detected in the real-time voice interaction, determining the interaction first-level title of the real-time voice interaction based on the document first-level title of the interaction-related document.
  • In some embodiments, the step of constructing a hierarchical relationship of interaction segments based on the voice signal and/or document switching operation in the real-time voice interaction includes: in response to detecting a document switching operation during the real-time voice interaction, determining the first-level interaction title of the real-time voice interaction based on the document identifier of the interaction-related document; and determining the N-level interaction title of the real-time voice interaction based on the document title of the interaction-related document, where N ≥ 2.
  • the display of segmented information with a hierarchical relationship includes: displaying the segmented information of the second type of interaction segment as an interaction first-level title, wherein the real-time voice interaction during the second type of interaction segment does not display interaction-related documents, and the duration of the second type of interaction segment is greater than a preset second duration threshold.
  • the hierarchical relationship of interaction segments is constructed based on the voice signal and/or document switching operation in the real-time voice interaction, including: for the target interaction period in the real-time voice interaction, determining whether the target interaction period includes a third type of interaction segment according to the voice signal in the target interaction period, wherein the target interaction period corresponds to the same interaction-related document, and the determination condition of the third type of interaction segment includes that the voice silence duration is greater than a third duration threshold; if the target interaction period includes the third type of segment, for the interaction segment after the third type of interaction segment in the target interaction period, the segment information level is lowered, and a preset indication mark is displayed in front of the segment information after the lowering.
  • the segmentation information includes segmentation topics; and the display of the segmentation information having a hierarchical relationship includes: determining the segmentation topics of the interaction segment based on the document content of the interaction-related documents in the real-time voice interaction; and displaying the segmentation topics.
  • displaying the segmented topics includes: displaying the segmented topics with a hierarchical relationship in the interaction minutes.
  • the apparatus is further configured to: in response to a trigger operation on the segmented theme, jump the recorded interaction video to the triggered interaction segment, and play the triggered interaction segment.
  • the present disclosure provides an embodiment of an information display device based on voice interaction.
  • the device embodiment corresponds to the method embodiment shown in FIG. 8 , and the device can be specifically applied to various electronic devices.
  • The information display device based on voice interaction of this embodiment includes: a recognition module 1001, a determination module 1002 and a display module 1003.
  • The recognition module is used to perform voice recognition on a voice signal time period in the real-time voice interaction to obtain a voice recognition result;
  • a determination module is used to determine an interaction segment of the real-time voice interaction according to the voice recognition result
  • a display module is used to display segment information of the determined interaction segment.
  • determining the interaction segmentation of the real-time voice interaction based on the speech recognition result includes: performing speech recognition on the speech signal time period in the real-time voice interaction to obtain the speech recognition result; segmenting the real-time voice interaction based on the semantic division result of the speech recognition result to obtain candidate segments; and determining the interaction segmentation based on the candidate segmentation time.
  • segmenting the real-time voice interaction according to the semantic segmentation result of the voice recognition result to obtain candidate segments includes: semantically segmenting the voice recognition result to divide the voice recognition result into at least two segments; determining the demarcation point of the real-time voice interaction segmentation according to the time demarcation point between two adjacent voice recognition results to obtain two adjacent candidate segments of the real-time voice interaction.
  • Figure 11 shows an exemplary system architecture in which the information display method based on voice interaction of an embodiment of the present disclosure can be applied.
  • the system architecture may include terminal devices 1101, 1102, 1103, a network 1104, and a server 1105.
  • the network 1104 is used to provide a medium for communication links between the terminal devices 1101, 1102, 1103 and the server 1105.
  • the network 1104 may include various connection types, such as wired, wireless communication links, or optical fiber cables.
  • the terminal devices 1101, 1102, and 1103 can interact with the server 1105 through the network 1104 to receive or send messages, etc.
  • Various client applications such as web browser applications, search applications, and news information applications, can be installed on the terminal devices 1101, 1102, and 1103.
  • the client applications in the terminal devices 1101, 1102, and 1103 can receive user instructions and perform corresponding functions according to the user instructions, such as adding corresponding information to the information according to the user instructions.
  • Terminal devices 1101, 1102, and 1103 may be hardware or software. When terminal devices 1101, 1102, and 1103 are hardware, they may be various electronic devices with display screens that support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), and the like. When terminal devices 1101, 1102, and 1103 are software, they may be implemented as multiple software programs or software modules (for example, software or software modules for providing distributed services), or as a single software program or software module, which is not specifically limited here.
  • the server 1105 may be a server that provides various services, such as receiving information acquisition requests sent by the terminal devices 1101, 1102, and 1103, acquiring display information corresponding to the information acquisition requests in various ways according to the information acquisition requests, and sending the relevant data of the display information to the terminal devices 1101, 1102, and 1103.
  • the information display method based on voice interaction provided in the embodiment of the present disclosure can be executed by a terminal device, and accordingly, the information display device based on voice interaction can be set in the terminal devices 1101, 1102, and 1103.
  • the information display method based on voice interaction provided in the embodiment of the present disclosure can also be executed by a server 1105, and accordingly, the information display device based on voice interaction can be set in the server 1105.
  • The numbers of terminal devices, networks and servers in Figure 11 are only illustrative. Any number of terminal devices, networks and servers may be provided according to implementation requirements.
  • FIG. 12 shows a schematic diagram of the structure of an electronic device (e.g., the terminal device or server in FIG. 11) suitable for implementing the embodiment of the present disclosure.
  • the terminal device in the embodiment of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
  • the electronic device shown in FIG. 12 is only an example and should not impose any limitations on the functions and scope of use of the embodiment of the present disclosure.
  • the electronic device may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 1201, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage device 1208 to a random access memory (RAM) 1203.
  • Various programs and data required for the operation of the electronic device are also stored in the RAM 1203. The processing device 1201, the ROM 1202 and the RAM 1203 are connected to each other via a bus 1204.
  • An input/output (I/O) interface 1205 is also connected to the bus 1204.
  • the following devices may be connected to the I/O interface 1205: input devices 1206 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 1207 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 1208 including, for example, a magnetic tape, a hard disk, etc.; and communication devices 1209.
  • the communication device 1209 may allow the electronic device to communicate wirelessly or wired with other devices to exchange data.
  • Although FIG. 12 shows an electronic device with various devices, it should be understood that it is not required to implement or include all the devices shown; more or fewer devices may alternatively be implemented or included.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from a network through a communication device 1209, or installed from a storage device 1208, or installed from a ROM 1202.
  • When the computer program is executed by the processing device 1201, the above-mentioned functions defined in the method of the embodiment of the present disclosure are executed.
  • the above-mentioned computer-readable medium of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, device or device.
  • A computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
  • the computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and server may communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network).
  • Examples of communication networks include a local area network ("LAN”), a wide area network ("WAN”), an internet (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.
  • the computer-readable medium may be included in the electronic device, or may exist independently without being incorporated into the electronic device.
  • the computer-readable medium carries one or more programs.
  • when the one or more programs are executed by the electronic device, the electronic device: determines the interaction segments of the real-time voice interaction based on the operation information on the interaction-related document for the real-time voice interaction; and displays the segment information of the determined interaction segments.
  • determining the interaction segmentation of the real-time voice interaction based on the operation information of the interaction-related documents for the real-time voice interaction includes: determining the interaction segmentation of the real-time voice interaction according to the operation information of the interaction-related documents and the sound signal of the real-time voice interaction.
  • the sound signal of the real-time voice interaction includes a voice signal; and the determining the interaction segmentation of the real-time voice interaction according to the operation information for the interaction-related document and the sound signal of the real-time voice interaction includes: performing voice recognition on the voice signal time periods in the real-time voice interaction to obtain a voice recognition result; segmenting the real-time voice interaction according to the semantic division result of the voice recognition result to obtain candidate segments; and adjusting the candidate segmentation times according to the operation information for the interaction-related document to obtain the interaction segments.
  • the candidate segmentation time is adjusted according to the operation information on the interaction-related document to obtain the interaction segment, including: determining the title switching time of the interaction-related document according to the presentation position information of the interaction-related document; and adjusting the start and end time of the candidate segment according to the title switching time.
  • the presentation location information is determined based on at least one of the following: a title switching operation, document subject information corresponding to a document focus, and document subject information corresponding to a currently displayed comment.
  • adjusting the candidate segmentation time according to the operation information on the interaction-related document to obtain the interaction segmentation includes: merging the two candidate segments in response to the time interval between the start time points of two candidate segments being less than a preset first duration threshold.
  • determining the interaction segmentation of the real-time voice interaction based on the operation information of the interaction-related document and the sound signal of the real-time voice interaction includes: if the duration of a time period in which the sound signal does not include a voice signal is greater than a preset first duration threshold, then determining the time period as a first type of interaction segment.
  • the displaying of the segmentation information of the determined interaction segmentation includes: constructing a hierarchical relationship of the interaction segmentation based on the voice signal and/or document switching operation in the real-time voice interaction; and displaying the segmentation information having the hierarchical relationship.
  • the hierarchical relationship of the interaction segments is constructed based on the voice signal and/or document switching operation in the real-time voice interaction, including: in response to no document switching operation being detected in the real-time voice interaction, determining the interaction first-level title of the real-time voice interaction based on the document first-level title of the interaction-related document.
  • the hierarchical relationship of the interaction segments is constructed based on the voice signal and/or document switching operation in the real-time voice interaction, including: in response to detecting a document switching operation in the real-time voice interaction, based on the document identifier of the interaction-related document, determining the first-level interaction title of the real-time voice interaction; based on the document title of the interaction-related document, determining the N-level interaction title of the real-time voice interaction, wherein N ⁇ 2.
  • the displaying of the segment information having a hierarchical relationship includes: displaying the segment information of the second type of interaction segment as an interaction first-level title, wherein during the second type of interaction segment the real-time voice interaction does not display an interaction-related document, and the duration of the second type of interaction segment is greater than a preset second duration threshold.
  • the hierarchical relationship of interaction segments is constructed based on the voice signal and/or document switching operation in the real-time voice interaction, including: for the target interaction period in the real-time voice interaction, determining whether the target interaction period includes a third type of interaction segment according to the voice signal in the target interaction period, wherein the target interaction period corresponds to the same interaction-related document, and the determination condition of the third type of interaction segment includes that the voice silence duration is greater than a third duration threshold; if the target interaction period includes the third type of segment, for the interaction segment after the third type of interaction segment in the target interaction period, the segment information level is lowered, and a preset indication mark is displayed in front of the segment information after the lowering.
  • the segmentation information includes segmentation topics; and the display of the segmentation information having a hierarchical relationship includes: determining the segmentation topics of the interaction segment based on the document content of the interaction-related documents in the real-time voice interaction; and displaying the segmentation topics.
  • displaying the segmented topics includes: displaying the segmented topics with a hierarchical relationship in the interaction minutes.
  • the electronic device is further used to: in response to a trigger operation on the segmented theme, jump the recorded interaction video to the triggered interaction segment, and play the triggered interaction segment.
  • the computer-readable medium carries one or more programs.
  • when the one or more programs are executed by the electronic device, the electronic device: performs voice recognition on the voice signal time periods in the real-time voice interaction to obtain a voice recognition result; determines the interaction segments of the real-time voice interaction based on the voice recognition result; and displays the segmentation information of the determined interaction segments.
  • Computer program code for performing operations of the present disclosure may be written in one or more programming languages, or a combination thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partially on the user's computer, as a stand-alone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each square box in the flow chart or block diagram can represent a module, a program segment or a part of a code, and the module, the program segment or a part of the code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the square box can also occur in a sequence different from that marked in the accompanying drawings. For example, two square boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved.
  • each square box in the block diagram and/or flow chart, and the combination of the square boxes in the block diagram and/or flow chart can be implemented with a dedicated hardware-based system that performs a specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or hardware.
  • the name of a unit does not limit the unit itself in some cases.
  • a selection unit may also be described as a "unit for selecting a first type of pixel".
  • exemplary types of hardware logic components include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • more specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An information display method and apparatus based on voice interaction, and an electronic device. A specific embodiment of the method comprises: determining interaction segments of a real-time voice interaction on the basis of operation information on an interaction-related document for the real-time voice interaction (101); and displaying segment information of the determined interaction segments (102). A new way of displaying information based on voice interaction is thereby provided.

Description

Information Display Method and Apparatus Based on Voice Interaction, and Electronic Device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 202211351762.7, filed on October 31, 2022 and entitled "Information Display Method and Apparatus Based on Voice Interaction, and Electronic Device", which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates to the field of Internet technology, and in particular to an information display method and apparatus based on voice interaction, and an electronic device.
BACKGROUND
With the development of the Internet, users make ever greater use of the functions of terminal devices, which makes work and life more convenient. For example, a user can start a real-time voice interaction with other users online through a terminal device. Through online real-time voice interaction, users can interact over long distances and can begin interacting without gathering in one place. Real-time voice interaction largely removes the location and venue constraints of traditional face-to-face interaction.
SUMMARY
This summary is provided to introduce concepts in a simplified form that are described in detail in the detailed description that follows. This summary is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.
In a first aspect, an embodiment of the present disclosure provides an information display method based on voice interaction, the method comprising: determining interaction segments of a real-time voice interaction based on operation information on an interaction-related document for the real-time voice interaction; and displaying segment information of the determined interaction segments.
In a second aspect, an embodiment of the present disclosure provides an information display method based on voice interaction, the method comprising: performing speech recognition on speech signal periods in the real-time voice interaction to obtain a speech recognition result; determining interaction segments of the real-time voice interaction according to the speech recognition result; and displaying segment information of the determined interaction segments.
In a third aspect, an embodiment of the present disclosure provides an information display apparatus based on voice interaction, comprising: a recognition module configured to perform speech recognition on speech signal periods in the real-time voice interaction to obtain a speech recognition result; a determination module configured to determine interaction segments of the real-time voice interaction according to the speech recognition result; and a display module configured to display segment information of the determined interaction segments.
In a fourth aspect, an embodiment of the present disclosure provides an information display apparatus based on voice interaction, comprising: a determination unit configured to determine interaction segments of a real-time voice interaction based on operation information on an interaction-related document for the real-time voice interaction; and a display unit configured to display segment information of the determined interaction segments.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, comprising: one or more processors; and a storage apparatus for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the information display method based on voice interaction according to the first aspect.
In a sixth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the information display method based on voice interaction according to the first aspect.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that parts and elements are not necessarily drawn to scale.
Fig. 1 is a flowchart of an embodiment of the information display method based on voice interaction according to the present disclosure;
Fig. 2 is a flowchart of an optional implementation according to the present disclosure;
Fig. 3 is a flowchart of an optional implementation according to the present disclosure;
Fig. 4 is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
Fig. 5 is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
Fig. 6 is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
Fig. 7A is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
Fig. 7B is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
Fig. 7C is a schematic diagram of an application scenario of the information display method based on voice interaction according to the present disclosure;
Fig. 8 is a flowchart of an embodiment of the information display method based on voice interaction according to the present disclosure;
Fig. 9 is a schematic structural diagram of an embodiment of the information display apparatus based on voice interaction according to the present disclosure;
Fig. 10 is a schematic structural diagram of an embodiment of the information display apparatus based on voice interaction according to the present disclosure;
Fig. 11 is an exemplary system architecture in which the information display method based on voice interaction of an embodiment of the present disclosure can be applied;
Fig. 12 is a schematic diagram of the basic structure of an electronic device provided according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps recited in the method implementations of the present disclosure may be performed in different orders and/or in parallel. Furthermore, the method implementations may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order or interdependence of the functions performed by these apparatuses, modules, or units.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple apparatuses in the implementations of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Referring to Fig. 1, a flow of an embodiment of the information display method based on voice interaction according to the present disclosure is shown. As shown in Fig. 1, the information display method based on voice interaction includes the following steps:
Step 101: determining interaction segments of a real-time voice interaction based on operation information on an interaction-related document for the real-time voice interaction.
In this embodiment, the executing body of the information display method based on voice interaction (for example, a server and/or a terminal device) may determine the interaction segments of the above real-time voice interaction based on operation information on the interaction-related document in the real-time voice interaction. It can be understood that a real-time voice interaction may be understood as a voice interaction, and a segment of the real-time voice interaction may be referred to as an interaction segment.
In this embodiment, a real-time voice interaction may be a voice interaction conducted in real time using electronic devices, and may include, for example, online interaction by multimedia means. Multimedia may include, but is not limited to, at least one of audio and video. A real-time voice interaction interface may be an interface related to the real-time voice interaction.
In this embodiment, the application that starts the real-time voice interaction may be any kind of application, which is not limited here. For example, the application may be an instant video interaction application, a communication application, a video playback application, an e-mail application, or the like.
Here, the interaction segments of the real-time voice interaction may be bound to interaction time points, with the period between two interaction time points serving as an interaction segment.
Here, an interaction-related document may include a document related to the interaction. As an example, an interaction-related document may include, but is not limited to, at least one of the following: a shared document bound to the interaction, or a document presented during screen sharing. A shared document bound to the interaction may be bound to the interaction before it starts, or during it (i.e., a document shared in the session).
Here, the operation information on the interaction-related document may indicate operations performed on the interaction-related document.
As an example, operations on the interaction-related document may include, but are not limited to, at least one of the following: switching documents, opening a document, closing a document, browsing a document, selecting a document title, or annotating a document.
As an example, the interaction segments may be determined according to the user's operations of switching between different interaction-related documents. The time points of switching between different interaction-related documents may serve as boundary points of the interaction segments.
As an example, the interaction segments may be determined according to the user's operations of switching titles of the interaction-related document. The time point of each operation by which the user switches a title of the interaction-related document may serve as a boundary point of the interaction segments.
As an example, the interaction segments may be determined according to the user's operations of browsing the interaction-related document. The time points of page-turning operations on the interaction-related document may serve as boundary points of the interaction segments.
Step 102: displaying segment information of the determined interaction segments.
In this embodiment, the above executing body may display the segment information of the determined interaction segments.
In this embodiment, the segment information may indicate particulars of an interaction segment. The segment information may include, but is not limited to, at least one of the following: a segment time and a segment topic.
In this embodiment, the display position of the segment information may be determined according to the actual application scenario and is not limited here.
As an example, the segment information may be displayed in an interaction minutes area.
As an example, the segment information may include text converted from the interaction speech.
In some embodiments, the solution of the present application may be implemented offline, or may be carried out in real time for a real-time voice interaction. Segmenting the recording of a real-time multimedia conference is, in essence, still an offline process.
It should be noted that the information display approach based on voice interaction provided by this embodiment, by determining the interaction segments of the real-time voice interaction based on the operation information on the interaction-related document, provides a new way of determining interaction segments, so that the determined interaction segments can follow the presentation progress of the interaction-related document. It can be understood that, in a real-time voice interaction, the participating users may conduct the interaction along the presentation progress of the interaction-related document. Therefore, determining the interaction segments based on the operation information on the interaction-related document and displaying the segment information makes it possible to determine interaction segments that fit the progress of the real-time voice interaction more closely, improving the accuracy of determining the interaction segments and the interaction information.
By comparison, some related technologies do not keep a good record of interaction segments: users are inefficient when watching the recorded interaction and have to drag the progress bar manually to find the parts relevant to them. The embodiments of the present application implement document-based segmentation of the interaction video, which can effectively structure the interaction and help users find and locate interaction content.
In some embodiments, the above step 101 may include: determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction.
Here, the sound signal of the real-time voice interaction can be classified in different ways according to different classification criteria.
As an example, if classified by whether speech is included, it may include speech signals and non-speech signals; if classified by sound intensity, the sound signal may include sound signals greater than a preset intensity threshold and sound signals not greater than the preset intensity threshold.
In some embodiments, the portions whose sound intensity is greater than the preset intensity threshold may be detected first according to the preset intensity threshold, and speech signals may then be detected within these portions. The sound signal can thus be divided into speech signal periods and periods that contain no speech signal.
In some embodiments, the periods of the sound signal that contain no speech signal may serve as boundaries between interaction segments. In addition, the speech signal periods of the sound signal may be segmented according to the operation information on the interaction-related document; for example, an operation of switching the interaction-related document within a speech signal period may serve as a dividing point of that speech signal period.
It should be noted that, during a real-time voice interaction, the participating users may stop speaking when moving to content on a different topic, and the periods in which speaking stops may indicate the boundary points between interaction segments of the real-time voice interaction. Therefore, segmenting the real-time voice interaction by combining the operation information on the interaction-related document with the sound signal of the real-time voice interaction allows the segmentation to draw on both kinds of information that can characterize segment boundaries, the operation information and the sound signal, so that more accurate interaction segments are determined.
In some embodiments, the above step of determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction may include the flow shown in Fig. 2. The flow shown in Fig. 2 may include step 201, step 202, and step 203.
Step 201: performing speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result.
Here, the sound signal of the real-time voice interaction may include speech signals. Thus, the speech signal periods can be determined from the real-time voice interaction according to whether a period contains a speech signal lasting a preset duration. An interruption duration threshold may be set when judging this preset duration: if the interruption between two stretches of speech signal is shorter than the interruption duration threshold, the two stretches may be regarded as one continuous speech signal with no interruption between them.
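Purely as an informal illustration (not part of the claimed embodiments), the following Python sketch shows one way speech signal periods could be assembled from raw voice-activity spans under this interruption-threshold rule; the function name and the threshold values are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: float  # seconds from the start of the interaction
    end: float

def merge_speech_runs(vad_spans: list[Span],
                      interruption_threshold: float = 1.0,
                      preset_duration: float = 2.0) -> list[Span]:
    """Bridge gaps shorter than the interruption threshold, then keep
    only runs that last at least the preset duration."""
    merged: list[Span] = []
    for span in sorted(vad_spans, key=lambda s: s.start):
        if merged and span.start - merged[-1].end < interruption_threshold:
            merged[-1].end = max(merged[-1].end, span.end)  # treat as continuous speech
        else:
            merged.append(Span(span.start, span.end))
    return [s for s in merged if s.end - s.start >= preset_duration]
```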
Here, speech recognition may be performed on the speech signal periods in the real-time voice interaction to obtain a speech recognition result. The speech recognition result may include text information.
Step 202: segmenting the real-time voice interaction according to the semantic division result of the speech recognition result to obtain candidate segments.
Here, the speech recognition result may be semantically divided to obtain the segments corresponding to the text information. Each segment of text information after division may correspond to time points of the real-time voice interaction.
In some embodiments, the above step 202 may include: semantically dividing the speech recognition result into at least two segments; and determining the dividing points of the real-time voice interaction segments according to the time dividing point between two adjacent speech recognition results, obtaining two adjacent candidate segments of the real-time voice interaction.
As an example, the speech recognition result is semantically divided into two segments; the time point corresponding to this division of the speech recognition result can then serve as the dividing point of the real-time voice interaction segments, yielding two candidate segments of the real-time voice interaction. The multimedia can thereby be segmented into candidate segments.
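As a rough sketch only, and assuming each semantically divided recognition chunk carries its own time range (the dictionary keys here are illustrative, not a disclosed data format), the conversion of division points into candidate segments could look like this:

```python
def candidate_segments(chunks: list[dict]) -> list[dict]:
    """chunks: semantically divided recognition results in time order,
    e.g. {"text": "...", "start": 0.0, "end": 62.4}."""
    boundaries = [chunks[0]["start"]]
    for prev, curr in zip(chunks, chunks[1:]):
        # time dividing point between two adjacent recognition results
        boundaries.append((prev["end"] + curr["start"]) / 2)
    boundaries.append(chunks[-1]["end"])
    return [{"start": a, "end": b, "text": c["text"]}
            for a, b, c in zip(boundaries, boundaries[1:], chunks)]
```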
Step 203: adjusting the candidate segments according to the operation information on the interaction-related document to obtain the interaction segments.
In some embodiments, at least one of the following operations may be performed according to the above operation information: merging two candidate segments into one interaction segment, adjusting the time points of existing candidate segments, or splitting one candidate segment into at least two interaction segments.
It should be noted that, through the implementation corresponding to Fig. 2, the speech recognition result can first be semantically divided to obtain candidate segments, and the candidate segments can then be adjusted according to the operation information on the interaction-related document. The accuracy of the interaction segments can thereby be improved.
In some embodiments, the above step of determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction may include: if the duration of a period of the sound signal that contains no speech signal is greater than a preset first duration threshold, determining that period as a first-type interaction segment.
As an example, the specific value of the preset first duration threshold may be set according to the actual application scenario and may be, for example, 30 seconds.
In some embodiments, if the duration of a period of the sound signal that contains no speech signal is not greater than the preset first duration threshold, the period may be merged into the preceding or following period, or the period may be split, with one part merged into the preceding period and the other part merged into the following period.
It should be noted that, by judging the duration of the periods that contain no speech signal, the silent periods in the interaction can be found accurately. Specifically, the sound signal in the interaction may include speech signals or non-speech signals; for a period containing non-speech signals, even though the period contains a sound signal, the division of this implementation keeps it out of the segmentation of the speech signal, thereby improving the accuracy of the interaction segments.
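A minimal sketch of this silence handling, assuming the non-speech periods have already been extracted and using the 30-second figure mentioned above as an example value for the first duration threshold:

```python
def handle_non_speech(periods: list[tuple[float, float]],
                      first_threshold: float = 30.0):
    """periods: (start, end) spans of the sound signal that contain no
    speech. Long silences become first-type interaction segments; short
    ones are earmarked for folding into their neighbours."""
    first_type, to_merge = [], []
    for start, end in periods:
        if end - start > first_threshold:
            first_type.append((start, end))   # first-type interaction segment
        else:
            to_merge.append((start, end))     # absorb into adjacent segments
    return first_type, to_merge
```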
In some embodiments, the above step 203 may include: determining the title switching times of the interaction-related document according to the presentation position information of the interaction-related document; and adjusting the start and end times of the candidate segments according to the title switching times.
Here, a title switching time is used to indicate the time of switching between different subsections of the interaction-related document.
Here, the above presentation position information may include document position information bound to time. The presentation position information may be the position to which the document has been presented.
Here, the above presentation position information may be determined in a variety of ways.
In some embodiments, the presentation position information may be determined according to at least one of the following: a title switching operation, the document focus, or the document topic information corresponding to a comment.
Here, a title switching operation may include the user's triggering of different entries among the titles, and may also include the user's triggering of titles at various levels within the interaction-related document.
Here, a title switching time of the interaction-related document may indicate the time of switching between different entries among the titles. As an example, the title switching time between Section 1 and Section 2 may indicate the time at which the user switches the document from Section 1 to Section 2.
Here, the times of the candidate segments are adjusted according to the title switching times.
Here, the user may trigger comments on the interaction-related document, and the document topic information corresponding to the comments may be used to determine a title switching time; for example, if the display changes from showing the comments of Section 1 to showing the comments of Section 2, the time of the change may be determined as a title switching time.
It should be noted that adjusting the start and end times of the candidate segments by means of the title switching times makes use of the capture of the presentation focus during the interaction to adjust the candidate segments; combined with the sound signal, this improves segmentation accuracy from both the audio side and the visual side.
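To make the two adjustments concrete, here is an informal sketch, under assumed tolerance and threshold values, of snapping candidate boundaries to the nearest title switching time and then merging candidates whose start points fall within the preset first duration threshold:

```python
def snap_to_title_switches(candidates: list[tuple[float, float]],
                           switch_times: list[float],
                           tolerance: float = 15.0,
                           first_threshold: float = 30.0):
    """Snap each candidate boundary to the nearest title-switch time
    within a tolerance, then merge candidates whose start points are
    less than the first duration threshold apart."""
    def snap(t: float) -> float:
        nearest = min(switch_times, key=lambda s: abs(s - t), default=t)
        return nearest if abs(nearest - t) <= tolerance else t

    snapped = [(snap(a), snap(b)) for a, b in candidates]
    merged = [snapped[0]]
    for start, end in snapped[1:]:
        if start - merged[-1][0] < first_threshold:   # starts too close together
            merged[-1] = (merged[-1][0], end)         # merge the two candidates
        else:
            merged.append((start, end))
    return merged
```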
Referring to Fig. 3, Fig. 3 shows an optional implementation of the above step 102. The flow shown in Fig. 3 may include step 1021 and step 1022.
Step 1021: constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction.
Step 1022: displaying segment information having the hierarchical relationship.
Here, a document switching operation may be used to switch the interaction-related document of the real-time voice interaction. As an example, where the real-time voice interaction has two interaction-related documents, numbered as a first document and a second document, a document switching operation may switch from the first document to the second document.
Here, constructing the segment hierarchy of the interaction segments allows the interaction segments to be presented at different levels, reflecting the relationships between segments.
For example, suppose three interaction segments are obtained, numbered as a first segment, a second segment, and a third segment; the hierarchy of the interaction segments is constructed with two levels, the first and third segments belonging to the first level and the second segment being one level below the first segment. Correspondingly, the segment information of the first and third segments serves as interaction first-level titles, the segment information of the second segment serves as an interaction second-level title, and this interaction second-level title sits under the interaction first-level title corresponding to the first segment.
Here, displaying segment information having the hierarchical relationship may include displaying the relationships between interaction segments in various forms. For example, the segment information of the first-level first and third segments is displayed flush left, while the segment information of the second segment is displayed indented beneath the first-level segment information.
It should be noted that constructing the hierarchical relationship of the interaction segments based on the speech signal and document switching operations of the real-time voice interaction, and displaying segment information with this hierarchical relationship, lets users clearly grasp the hierarchy among the interaction segments and makes it easy for them to understand the structure of the real-time voice interaction.
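For illustration only, a toy rendering of such hierarchical segment information as indented minutes text (the tuple layout and the timestamps are made up):

```python
def render_minutes(entries: list[tuple[int, str, int]]) -> str:
    """entries: (level, topic, start_seconds) tuples in playback order.
    Level-1 topics are flush left; deeper levels are indented, echoing
    the hierarchy of the interaction segments."""
    lines = []
    for level, topic, start in entries:
        minutes, seconds = divmod(int(start), 60)
        lines.append("  " * (level - 1) + f"{topic}  [{minutes:02d}:{seconds:02d}]")
    return "\n".join(lines)

print(render_minutes([(1, "Chapter 1", 0), (2, "Section 1.1", 310), (1, "Chapter 2", 1800)]))
```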
In some embodiments, step 1021 may include: in response to no document switching operation being detected in the real-time voice interaction, determining the interaction first-level titles of the real-time voice interaction based on the document first-level titles of the interaction-related document.
Here, a document directory may include multiple levels of titles, and a first-level title in the document directory may be called a document first-level title.
Here, the interaction may include multiple levels of segments; as an example, an interaction first-level segment may include interaction second-level segments, and an interaction second-level segment may include interaction third-level segments. The interaction directory may include interaction titles of multiple levels; a first-level title in the interaction directory may be called an interaction first-level title and indicates an interaction first-level segment. The titles at each level of the interaction directory may indicate interaction segments. Optionally, the titles at each level of the interaction may be the segment topics of the interaction segments, and the hierarchy of the interaction titles is consistent with the hierarchy of the interaction segments.
As an example, please refer to Fig. 4, which shows a scenario in which the document first-level titles of the interaction-related document serve as the interaction first-level titles.
In Fig. 4, the playback area 401 can play the interaction video of the real-time voice interaction. The interaction first-level title 402 may be a document first-level title of document A. The interaction second-level title 403 may be a document second-level title in document A, and the interaction second-level title 403 is one level below the interaction first-level title 402; in document A, Section 1.1 belongs to Chapter 1. The interaction first-level title 404 may be a document first-level title in document A.
It should be noted that when there is no document switching in the real-time voice interaction, that is, when there is only one interaction-related document, determining the segments and titles at each level of the real-time voice interaction from the document titles of the interaction-related document makes it possible to determine the interaction progress quickly and accurately in an interaction that takes the interaction-related document as its main thread.
In some embodiments, if there is document switching in the real-time voice interaction, the document identifiers may serve as the first-level titles, and the interaction first-level titles of the interaction-related documents may serve as second-level titles of the interaction segments.
In some embodiments, step 1021 includes: in response to a document switching operation being detected in the real-time voice interaction, determining the interaction first-level titles of the real-time voice interaction based on the document identifiers of the interaction-related documents; and determining the interaction N-level titles of the real-time voice interaction based on the document titles of the interaction-related documents, where N ≥ 2.
As an example, please refer to Fig. 5, which shows a scenario in which the document identifiers of the interaction-related documents serve as the interaction first-level titles.
In Fig. 5, the playback area 501 can play the interaction video of the real-time voice interaction. The interaction first-level title 502 may be the document identifier of document A. The interaction second-level title 503 may be a document first-level title in document A, the interaction second-level title 503 is one level below the interaction first-level title 502, and the interaction third-level title 504 is one level below the interaction second-level title 503; in document A, Section 1.1 belongs to Chapter 1. The interaction first-level title 505 may be the document identifier of document B.
It should be noted that, in a real-time voice interaction with multiple interaction-related documents, taking the document identifiers as the interaction first-level titles allows the interaction periods of the real-time voice interaction to be divided with the documents as the main nodes; determining the interaction N-level titles (N ≥ 2) of the real-time voice interaction from the document titles allows the hierarchical relationships of the interaction segments corresponding to the same interaction-related document to be determined quickly with the help of the document's own divisions.
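A compact sketch of this branching rule, with assumed data shapes (each heading is a (level, text) pair using the document's own levels, where 1 means a chapter):

```python
def build_titles(doc_sessions: list[tuple[str, list[tuple[int, str]]]],
                 switched: bool) -> list[tuple[int, str]]:
    """doc_sessions: [(doc_id, [(level, text), ...]), ...] in
    presentation order."""
    titles: list[tuple[int, str]] = []
    for doc_id, headings in doc_sessions:
        if switched:
            titles.append((1, doc_id))  # document identifier as level 1
            titles.extend((level + 1, text) for level, text in headings)  # N >= 2
        else:
            titles.extend(headings)     # document first-level titles lead directly
    return titles
```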
In some embodiments, step 1022 may include: displaying the segment information of a second-type interaction segment as an interaction first-level title.
Here, during a second-type interaction segment, the real-time voice interaction does not present an interaction-related document.
As an example, an interaction segment of the real-time voice interaction during which no interaction-related document is shared, and whose duration is greater than a preset second duration threshold, is determined as a second-type interaction segment.
As an example, please refer to Fig. 5: the interaction first-level title 506 in Fig. 5 may indicate a second-type interaction segment. As shown in Fig. 5, the interaction first-level title 506 may be labeled "Discussion", indicating that in this interaction segment the participants are holding a discussion.
In some embodiments, second-type interaction segments are at the same level as the document identifiers of the interaction-related documents.
It should be noted that taking interaction segments in which no interaction-related document is presented as interaction first-level titles places the users' discussion periods alongside the document-sharing interaction periods, which makes the hierarchical relationship of the interaction segments more accurate and reasonable; when the interaction progress needs to be reviewed, the interaction structure can be determined quickly.
In some embodiments, step 1021 includes: for a target interaction period in the real-time voice interaction, determining, according to the speech signal in the target interaction period, whether the target interaction period includes a third-type interaction segment; and if the target interaction period includes a third-type segment, adding a preset indicator for the interaction segments after the third-type interaction segment in the target interaction period.
In some embodiments, if the target interaction period includes a third-type segment, the segment information level of the interaction segments after the third-type interaction segment in the target interaction period is lowered, and the preset indicator is displayed before the lowered segment information. Here, the interaction segments in the target interaction period correspond to the same interaction-related document. As an example, the segment topics of the interaction segments in the target interaction period belong to the same interaction-related document.
As an example, a target interaction period includes three interaction segments: in the first interaction segment, users are talking with one another; the second interaction segment is a third-type interaction segment; and the third interaction segment is a discussion centered on document A.
As an example, the first interaction segment has at least one of the following features: at the start of the recording of the real-time voice interaction, multiple people speak in frequent alternation, the speech is recognized as short sentences (fewer than 30 characters per sentence), and the duration exceeds 1 minute; a segmentation point is then set at the start of the video, with the segment title "Introduction" (导读).
As an example, the second interaction segment has at least one of the following features: after the first interaction segment, a silent period of more than 5 minutes appears, which may be regarded as a document-reading phase (i.e., the third-type interaction segment). After the document-reading phase ends, the interaction segmentation follows the document titles, and a first-level title "Comment review" (过评论) may be added before the document titles.
Here, the determination condition of the third-type interaction segment includes the speech silence duration being greater than a third duration threshold (for example, 5 minutes).
As an example, please refer to Fig. 6, which shows an exemplary scenario in which the target interaction period includes a third-type interaction segment.
In Fig. 6, the playback area 601 can play the interaction video of the real-time voice interaction. The preset indicator 602 (for example, labeled "Comment review") may serve as an interaction first-level title displayed before the interaction second-level title 603, the interaction second-level title 603 being a document first-level title of document A (i.e., Chapter 1). The interaction third-level title 604 is one level below the interaction second-level title 603; in document A, Section 1.1 belongs to Chapter 1. The preset indicator 605 serves as an interaction first-level title displayed before the interaction second-level title 606, the interaction second-level title 606 being a document first-level title of document A (i.e., Chapter 2).
It should be noted that identifying the third-type interaction segment in the target interaction period makes it possible, for real-time voice interactions that follow the pattern of participants first reading the interaction-related document and then discussing it together, to determine accurately whether the silent period is associated with the interaction-related document, and thus to determine the main content of the user interaction after the silent period and indicate that content with the preset indicator.
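An informal sketch of this reading-phase rule, assuming the segments and silence spans of one target period are given and using the 5-minute figure above as the third duration threshold; the indicator text is only an example:

```python
def mark_review_segments(segments: list[dict],
                         silence_spans: list[tuple[float, float]],
                         third_threshold: float = 300.0) -> list[dict]:
    """segments: interaction segments of one target period, all tied to
    the same interaction-related document, each like
    {"start": ..., "level": ..., "title": ...}."""
    reading = next((s for s in silence_spans
                    if s[1] - s[0] > third_threshold), None)  # third-type segment
    if reading is None:
        return segments
    marked = []
    for seg in segments:
        if seg["start"] > reading[1]:
            # demote one level and prefix the preset indicator
            seg = {**seg, "level": seg["level"] + 1,
                   "prefix": "Comment review"}  # hypothetical indicator text
        marked.append(seg)
    return marked
```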
In some embodiments, the above segment information may include segment topics.
The above step 1022 may include: determining the segment topics of the interaction segments according to the document content of the interaction-related document in the real-time voice interaction; and displaying the segment topics.
Please refer to Fig. 7A, which shows a scenario of displaying the segment information.
In Fig. 7A, the playback area 701 can play the interaction video of the real-time voice interaction. The document content of the interaction-related document may cover the choice of lunch and dinner. The real-time voice interaction may include two segments: the first segment corresponds to a title or content summary in the document ("What to eat at noon"), i.e., segment title 702, with the sub-title 703 of segment title 702 (labeled "Noodles"); the second segment corresponds to another title or content summary in the document ("What to eat in the evening"), i.e., segment title 704.
Thus, the determined segment topics of the interaction segments can draw on the interaction-related document. It can be understood that, in a real-time voice interaction that has an interaction-related document, making full use of how closely the interaction-related document tracks the interaction allows the segment topics of the interaction segments to be determined accurately, so that users can quickly understand the interaction progress from the segment topics, improving the efficiency of obtaining interaction-related information.
In some embodiments, displaying the segment topics may include: displaying the segment topics with their hierarchical relationship in the interaction minutes.
As an example, please refer to Fig. 7A, which shows the interaction minutes display area 705; in the interaction minutes display area 705, the segment topics can be displayed with hierarchical relationships between them.
In some embodiments, the method further includes: in response to a trigger operation on a displayed segment topic, jumping the recorded interaction video to the triggered interaction segment and playing the triggered interaction segment.
As an example, when the user triggers the segment title 704 in Fig. 7A, the playback area 701 can play the interaction segment indicated by the segment title 704.
Users can thus quickly follow the interaction progress against the segment topics; if they wish to watch a particular segment of the real-time voice interaction, they can trigger its segment title and jump straight to playback of the segment corresponding to the triggered title.
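As a bare-bones sketch of this jump behavior, with `player` standing in for whatever playback interface the client actually exposes (an assumption, not a disclosed API):

```python
def on_topic_clicked(topic: str,
                     topic_to_start: dict[str, float],
                     player) -> None:
    """Jump the recorded interaction video to the segment whose topic
    was triggered, then play it."""
    start = topic_to_start[topic]  # segment start time in seconds
    player.seek(start)
    player.play()
```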
In some embodiments, displaying the segment information of the determined interaction segments includes at least one of the following, without limitation: displaying the segment information of the determined interaction segments during the real-time voice interaction; and displaying the segment information of the determined interaction segments in the speech recognition result corresponding to the real-time voice interaction, during the real-time voice interaction and/or after it ends.
It should be noted that displaying the interaction segment information during the voice interaction makes it convenient for users in the interaction to review the preceding interaction structure in good time and to recall the interaction content already discussed.
It should be noted that displaying the segment information of the interaction segments in the speech recognition result gives users an intuitive view of the interaction structure when they use the speech recognition result to recall content, and lets them use the interaction structure to understand the speech recognition result further, helping users obtain the interaction content quickly.
In some embodiments, displaying the segment information of the determined interaction segments may include at least one of the following, without limitation: displaying, on the timeline corresponding to the real-time voice interaction, the segment information corresponding to time points; displaying the segment information of the interaction segments in association with document content information; and displaying the segment information of the interaction segments in association with the document structure.
As an example, please refer to Fig. 7B: the playback area 701 in Fig. 7B can play the interaction video of the real-time voice interaction, and Fig. 7B shows the timeline 706 corresponding to the real-time voice interaction. The document content of the interaction-related document may cover the choice of lunch and dinner. The real-time voice interaction may include two segments: "What to eat at noon", starting at the 30th minute of the interaction, and "What to eat in the evening", starting at the 60th minute. On the timeline 706, "What to eat at noon" can be displayed at the 30-minute mark, and "What to eat in the evening" can be displayed at the 60-minute mark.
In some embodiments, the document content information may be used to indicate the document content. As an example, the document content information may include the document body and document titles. As an example, the segment times corresponding to the respective body parts may be displayed within the document body.
In some embodiments, the document structure may be used to indicate the structure of the document. As an example, the document structure may be the structure of the interaction-related document.
Please refer to Fig. 7C: the document structure display area 707 of Fig. 7C can display the document structure, which includes "What to eat at noon" indicating the first part of the document and "What to eat in the evening" indicating the second part of the document. The segment time of the corresponding interaction segment (i.e., 00:00-30:00) can be displayed in association with "What to eat at noon" of the first part of the document; the segment time of the corresponding interaction segment (i.e., 30:01-60:00) can be displayed in association with "What to eat in the evening" of the second part of the document.
Referring to Fig. 8, a flow of an embodiment of the information display method based on voice interaction according to the present disclosure is shown. As shown in Fig. 8, the information display method based on voice interaction includes the following steps:
Step 801: performing speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result.
Step 802: determining the interaction segments of the real-time voice interaction according to the speech recognition result.
Step 803: displaying the segment information of the determined interaction segments.
It should be noted that, through the embodiment provided in Fig. 8, the interaction segments of the real-time voice interaction can be determined according to the speech recognition result; the speech recognition result can thus be used, when determining the interaction segments, to indicate how the segment content differs between different interaction segments. The accuracy of the interaction segments can thereby be improved.
In some embodiments, determining the interaction segments of the real-time voice interaction according to the speech recognition result includes: performing speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result; segmenting the real-time voice interaction according to the semantic division result of the speech recognition result to obtain candidate segments; and determining the interaction segments according to the candidate segment times.
In some embodiments, segmenting the real-time voice interaction according to the semantic division result of the speech recognition result to obtain candidate segments includes: semantically dividing the speech recognition result into at least two segments; and determining the dividing points of the real-time voice interaction segments according to the time dividing point between two adjacent speech recognition results, obtaining two adjacent candidate segments of the real-time voice interaction.
It should be noted that the technical features of the embodiment corresponding to Fig. 8 can be combined with any technical feature or technical solution in the other embodiments of the present application.
With further reference to Fig. 9, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an information display apparatus based on voice interaction; this apparatus embodiment corresponds to the method embodiment shown in Fig. 1, and the apparatus can be applied to various electronic devices.
As shown in Fig. 9, the information display apparatus based on voice interaction of this embodiment includes a determination unit 901 and a display unit 902. The determination unit is configured to determine the interaction segments of a real-time voice interaction based on operation information on an interaction-related document for the real-time voice interaction; the display unit is configured to display the segment information of the determined interaction segments.
In this embodiment, for the specific processing of the determination unit 901 and the display unit 902 of the information display apparatus based on voice interaction and the technical effects they bring, reference may be made to the descriptions of step 101 and step 102 in the embodiment corresponding to Fig. 1, which are not repeated here.
In some embodiments, determining the interaction segments of the real-time voice interaction based on operation information on the interaction-related document for the real-time voice interaction includes: determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction.
In some embodiments, the sound signal of the real-time voice interaction includes speech signals; and determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction includes: performing speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result; segmenting the real-time voice interaction according to the semantic division result of the speech recognition result to obtain candidate segments; and adjusting the candidate segment times according to the operation information on the interaction-related document to obtain the interaction segments.
In some embodiments, segmenting the real-time voice interaction according to the semantic division result of the speech recognition result to obtain candidate segments includes: semantically dividing the speech recognition result into at least two segments; and determining the dividing points of the real-time voice interaction segments according to the time dividing point between two adjacent speech recognition results, obtaining two adjacent candidate segments of the real-time voice interaction.
In some embodiments, adjusting the candidate segment times according to the operation information on the interaction-related document to obtain the interaction segments includes: determining the title switching times of the interaction-related document according to the presentation position information of the interaction-related document; and adjusting the start and end times of the candidate segments according to the title switching times, the title switching times being used to indicate the times of switching between different subsections of the interaction-related document.
In some embodiments, the presentation position information is determined according to at least one of the following: a title switching operation, the document topic information corresponding to the document focus, or the document topic information corresponding to a currently displayed comment.
In some embodiments, adjusting the candidate segment times according to the operation information on the interaction-related document to obtain the interaction segments includes: in response to the time interval between the start time points of two candidate segments being less than a preset first duration threshold, merging the two candidate segments.
In some embodiments, determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction includes: if the duration of a period of the sound signal that contains no speech signal is greater than a preset first duration threshold, determining the period as a first-type interaction segment.
In some embodiments, displaying the segment information of the determined interaction segments includes: constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction; and displaying segment information having the hierarchical relationship.
In some embodiments, constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction includes: in response to no document switching operation being detected in the real-time voice interaction, determining the interaction first-level titles of the real-time voice interaction based on the document first-level titles of the interaction-related document.
In some embodiments, constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction includes: in response to a document switching operation being detected in the real-time voice interaction, determining the interaction first-level titles of the real-time voice interaction based on the document identifiers of the interaction-related documents; and determining the interaction N-level titles of the real-time voice interaction based on the document titles of the interaction-related documents, where N ≥ 2.
In some embodiments, displaying segment information having the hierarchical relationship includes: displaying the segment information of a second-type interaction segment as an interaction first-level title, wherein during the second-type interaction segment the real-time voice interaction does not present an interaction-related document, and the duration of the second-type interaction segment is greater than a preset second duration threshold.
In some embodiments, constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction includes: for a target interaction period in the real-time voice interaction, determining, according to the speech signal in the target interaction period, whether the target interaction period includes a third-type interaction segment, wherein the target interaction period corresponds to the same interaction-related document and the determination condition of the third-type interaction segment includes the speech silence duration being greater than a third duration threshold; and if the target interaction period includes a third-type segment, lowering the segment information level of the interaction segments after the third-type interaction segment in the target interaction period, and displaying a preset indicator before the lowered segment information.
In some embodiments, the segment information includes segment topics; and displaying segment information having the hierarchical relationship includes: determining the segment topics of the interaction segments according to the document content of the interaction-related document in the real-time voice interaction; and displaying the segment topics.
In some embodiments, displaying the segment topics includes: displaying the segment topics with their hierarchical relationship in the interaction minutes.
In some embodiments, the apparatus is further configured to: in response to a trigger operation on a segment topic, jump the recorded interaction video to the triggered interaction segment and play the triggered interaction segment.
With further reference to Fig. 10, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an information display apparatus based on voice interaction; this apparatus embodiment corresponds to the method embodiment shown in Fig. 8, and the apparatus can be applied to various electronic devices.
As shown in Fig. 10, the information display apparatus based on voice interaction of this embodiment includes a recognition module 1001, a determination module 1002, and a display module 1003. The recognition module is configured to perform speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result; the determination module is configured to determine the interaction segments of the real-time voice interaction according to the speech recognition result; and the display module is configured to display the segment information of the determined interaction segments.
In some embodiments, determining the interaction segments of the real-time voice interaction according to the speech recognition result includes: performing speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result; segmenting the real-time voice interaction according to the semantic division result of the speech recognition result to obtain candidate segments; and determining the interaction segments according to the candidate segment times.
In some embodiments, segmenting the real-time voice interaction according to the semantic division result of the speech recognition result to obtain candidate segments includes: semantically dividing the speech recognition result into at least two segments; and determining the dividing points of the real-time voice interaction segments according to the time dividing point between two adjacent speech recognition results, obtaining two adjacent candidate segments of the real-time voice interaction.
Please refer to Fig. 11, which shows an exemplary system architecture in which the information display method based on voice interaction of an embodiment of the present disclosure can be applied.
As shown in Fig. 11, the system architecture may include terminal devices 1101, 1102, and 1103, a network 1104, and a server 1105. The network 1104 is the medium that provides communication links between the terminal devices 1101, 1102, and 1103 and the server 1105. The network 1104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
The terminal devices 1101, 1102, and 1103 can interact with the server 1105 through the network 1104 to receive or send messages and the like. Various client applications, such as web browser applications, search applications, and news applications, may be installed on the terminal devices 1101, 1102, and 1103. A client application on the terminal devices 1101, 1102, and 1103 can receive a user's instruction and complete the corresponding function according to the user's instruction, for example adding corresponding information to information according to the user's instruction.
The terminal devices 1101, 1102, and 1103 may be hardware or software. When the terminal devices 1101, 1102, and 1103 are hardware, they may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and so on. When the terminal devices 1101, 1102, and 1103 are software, they may be installed in the electronic devices listed above; they may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module, which is not specifically limited here.
The server 1105 may be a server that provides various services, for example receiving information acquisition requests sent by the terminal devices 1101, 1102, and 1103, obtaining the display information corresponding to an information acquisition request in various ways according to the request, and sending the relevant data of the display information to the terminal devices 1101, 1102, and 1103.
It should be noted that the information display method based on voice interaction provided by the embodiments of the present disclosure may be executed by a terminal device, in which case the information display apparatus based on voice interaction may be provided in the terminal devices 1101, 1102, and 1103. In addition, the information display method based on voice interaction provided by the embodiments of the present disclosure may also be executed by the server 1105, in which case the information display apparatus based on voice interaction may be provided in the server 1105.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 11 are merely illustrative; there may be any number of terminal devices, networks, and servers according to implementation needs.
Referring now to Fig. 12, it shows a schematic structural diagram of an electronic device (for example, the terminal device or server in Fig. 11) suitable for implementing the embodiments of the present disclosure. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Fig. 12 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 12, the electronic device may include a processing apparatus (for example, a central processing unit or a graphics processing unit) 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage apparatus 1208 into a random access memory (RAM) 1203. Various programs and data required for the operation of the electronic device are also stored in the RAM 1203. The processing apparatus 1201, the ROM 1202, and the RAM 1203 are connected to one another via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Generally, the following apparatuses may be connected to the I/O interface 1205: input apparatuses 1206 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output apparatuses 1207 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage apparatuses 1208 including, for example, a magnetic tape and a hard disk; and a communication apparatus 1209. The communication apparatus 1209 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. Although Fig. 12 shows an electronic device with various apparatuses, it should be understood that it is not required to implement or provide all of the apparatuses shown; more or fewer apparatuses may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication apparatus 1209, installed from the storage apparatus 1208, or installed from the ROM 1202. When the computer program is executed by the processing apparatus 1201, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; the computer-readable signal medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to wires, optical cables, RF (radio frequency), and the like, or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internet (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The computer-readable medium may be included in the above electronic device, or it may exist independently without being assembled into the electronic device.
The computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: determine the interaction segments of a real-time voice interaction based on operation information on an interaction-related document for the real-time voice interaction; and display the segment information of the determined interaction segments.
In some embodiments, determining the interaction segments of the real-time voice interaction based on operation information on the interaction-related document for the real-time voice interaction includes: determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction.
In some embodiments, the sound signal of the real-time voice interaction includes speech signals; and determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction includes: performing speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result; segmenting the real-time voice interaction according to the semantic division result of the speech recognition result to obtain candidate segments; and adjusting the candidate segment times according to the operation information on the interaction-related document to obtain the interaction segments.
In some embodiments, adjusting the candidate segment times according to the operation information on the interaction-related document to obtain the interaction segments includes: determining the title switching times of the interaction-related document according to the presentation position information of the interaction-related document; and adjusting the start and end times of the candidate segments according to the title switching times.
In some embodiments, the presentation position information is determined according to at least one of the following: a title switching operation, the document topic information corresponding to the document focus, or the document topic information corresponding to a currently displayed comment.
In some embodiments, adjusting the candidate segment times according to the operation information on the interaction-related document to obtain the interaction segments includes: in response to the time interval between the start time points of two candidate segments being less than a preset first duration threshold, merging the two candidate segments.
In some embodiments, determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction includes: if the duration of a period of the sound signal that contains no speech signal is greater than a preset first duration threshold, determining the period as a first-type interaction segment.
In some embodiments, displaying the segment information of the determined interaction segments includes: constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction; and displaying segment information having the hierarchical relationship.
In some embodiments, constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction includes: in response to no document switching operation being detected in the real-time voice interaction, determining the interaction first-level titles of the real-time voice interaction based on the document first-level titles of the interaction-related document.
In some embodiments, constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction includes: in response to a document switching operation being detected in the real-time voice interaction, determining the interaction first-level titles of the real-time voice interaction based on the document identifiers of the interaction-related documents; and determining the interaction N-level titles of the real-time voice interaction based on the document titles of the interaction-related documents, where N ≥ 2.
In some embodiments, displaying segment information having the hierarchical relationship includes: displaying the segment information of a second-type interaction segment as an interaction first-level title, wherein during the second-type interaction segment the real-time voice interaction does not present an interaction-related document, and the duration of the second-type interaction segment is greater than a preset second duration threshold.
In some embodiments, constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction includes: for a target interaction period in the real-time voice interaction, determining, according to the speech signal in the target interaction period, whether the target interaction period includes a third-type interaction segment, wherein the target interaction period corresponds to the same interaction-related document and the determination condition of the third-type interaction segment includes the speech silence duration being greater than a third duration threshold; and if the target interaction period includes a third-type segment, lowering the segment information level of the interaction segments after the third-type interaction segment in the target interaction period, and displaying a preset indicator before the lowered segment information.
In some embodiments, the segment information includes segment topics; and displaying segment information having the hierarchical relationship includes: determining the segment topics of the interaction segments according to the document content of the interaction-related document in the real-time voice interaction; and displaying the segment topics.
In some embodiments, displaying the segment topics includes: displaying the segment topics with their hierarchical relationship in the interaction minutes.
In some embodiments, the electronic device is further configured to: in response to a trigger operation on a segment topic, jump the recorded interaction video to the triggered interaction segment and play the triggered interaction segment.
The computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: perform speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result; determine the interaction segments of the real-time voice interaction according to the speech recognition result; and display the segment information of the determined interaction segments.
The computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, the programming languages including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partially on the user's computer, as a stand-alone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architectures, functions, and operations of possible implementations of the systems, methods, and computer program products according to the various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented with a dedicated hardware-based system that performs the specified functions or operations, or with a combination of dedicated hardware and computer instructions.
The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a unit does not, in some cases, limit the unit itself; for example, a selection unit may also be described as "a unit for selecting pixels of a first type".
The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
The above description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example technical solutions formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment; conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (26)

  1. An information display method based on voice interaction, characterized by comprising:
    determining interaction segments of a real-time voice interaction based on operation information on an interaction-related document for the real-time voice interaction; and
    displaying segment information of the determined interaction segments.
  2. The method according to claim 1, characterized in that determining the interaction segments of the real-time voice interaction based on operation information on the interaction-related document for the real-time voice interaction comprises:
    determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction.
  3. The method according to claim 2, characterized in that the sound signal of the real-time voice interaction comprises a speech signal; and
    determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction comprises:
    performing speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result;
    segmenting the real-time voice interaction according to a semantic division result of the speech recognition result to obtain candidate segments; and
    adjusting the candidate segment times according to the operation information on the interaction-related document to obtain the interaction segments.
  4. The method according to claim 3, characterized in that segmenting the real-time voice interaction according to a semantic division result of the speech recognition result to obtain candidate segments comprises:
    semantically dividing the speech recognition result into at least two segments; and
    determining dividing points of the real-time voice interaction segments according to the time dividing point between two adjacent speech recognition results, obtaining two adjacent candidate segments of the real-time voice interaction.
  5. The method according to claim 2, characterized in that adjusting the candidate segment times according to the operation information on the interaction-related document to obtain the interaction segments comprises:
    determining title switching times of the interaction-related document according to presentation position information of the interaction-related document, wherein a title switching time is used to indicate the time of switching between different subsections of the interaction-related document; and
    adjusting start and end times of the candidate segments according to the title switching times.
  6. The method according to claim 5, characterized in that the presentation position information is determined according to at least one of the following: a title switching operation, document topic information corresponding to a document focus, or document topic information corresponding to a currently displayed comment.
  7. The method according to claim 3, characterized in that adjusting the candidate segment times according to the operation information on the interaction-related document to obtain the interaction segments comprises:
    in response to the time interval between the start time points of two candidate segments being less than a preset first duration threshold, merging the two candidate segments.
  8. The method according to claim 2, characterized in that determining the interaction segments of the real-time voice interaction according to the operation information on the interaction-related document and the sound signal of the real-time voice interaction comprises:
    if the duration of a period of the sound signal that contains no speech signal is greater than a preset first duration threshold, determining the period as a first-type interaction segment.
  9. The method according to any one of claims 1-8, characterized in that displaying the segment information of the determined interaction segments comprises:
    constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction; and
    displaying segment information having the hierarchical relationship.
  10. The method according to claim 9, characterized in that constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction comprises:
    in response to no document switching operation being detected in the real-time voice interaction, determining interaction first-level titles of the real-time voice interaction based on document first-level titles of the interaction-related document, wherein the interaction titles at each level correspond to the interaction segments at each level.
  11. The method according to claim 9, characterized in that constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction comprises:
    in response to a document switching operation being detected in the real-time voice interaction, determining interaction first-level titles of the real-time voice interaction based on document identifiers of the interaction-related documents; and
    determining interaction N-level titles of the real-time voice interaction based on document titles of the interaction-related documents, where N ≥ 2, wherein the interaction titles at each level correspond to the interaction segments at each level.
  12. The method according to claim 9, characterized in that displaying segment information having the hierarchical relationship comprises:
    displaying the segment information of a second-type interaction segment as an interaction first-level title, wherein during the second-type interaction segment the real-time voice interaction does not present an interaction-related document, and the duration of the second-type interaction segment is greater than a preset second duration threshold.
  13. The method according to claim 9, characterized in that constructing a hierarchical relationship of the interaction segments based on the speech signal and/or document switching operations in the real-time voice interaction comprises:
    for a target interaction period in the real-time voice interaction, determining, according to the speech signal in the target interaction period, whether the target interaction period includes a third-type interaction segment, wherein the segment topics of the interaction segments in the target interaction period belong to the same interaction-related document, and the determination condition of the third-type interaction segment includes the speech silence duration being greater than a third duration threshold; and
    if the target interaction period includes a third-type segment, adding a preset indicator for the interaction segments after the third-type interaction segment in the target interaction period.
  14. The method according to claim 1, characterized in that the segment information includes segment topics; and
    displaying segment information having the hierarchical relationship comprises:
    determining the segment topics of the interaction segments according to the document content of the interaction-related document in the real-time voice interaction; and
    displaying the segment topics.
  15. The method according to claim 14, characterized in that displaying the segment topics comprises:
    displaying the segment topics with their hierarchical relationship in the interaction minutes.
  16. The method according to claim 14, characterized in that the method further comprises:
    in response to a trigger operation on a segment topic, jumping the recorded interaction video to the triggered interaction segment, and playing the triggered interaction segment.
  17. The method according to claim 1, characterized in that displaying the segment information of the determined interaction segments comprises at least one of the following:
    displaying the segment information of the determined interaction segments during the real-time voice interaction; and
    displaying the segment information of the determined interaction segments in the speech recognition result corresponding to the real-time voice interaction, during the real-time voice interaction and/or after the real-time voice interaction ends.
  18. The method according to claim 1, characterized in that displaying the segment information of the determined interaction segments comprises at least one of the following:
    displaying, on a timeline corresponding to the real-time voice interaction, segment information corresponding to time points;
    displaying the segment information of the interaction segments in association with document content information; and
    displaying the segment information of the interaction segments in association with a document structure, wherein the segment information includes segment times.
  19. The method according to claim 1, characterized in that the interaction-related document includes a document shared during the real-time voice interaction.
  20. An information display method based on voice interaction, characterized by comprising:
    performing speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result;
    determining interaction segments of the real-time voice interaction according to the speech recognition result; and
    displaying segment information of the determined interaction segments.
  21. The method according to claim 20, characterized in that determining the interaction segments of the real-time voice interaction according to the speech recognition result comprises:
    performing speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result;
    segmenting the real-time voice interaction according to a semantic division result of the speech recognition result to obtain candidate segments; and
    determining the interaction segments according to the candidate segment times.
  22. The method according to claim 21, characterized in that segmenting the real-time voice interaction according to a semantic division result of the speech recognition result to obtain candidate segments comprises:
    semantically dividing the speech recognition result into at least two segments; and
    determining dividing points of the real-time voice interaction segments according to the time dividing point between two adjacent speech recognition results, obtaining two adjacent candidate segments of the real-time voice interaction.
  23. An information display apparatus based on voice interaction, characterized by comprising:
    a determination unit configured to determine interaction segments of a real-time voice interaction based on operation information on an interaction-related document for the real-time voice interaction; and
    a display unit configured to display segment information of the determined interaction segments.
  24. An information display apparatus based on voice interaction, characterized by comprising:
    a recognition module configured to perform speech recognition on the speech signal periods in the real-time voice interaction to obtain a speech recognition result;
    a determination module configured to determine interaction segments of the real-time voice interaction according to the speech recognition result; and
    a display module configured to display segment information of the determined interaction segments.
  25. An electronic device, characterized by comprising:
    one or more processors; and
    a storage apparatus for storing one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-22.
  26. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-22.
PCT/CN2023/113531 2022-10-31 2023-08-17 Information display method and apparatus based on voice interaction, and electronic device WO2024093443A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211351762.7 2022-10-31
CN202211351762.7A CN115547330A (zh) Information display method and apparatus based on voice interaction, and electronic device

Publications (1)

Publication Number Publication Date
WO2024093443A1 true WO2024093443A1 (zh) 2024-05-10

Family

ID=84719251

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/113531 WO2024093443A1 (zh) 2022-10-31 2023-08-17 基于语音交互的信息展示方法、装置和电子设备

Country Status (2)

Country Link
CN (1) CN115547330A (zh)
WO (1) WO2024093443A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115547330A (zh) Information display method and apparatus based on voice interaction, and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200011198A (ko) Method, apparatus, and program for implementing interactive messages
CN110989889A (zh) Information display method, information display apparatus, and electronic device
CN112562665A (zh) Speech recognition method based on information interaction, storage medium, and system
CN113014854A (zh) Method, apparatus, device, and medium for generating interaction records
CN114168710A (zh) Method, apparatus, system, device, and storage medium for generating meeting minutes
CN114936001A (zh) Interaction method and apparatus, and electronic device
CN115547330A (zh) Information display method and apparatus based on voice interaction, and electronic device


Also Published As

Publication number Publication date
CN115547330A (zh) 2022-12-30

Similar Documents

Publication Publication Date Title
WO2022042593A1 (zh) Subtitle editing method and apparatus, and electronic device
WO2023051102A1 (zh) Video recommendation method and apparatus, device, and medium
US8799300B2 (en) Bookmarking segments of content
US20170199943A1 (en) User interface for multivariate searching
US20190130185A1 (en) Visualization of Tagging Relevance to Video
CN111858974B (zh) Information pushing method and apparatus, electronic device, and storage medium
WO2021259300A1 (zh) Sound effect adding method and apparatus, storage medium, and electronic device
WO2024093443A1 (zh) Information display method and apparatus based on voice interaction, and electronic device
WO2023151589A1 (zh) Video display method and apparatus, electronic device, and storage medium
WO2023279843A1 (zh) Content search method and apparatus, device, and storage medium
US20230229382A1 (en) Method and apparatus for synchronizing audio and text, readable medium, and electronic device
US11853353B2 (en) Music pushing method, apparatus, electronic device and storage medium
WO2022105760A1 (zh) Multimedia browsing method and apparatus, device, and medium
WO2023016349A1 (zh) Text input method and apparatus, electronic device, and storage medium
US20240061899A1 (en) Conference information query method and apparatus, storage medium, terminal device, and server
WO2024016902A1 (zh) Multimedia playback method and device, storage medium, and program product
CN112291614A (zh) Video generation method and apparatus
WO2021218680A1 (zh) Interactive information processing method and apparatus, electronic device, and storage medium
US20240121485A1 (en) Method, apparatus, device, medium and program product for obtaining text material
CN117171406A (zh) Method and apparatus for recommending application program functions, device, and storage medium
WO2022257777A1 (zh) Multimedia processing method and apparatus, device, and medium
CN116049490A (zh) Material search method and apparatus, and electronic device
TWI739633B (zh) Storage and reading method, electronic device, and computer-readable storage medium
CN115167966A (zh) Lyrics-based information prompting method and apparatus, device, medium, and product
CN111339770B (zh) Method and apparatus for outputting information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23884382

Country of ref document: EP

Kind code of ref document: A1