CN114936001A - Interaction method and device and electronic equipment - Google Patents


Info

Publication number
CN114936001A
CN114936001A
Authority
CN
China
Prior art keywords
conference
content
display area
text
original text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210390594.6A
Other languages
Chinese (zh)
Inventor
黄园园
姚雪萍
沈梦飞
芮琳
林居颖
郭静雅
郑坤坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210390594.6A priority Critical patent/CN114936001A/en
Publication of CN114936001A publication Critical patent/CN114936001A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0487Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
    • G06F3/0488Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • G06F3/04883Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures for inputting data by handwriting, e.g. gesture or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04842Selection of displayed objects or displayed text elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04847Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/02Details
    • H04L12/16Arrangements for providing special services to substations
    • H04L12/18Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/02Details
    • H04L12/16Arrangements for providing special services to substations
    • H04L12/18Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L12/1831Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40Support for services or applications
    • H04L65/403Arrangements for multi-party communication, e.g. for conferences

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present application disclose an interaction method, an interaction apparatus, and an electronic device. The method includes: providing a conference summary interactive interface during an audio and video conference, wherein the conference summary interactive interface includes an original text display area and a summary display area, the original text display area displays text content converted in real time from voice content collected during the conference, and the summary display area displays conference summary content; and responding to an interactive operation performed by a user through the conference summary interactive interface, so as to generate the conference summary content and display it in the summary display area. The embodiments of the present application can improve the efficiency with which a user obtains information.

Description

Interaction method and device and electronic equipment
Technical Field
The present application relates to the field of audio and video conferencing technologies, and in particular, to an interaction method, an interaction device, and an electronic device.
Background
With the development of the internet and the mobile internet, more and more conferences are held online. An online conference (including a video conference, a telephone conference, and the like) is an efficient remote communication tool that brings together participants in different locations over a network, enabling multi-modal communication such as audio and video exchange, data interaction, remote sharing, and collaboration. Online conferences thus break through the limitations of time and space and meet the various requirements of modern meetings; obtaining conference information over the internet has become a trend in related industries.
To meet user needs, many online meeting tools provide a function for recording the meeting and can transcribe the speech into text after the meeting ends, helping users review the meeting content or helping users who could not attend obtain the meeting information. However, because a single online meeting may be long, reviewing even the transcribed text may take a great deal of time, so the efficiency with which users obtain information remains low.
Disclosure of Invention
The present application provides an interaction method, an interaction apparatus, and an electronic device, which can improve the efficiency with which a user obtains information.
The application provides the following scheme:
an interaction method, comprising:
providing a conference summary interactive interface in the process of an audio and video conference, wherein the conference summary interactive interface comprises an original text display area and a summary display area, the original text display area is used for displaying text contents obtained by real-time conversion according to voice contents acquired in the process of the conference, and the summary display area is used for displaying the conference summary contents;
and responding to an interactive operation performed by the user through the conference summary interactive interface, so as to generate the conference summary content and display it in the summary display area.
Wherein the responding to the interactive operation executed by the user through the conference summary interactive interface comprises:
receiving a text input operation performed by a user through the summary display area, so as to determine the received input content as the conference summary content.
Wherein the responding to the interactive operation executed by the user through the conference summary interactive interface comprises:
receiving, through the original text display area, an operation performed by a user of selecting target text content and extracting it as conference summary content, so as to determine the target text content as the conference summary content and display it in the summary display area.
Wherein the responding to the interactive operation executed by the user through the conference summary interactive interface comprises:
receiving an operation of selecting and marking target text content performed by a user through the original text display area, and recording position information of the marked target text content in the original text;
providing operation options for screening based on marks in the conference summary interactive interface;
and after the screening operation of the user is received through the operation options, positioning an original text position corresponding to the target text content in the original text display area for display so as to input the conference summary content according to the original text corresponding to the target text content.
Wherein the method further comprises:
recording the position information of the marked target text content on a corresponding audio time axis;
after the audio and video conference is finished and a request for displaying the conference summary interactive interface is received, providing an audio playing control bar in the conference summary interactive interface according to the audio time axis, and displaying a mark at a position corresponding to the position information in the audio playing control bar;
and after receiving an operation of marking a mark in the audio playing control bar, positioning the original text position corresponding to the target text content in the original text display area for display.
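The marking mechanism described in the clauses above — recording a marked span both by its position in the original text and by its timestamp on the audio timeline, then resolving a click on the playback bar back to the transcript span — could be sketched roughly as follows. This is a minimal illustrative sketch; the class and method names are hypothetical and not part of the claimed implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Mark:
    """A user mark on the transcript: character span plus audio timestamp."""
    text_start: int    # offset of the marked span in the original transcript
    text_end: int
    audio_ts: float    # seconds on the conference audio timeline

@dataclass
class MarkStore:
    marks: list = field(default_factory=list)

    def add(self, text_start, text_end, audio_ts):
        self.marks.append(Mark(text_start, text_end, audio_ts))

    def timeline_positions(self, total_duration):
        """Relative positions (0..1) at which to draw mark dots on the play bar."""
        return [m.audio_ts / total_duration for m in self.marks]

    def locate_from_timeline(self, clicked_ts, tolerance=2.0):
        """Map a click near a mark on the play bar back to its transcript span,
        so the original text display area can scroll to that position."""
        for m in self.marks:
            if abs(m.audio_ts - clicked_ts) <= tolerance:
                return (m.text_start, m.text_end)
        return None
```

A client could call `timeline_positions` once when rendering the audio playing control bar, and `locate_from_timeline` on each click, then scroll the original text display area to the returned span.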
Wherein the method further comprises:
and after the audio and video conference is finished and a request for displaying the conference summary interactive interface is received, providing the conference summary content generated through an intelligent algorithm in the conference summary interactive interface.
Wherein the conference summary content generated by the intelligent algorithm comprises: keywords for the meeting, agenda directory, focus content, and/or backlog information.
Wherein the agenda directory is generated by:
performing natural language understanding on a plurality of text contents obtained by conversion, and dividing the plurality of text contents into a plurality of paragraphs after aggregating according to sentence relevance;
extracting text content representing the topic of the paragraph from each paragraph respectively to generate agenda topics of a plurality of agendas to be displayed in the agenda catalog.
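The agenda-directory generation described above — aggregating converted sentences into paragraphs by sentence relevance, then extracting a topic from each paragraph — could be approximated with a simple sketch like the one below. A real implementation would use a trained natural-language-understanding model; here, word-overlap (Jaccard) similarity stands in for sentence relevance and the first sentence of each paragraph stands in for topic extraction, purely for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences, as a crude stand-in
    for a learned sentence-relevance model."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def split_into_agenda(sentences, threshold=0.2):
    """Aggregate adjacent sentences whose similarity exceeds the threshold
    into one paragraph; start a new paragraph when relevance drops."""
    paragraphs = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) >= threshold:
            paragraphs[-1].append(cur)
        else:
            paragraphs.append([cur])
    return paragraphs

def agenda_topics(paragraphs):
    """Use each paragraph's first sentence as a stand-in agenda topic."""
    return [p[0] for p in paragraphs]
```

The resulting list of topics would populate the agenda directory, with one entry per detected paragraph.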
Wherein the method further comprises:
recording the position information of the paragraph corresponding to the agenda topic in the original text;
after receiving an operation on one agenda topic in the agenda directory, locating, in the original text display area, the position in the original text of the paragraph corresponding to the agenda topic for display.
Wherein the method further comprises:
recording time period information corresponding to the paragraph corresponding to the agenda topic on an audio time axis;
providing an audio playing control bar in the conference summary interactive interface according to the audio time axis, and displaying the audio playing control bar in segments, wherein the position and the length of each segment are related to the time period information;
after receiving an operation on one of the target segments on the audio playing control bar, determining a target agenda topic corresponding to the target segment;
and locating, in the original text display area, the position in the original text of the paragraph corresponding to the target agenda topic for display, and positioning the corresponding audio data to the time period corresponding to the target agenda topic for playback.
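The segmented audio playing control bar described above — one segment per agenda topic, with position and length derived from the topic's time period — could be sketched as follows. The function names are hypothetical illustrations of the mapping, not the claimed implementation.

```python
def build_segments(agenda_spans, total_duration):
    """agenda_spans: list of (topic, t_start, t_end) in seconds.
    Returns (topic, rel_start, rel_length) tuples for drawing each
    segment of the audio playing control bar."""
    return [(topic, start / total_duration, (end - start) / total_duration)
            for topic, start, end in agenda_spans]

def segment_at(agenda_spans, clicked_ts):
    """Resolve a click on the bar to the agenda topic whose time span
    contains the clicked timestamp; playback and the original text view
    can then both be positioned to that topic."""
    for topic, start, end in agenda_spans:
        if start <= clicked_ts < end:
            return topic
    return None
```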
An interaction device, comprising:
the audio and video conference system comprises an interactive interface providing unit, a conference summary interactive interface providing unit and a conference summary interactive interface providing unit, wherein the conference summary interactive interface comprises an original text display area and a summary display area, the original text display area is used for displaying text contents obtained by real-time conversion according to voice contents acquired in the conference process, and the summary display area is used for displaying the conference summary contents;
and an interactive operation receiving unit, configured to respond to an interactive operation performed by a user through the conference summary interactive interface, so as to generate the conference summary content and display it in the summary display area.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the preceding claims.
An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding claims.
According to the specific embodiments provided by the application, the application discloses the following technical effects:
According to the embodiments of the present application, during an audio and video conference, speech can be converted to text in real time, and the converted text content is displayed in the original text display area of the conference summary interactive interface. The conference summary interactive interface further includes a summary display area, so that during the conference, specific conference summary content can be determined through interactive operations initiated by the user on this interface (including direct input in the summary display area, or one-key extraction of target text content selected in the original text display area) and displayed in the summary display area. This helps the user organize the conference summary in time while participating in the conference, makes it convenient to quickly review the summary content afterwards, and improves the efficiency with which the user obtains information.
In addition, conference summary content can be generated automatically by an intelligent algorithm and can complement the summary content input or extracted by the user, helping the user obtain information efficiently while providing more comprehensive information.
Moreover, marked points can be displayed on the audio playing control bar, or the control bar can be divided into multiple segments according to automatically identified agenda paragraphs, helping the user trace back marked content or a specific topic more efficiently.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.
Drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a flow chart of a method provided by an embodiment of the present application;
FIGS. 3-1 and 3-2 are schematic diagrams of the summary interactive interface during a conference, provided by an embodiment of the present application;
FIGS. 4-1 to 4-3 are schematic diagrams of the summary interactive interface after a conference ends, provided by an embodiment of the present application;
FIG. 5 is a schematic view of an apparatus provided by an embodiment of the present application;
fig. 6 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments derived by those of ordinary skill in the art from the embodiments given herein without creative effort fall within the scope of protection of the present application.
To facilitate understanding of the technical solutions provided in the embodiments of the present application, it should be noted that some existing audio and video conference tools provide a conference summary function; that is, they help a user sort out the agenda, important content, and the like of a conference, so that when reviewing the conference, the user can browse quickly according to the summary content, improving the efficiency of obtaining information. However, such existing tools mainly display the recording of the conference through a relevant interface after the conference has ended; the recording can be converted into text content, from which the user then compiles the conference summary for subsequent review or to help others quickly grasp the main content of the conference. That is, in the prior art, a user can only generate a conference summary after a specific audio/video conference has ended.
In the embodiments of the present application, to further help a user generate a conference summary in time, a specific conference summary interactive interface can be provided during the audio and video conference, so that the user can generate the summary while participating in the conference — for example, by inputting summary content directly in the interface — achieving the effect of taking notes while the meeting is in progress. Moreover, real-time speech recognition can be performed on the voice data generated in the conference; the data is converted into text content and displayed in the conference summary interactive interface, so that the user can quickly build the summary by, for example, extracting passages from the text. In this way, the summary is produced as the meeting progresses, without waiting until the meeting ends. In addition, the following situation may exist in practice: some text content converted during the conference may be long, and although the user pays attention to it, there may be no passage that can be extracted directly into the summary; the content needs to be reorganized before being written into the summary. For this situation, the embodiments of the present application may further provide a function of marking the converted text content, recording information such as the position of the marked content in the original text and on the speech timeline, so that after the meeting (or on other occasions) the user can find the corresponding original text by filtering or tracing back the marks, and then compile the specific summary content accordingly.
In addition, the original text of the recognized content can be modified, and common words can be added to improve the accuracy of the text converted during speech recognition.
Besides providing the user with functions such as inputting and extracting the meeting summary during or after the meeting, the embodiments of the present application can also intelligently generate summary content. That is, in addition to content the user inputs or extracts, the specific tool can automatically generate summary content through an intelligent algorithm — for example, extracting keywords, an agenda directory, focus content, backlog items, and the like. The two kinds of summary content complement each other, so that the main information of the conference can be obtained quickly and more comprehensively. For summary content automatically generated by the algorithm, its position in the original text, its position on the video timeline, and the like can also be recorded, so that the original text or the specific recording can be traced back from the summary content.
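The intelligent generation of summary content mentioned above (keywords, backlog items, and the like) could be sketched in a very simplified form as follows. The patent does not specify the algorithm; this illustration uses plain word-frequency for keywords and cue-phrase matching for backlog detection, and the stopword and cue lists are hypothetical placeholders.

```python
from collections import Counter
import re

# Hypothetical placeholder lists; a real system would use a trained model.
STOPWORDS = {"the", "a", "to", "and", "of", "we", "will", "is"}
TODO_CUES = ("todo", "action item", "follow up")

def extract_keywords(text, top_n=3):
    """Rank non-stopword tokens by frequency as candidate meeting keywords."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]

def extract_backlog(sentences):
    """Collect sentences containing to-do cue phrases as backlog items."""
    return [s for s in sentences if any(c in s.lower() for c in TODO_CUES)]
```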
When the conference record interface is displayed after the conference ends, the display style of the audio playing control bar can be improved according to the mark information added by the user during the conference, or according to information such as the agenda automatically identified by the intelligent algorithm. For example, marked positions can be shown on the control bar, or the bar can be displayed as multiple segments, each corresponding to one agenda topic, making it convenient for the user to trace back specific marked content or an agenda topic directly on the audio bar.
From the perspective of system architecture, referring to FIG. 1, the embodiments of the present application may provide the conference summary function in an audio and video conference application implemented in a client/server form. The client may exist as a web page or as an independent application; in a preferred embodiment, the client runs on a terminal device such as a notebook computer, making it convenient for the user to input and extract summary content. Interface display functions are implemented by the client: for example, when an audio and video conference starts, the user can be asked whether the real-time conference summary function should be enabled; if so, the corresponding conference summary interactive interface is provided, and after the conference ends an entry for displaying the interface can also be provided. Real-time speech recognition during the conference, intelligent generation of summary content after the conference, and the like can be completed by the server.
The following describes in detail specific implementations provided in embodiments of the present application.
First, the embodiment of the present application provides an interaction method from the perspective of the aforementioned client, and referring to fig. 2, the method may include:
s201: in the process of carrying out the audio and video conference, a conference summary interaction interface is provided, the conference summary interaction interface comprises an original text display area and a summary display area, the original text display area is used for displaying text contents obtained by real-time conversion according to voice contents acquired in the process of carrying out the conference, and the summary display area is used for displaying the conference summary contents.
In this embodiment of the present application, the real-time conference summary request may be initiated by any participant of the audio/video conference. That is, assuming a conference started through the audio/video conference tool provided herein has N participants, any of them may initiate a real-time conference summary request before or during the conference (an operation option for initiating the request can be provided in the relevant client interface). Each participant may need to record some key content during the conference for review afterwards; in other words, each participant may initiate a real-time summary request in his or her own client and generate a separate conference summary.
After receiving a real-time conference summary request submitted by a user, a conference summary interactive interface may be provided in that user's client during the audio and video conference. Referring to FIG. 3-1, the interface may include two regions. The region shown at 31 is the original text display area, which displays the original text obtained by real-time speech recognition during the conference. In a specific implementation, the speaker of each piece of speech can be identified, so there may be multiple original-text passages, each corresponding to the recognition result of a single speaker's utterance. If voiceprints of the participants are stored in advance, a speaker's name can be determined through voiceprint matching and displayed in the original text display area; otherwise, speakers can simply be labeled Speaker 1, Speaker 2, and so on according to their distinct voice characteristics. The interface may also include a summary display area, shown at 32, used to display the specific conference summary content.
The original text content may be provided by the server. That is, during the conference, the audio data generated in the conference can be submitted to the server, which converts it into text content using a specific streaming speech recognition algorithm and then distributes the text to the client of each participant using the real-time summary service, for display in the original text display area of the conference summary interactive interface.
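The server-side fan-out described above — distributing each recognized text segment to every client that enabled the real-time summary function — could be sketched as a simple publish/subscribe broker. This is a hypothetical server-side illustration; the class name and data shape are assumptions, and a real system would deliver segments over a network connection rather than in-process lists.

```python
class TranscriptBroker:
    """Fan-out of recognized text segments to every subscribed client."""
    def __init__(self):
        self.subscribers = {}   # client_id -> list of received segments

    def subscribe(self, client_id):
        """A participant's client enables the real-time summary function."""
        self.subscribers[client_id] = []

    def publish(self, speaker, text):
        """The streaming recognizer emits a segment; push it to all clients
        so each original text display area updates in real time."""
        segment = {"speaker": speaker, "text": text}
        for received in self.subscribers.values():
            received.append(segment)
        return segment
```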
The specific speech recognition can be implemented with any of various streaming speech recognition algorithms, as long as the algorithm can recognize the speech stream generated in real time and convert it into text in real time. Furthermore, in this embodiment, an operation option for adding common words may be provided in the conference summary interactive interface, so that the user can input and store common words of the current conference (including names of people and the like); the server can then use these common words during speech recognition to improve its accuracy.
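One simple way the common-word mechanism above might work — hedged as an assumption, since the patent does not describe the algorithm — is a post-recognition correction pass that snaps recognized tokens to a close-matching user-registered common word, using fuzzy string matching from the standard library:

```python
import difflib

def apply_common_words(recognized, common_words, cutoff=0.75):
    """Replace recognized tokens with a close-matching user-registered
    common word (e.g. a participant name the recognizer tends to garble).
    Tokens with no sufficiently close match are left unchanged."""
    out = []
    for token in recognized.split():
        match = difflib.get_close_matches(token, common_words, n=1, cutoff=cutoff)
        out.append(match[0] if match else token)
    return " ".join(out)
```

In practice a streaming recognizer would more likely bias its decoding toward the common words directly, but a post-pass like this conveys the idea.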
In addition, in this embodiment, the cursor can be moved directly into the original text display area to modify the text recognized by the algorithm. Because the recognition result may contain inaccuracies, and in the prior art the recognized text is not editable, this embodiment allows the user's modification operations to be received while the text is being recognized and displayed in real time, so that mis-recognized content can be corrected promptly. In a specific implementation, after a user modifies the original text, the modification can be submitted to the server, which synchronizes it to the clients of the other participants using the real-time summary function, achieving multi-client synchronized modification of the text content.
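The multi-client synchronized modification described above could be sketched as a last-writer-wins edit log: each correction is applied locally and appended to a log the server broadcasts, and any client can rebuild the current transcript by replaying the log. This is a deliberately minimal illustration under that assumption; a production system would likely need operational transformation or CRDTs to handle concurrent edits.

```python
class SharedTranscript:
    """Minimal last-writer-wins sync of transcript corrections."""
    def __init__(self, segments):
        self.segments = list(segments)   # segment index -> current text
        self.log = []                    # edit log broadcast to all clients

    def edit(self, client_id, index, new_text):
        """A client corrects a mis-recognized segment; the edit is recorded
        so the server can push it to every other participant's client."""
        self.segments[index] = new_text
        self.log.append((client_id, index, new_text))

    def replay(self, initial_segments):
        """Any client can rebuild the current text by replaying the log."""
        segs = list(initial_segments)
        for _, index, text in self.log:
            segs[index] = text
        return segs
```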
S202: responding to an interactive operation performed by a user through the conference summary interactive interface, so as to generate the conference summary content and display it in the summary display area.
Regarding the specific conference summary content displayed in the summary display area, in the embodiment of the present application, a variety of interactive operation entries can be provided to the user through the conference summary interactive interface, so that the specific conference summary content is generated in the course of interacting with the user. For example, in one mode, a text input function may be provided directly in the summary display area; that is, a text input operation performed by the user may be received through the summary display area, and the received input content determined as the conference summary content. In this way, during the conference, the user can type important content that needs to be recorded into the summary display area, where it is displayed directly as the conference summary content.
Alternatively, in another mode, because the conference summary interactive interface also includes an original text display area, a "one-key extraction" function can be provided, so that the user can extract part of the text content recognized by the algorithm to serve as the specific conference summary content. That is, an operation of selecting target text content and extracting it as conference summary content, performed by the user through the original text display area, may be received, so that the target text content is determined as the conference summary content and displayed in the summary display area. For example, as shown in fig. 3-2, assuming that the user selects the target text content shown at 33 in the original text display area, an operation panel may pop up containing options such as "one-key extraction"; after the user selects "one-key extraction", the currently selected target text content is extracted to the summary display area on the right, as shown at 34 in fig. 3-2.
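"One-key extraction" amounts to copying a user-selected span of the recognized original text into the summary display area's content; a minimal sketch, with all names assumed for illustration:

```python
def extract_to_summary(original_text, start, end, summary_contents):
    """'One-key extraction': copy the selected span of the recognized
    original text into the summary display area's content list."""
    target = original_text[start:end].strip()
    if target:
        summary_contents.append(target)
    return target

summary_contents = []
original = "Q3 revenue grew 12%. Next we discuss hiring."
extract_to_summary(original, 0, 20, summary_contents)
print(summary_contents)  # → ['Q3 revenue grew 12%.']
```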
In addition, in practical applications, the following situation may arise: a certain part of the original text recognized by the algorithm may be important to the current user, but because it may be lengthy, it is better suited for use as conference summary content only after being reorganized. Such reorganization takes time, however, and since the conference is still in progress, doing it immediately might cause the user to miss other conference content; it may therefore need to be done after the conference ends or at some other free moment. Accordingly, in the embodiment of the present application, a marking function can be provided: the user can select and mark, in the original text display area, text content that requires attention, while the tool records the position information of the marked target text content in the original text. An operation option for filtering based on marks may also be provided in the conference summary interactive interface. If the user later needs to review previously marked content, filtering can be initiated through this option, and the original text position corresponding to the target text content is then located in the original text display area for display, so that the user can input conference summary content based on the original text corresponding to the marked target text content.
For example, the marking operation option may be presented after the user selects target text content, or the entire text content item may be selected automatically and the marking operation option presented after the user hovers the mouse over it; the specific marking operation option may be as shown at 35 in fig. 3-2, and so on. After the user clicks the marking operation option, the marking of the currently selected target text content is complete. It should be noted that, in a specific implementation, marking operation options in several different colors may be provided, and the user can choose among them as needed. For example, text content of the same type may be marked with the same color, which facilitates subsequent filtering of the marked content by category. In addition, an operation option for filtering the original text according to the marks can be provided in the conference summary interactive interface; after the user selects this option, the marked original text can be located and displayed, or only the marked original text can be displayed.
In addition, when a mark is added, the server may also record the position of the marked target text content on the corresponding audio time axis, that is, the minute and second at which the marked content occurs in the conference audio. Thus, after the audio/video conference ends and the conference summary interactive interface is displayed, an audio playing control bar can be provided in the interface according to the audio time axis, with mark information displayed in the control bar at the positions corresponding to the recorded position information. In this way, after an operation on a piece of mark information in the audio playing control bar is received, the original text position corresponding to the target text content can be located in the original text display area for display, and of course the audio can also be positioned to the corresponding point for playback.
For example, as shown in fig. 4-1, when the conference summary interactive interface is displayed after the conference ends, the content in the interface is richer. An audio playing control bar is displayed at the position shown at 41. Since a user may have marked part of the text content during the conference, in order to facilitate quickly locating the marked content, the marked text content and its positions may be determined before the audio playing control bar is displayed, so that the marks can be shown on the control bar. For example, as shown at 42 in fig. 4-1, when the user moves the operation focus to such a mark, prompt information about the marked content may be displayed; the user can then click the mark to jump to the position of the marked text content for display, and the audio data can also be positioned to the corresponding point for playback. The marked key content can subsequently be added to the conference summary after further reorganization.
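The mark-backtracking behavior described above can be sketched as a record storing both the text offset and the audio-timeline position of each mark, from which marker positions on the audio playing control bar are computed; the `Mark` fields and the pixel mapping are illustrative assumptions, not the patent's data model.

```python
from dataclasses import dataclass

@dataclass
class Mark:
    text_offset: int      # character position of the marked content in the original text
    audio_seconds: float  # position of the marked content on the audio time axis
    color: str            # color chosen by the user when marking

def marker_positions(marks, total_seconds, bar_width_px):
    """Map each mark's audio-time position to a pixel offset on the control bar."""
    return [round(m.audio_seconds / total_seconds * bar_width_px) for m in marks]

marks = [Mark(120, 90.0, "red"), Mark(640, 1800.0, "yellow")]
print(marker_positions(marks, total_seconds=3600, bar_width_px=600))  # → [15, 300]
```

Clicking a rendered marker would then use `text_offset` to scroll the original text display area and `audio_seconds` to seek the player.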
In addition, as described above, besides generating conference summary content through manual input or "one-key extraction", in the embodiment of the present application some conference summary content may also be generated automatically by an intelligent algorithm. In a specific implementation, after the audio/video conference ends and a request to display the conference summary interactive interface is received, the conference summary content generated by the intelligent algorithm can be provided in the interface. This content may take various forms, for example: keywords of the conference, an agenda directory, key content, and/or to-do information. That is to say, after the conference ends, the conference summary interactive interface can display, in addition to the summary content that the user generated during the conference by manual input or "one-key extraction", or input after the conference by reorganizing the marked content, some conference summary content generated automatically by an intelligent algorithm.
The keywords may be words that appear frequently in the text content of the conference, or words that fit the conference topic, and the like. This can be implemented with techniques related to keyphrase extraction.
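A naive frequency-based sketch of the keyword idea follows; real keyphrase extraction would use techniques such as TF-IDF or TextRank, and the stopword list and function name here are purely illustrative assumptions.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "we", "is", "in"}

def extract_keywords(text, top_n=3):
    """Naive keyphrase extraction: the most frequent non-stopword tokens."""
    tokens = [t for t in text.lower().split()
              if t.isalpha() and t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]

print(extract_keywords("budget review budget roadmap budget roadmap hiring", top_n=2))
```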
The agenda directory lists the topics covered in the conference. In a specific implementation, there are multiple ways to generate the agenda directory automatically. For example, in one mode, natural language understanding may be performed on the plurality of converted text contents, which are aggregated according to sentence relevance and thereby divided into a plurality of paragraphs; then, text content representing the topic of each paragraph can be extracted to generate the agenda topics displayed in the agenda directory. This can be implemented with techniques related to title generation.
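The paragraph-division step can be approximated with a simple lexical-similarity heuristic: start a new paragraph whenever a sentence shares little vocabulary with its predecessor. The Jaccard measure and threshold below stand in for the natural language understanding the text describes and are purely illustrative.

```python
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def segment_by_similarity(sentences, threshold=0.2):
    """Group consecutive sentences into paragraphs, starting a new paragraph
    when a sentence has low lexical overlap with the previous one."""
    paragraphs = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) >= threshold:
            paragraphs[-1].append(cur)
        else:
            paragraphs.append([cur])
    return paragraphs

sentences = [
    "the budget grew this year",
    "the budget covers hiring",
    "now the product roadmap",
]
print(len(segment_by_similarity(sentences)))  # → 2
```

A title-generation model would then be applied to each resulting paragraph to produce its agenda topic.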
The key content may be determined mainly through natural language understanding combined with keywords and the like. For example, if a piece of text content begins with "we re-emphasize", the content that follows can generally be extracted as key content. This can be implemented with techniques related to key sentence extraction.
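A minimal cue-phrase sketch of key sentence extraction, assuming a hand-picked phrase list rather than the trained model a real system would use:

```python
CUE_PHRASES = ("we re-emphasize", "the key point is", "please note that")

def extract_key_sentences(sentences):
    """Treat a sentence as key content when it opens with a cue phrase."""
    return [s for s in sentences if s.lower().startswith(CUE_PHRASES)]

print(extract_key_sentences([
    "We re-emphasize that the deadline is firm",
    "Lunch will be provided",
]))
```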
The to-do content may likewise be extracted by keyword recognition. For example, if keywords related to time, place, and the like appear in a piece of text content, it can be extracted as conference summary content of the to-do class. This can be implemented with techniques related to action (to-do) detection.
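To-do detection by time-related keywords can be sketched as a regular-expression filter; the pattern below covers only a few English deadline expressions and is an illustrative assumption, not the patent's detector.

```python
import re

# Deadline-like time expressions; a real system would also detect places, owners, etc.
TODO_PATTERN = re.compile(
    r"\b(by|before|due)\s+(monday|tuesday|wednesday|thursday|friday|"
    r"next week|tomorrow|\d{1,2}/\d{1,2})\b",
    re.IGNORECASE,
)

def detect_todos(sentences):
    """Flag sentences containing a deadline-like expression as to-do items."""
    return [s for s in sentences if TODO_PATTERN.search(s)]

print(detect_todos(["Send the report by Friday", "Revenue grew 12%"]))
```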
The above intelligently generated conference summary content can be displayed directly in the conference summary interactive interface; for example, an intelligent summary display area can be provided directly in the interface, with each piece of intelligently generated content shown there. Alternatively, an operation option for triggering the display of the intelligent summary content can be provided in the interface; after the user requests to view the intelligent summary content, an intelligent summary display area is created in the interface and displayed. Or, in another mode, part of the intelligently generated summary content can be displayed directly, with the rest displayed after the user makes a request. For example, as shown at 43 in fig. 4-2, the keywords and agenda directory may be displayed directly in the conference summary interactive interface, specifically at the top of the original text display area. In addition, 44 in fig. 4-2 shows an operation option for requesting to view more intelligent summary content; after the user makes a request through this option, the conference summary interactive interface may appear as shown in fig. 4-3. That is, the interface may be divided into three areas: an original text display area 45, an intelligent display area 46 (mainly for displaying intelligently generated summary content), and a summary display area 47 (mainly for displaying summary content manually input or extracted by the user).
In addition, in a specific implementation, keyword-based search may also be supported; for example, after the user clicks a keyword, each piece of original text related to that keyword can be retrieved and displayed in the original text display area. The position information of the paragraph corresponding to each agenda topic in the original text can also be recorded, so that after an operation on one of the agenda topics in the agenda directory is received, the corresponding paragraph can be located and displayed at its position in the original text display area; of course, the audio playing control bar can also be positioned to the corresponding point for playback. For example, after a certain agenda topic is clicked, an operation panel containing options such as "backtrack" may be provided; if the user clicks "backtrack", the original text display area jumps to the paragraph corresponding to that agenda topic. Similarly, for key content and to-do items, options such as "backtrack" may be provided after one is selected, so that the user can jump to the corresponding original text position for viewing. For to-do items, an operation option for importing them into an office instant messaging tool may also be provided, so that the to-do content can be imported into the tool to generate a specific to-do notification, and so on.
In addition, regarding the agenda directory, each agenda item corresponds to one or more paragraphs of the whole conference. The time period on the audio time axis corresponding to the paragraphs of each agenda topic can therefore be recorded, so that when an audio playing control bar is provided in the conference summary interactive interface according to the audio time axis, the bar can be displayed in segments whose positions and lengths correspond to the time periods of the respective agenda items. Furthermore, after an operation on one of the target segments of the audio playing control bar is received, the target agenda topic corresponding to that segment can be determined; the paragraph corresponding to the target agenda topic is then located and displayed at its position in the original text display area, and the audio data is positioned to the time period corresponding to the target agenda topic for playback.
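Mapping a click on the segmented audio playing control bar back to its agenda topic and playback position can be sketched as follows, assuming each agenda item's time span on the audio time axis has been recorded; the data layout and the choice to seek to the start of the segment are illustrative assumptions.

```python
from bisect import bisect_right

def segment_for_click(click_ratio, agenda_spans):
    """agenda_spans: (topic, start_sec, end_sec) tuples covering the whole audio.
    Map a click at click_ratio (0.0-1.0) along the bar to the agenda topic,
    seeking to the start of that agenda's time period."""
    total_seconds = agenda_spans[-1][2]
    t = click_ratio * total_seconds
    starts = [start for _, start, _ in agenda_spans]
    i = max(bisect_right(starts, t) - 1, 0)
    topic, start, _ = agenda_spans[i]
    return topic, start

spans = [("Opening", 0, 300), ("Budget", 300, 1500), ("Roadmap", 1500, 3600)]
print(segment_for_click(0.5, spans))  # → ('Roadmap', 1500)
```

The returned topic can also be used to locate the corresponding paragraph in the original text display area.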
It should be noted that, when the conference summary interactive interface is displayed after the conference ends, an operation option such as "re-edit" can be provided; after the user clicks it, the interface re-enters the editing state. For each piece of intelligently generated conference summary content, options such as "one-key extraction" can also be provided after the user selects a specific piece, so that the user can further extract content to the summary display area for display. Manual addition by the user may also be supported for the conference agenda, key content, to-do items, and the like. Moreover, a sharing operation option can be provided, allowing the user to share the generated conference summary content with other users; when sharing, the user can choose to share only the conference summary content, or to share it together with the original text, audio, and so on.
It should be noted that, in practical applications, the embodiment of the present application can also be applied to the transcription and summary generation of audio/video files. Specifically, operation entries for audio/video file transcription and summary generation can be provided; after the user selects such an entry, the audio/video file to be processed can be uploaded, and a summary interactive interface is then displayed containing an original text display area and a summary display area. The original text display area mainly shows the speech-recognized original text, and functions such as "one-key extraction" and intelligent summary generation can also be provided.
In short, according to the embodiment of the present application, real-time speech-to-text conversion can be performed during an audio/video conference, and the converted text content displayed in the original text display area of the conference summary interactive interface. The interface can further include a summary display area, so that during the conference, specific conference summary content can be determined through interactive operations initiated by the user in the interface (including direct input in the summary display area, or selecting target text content in the original text display area for one-key extraction, and so on) and displayed in the summary display area. This helps the user organize the conference summary in a timely manner while participating in the conference, makes it convenient to review the summary content quickly after the conference, and improves the efficiency with which the user obtains information.
In addition, conference summary content can be generated automatically by an intelligent algorithm and can complement the content input or extracted by the user, helping the user obtain information more efficiently and providing more comprehensive information.
Moreover, displaying the marked points on the audio control bar, or dividing the bar into segments according to the automatically identified agenda paragraphs, helps the user backtrack to marked content or a specific topic more efficiently.
It should be noted that the embodiments of the present application may involve the use of user data. In practical applications, user-specific personal data may be used in the scheme described herein within the scope permitted by the applicable laws and regulations of the relevant country, and subject to their requirements (for example, with the user's explicit consent, after informing the user, and so on).
Corresponding to the foregoing method embodiment, an embodiment of the present application further provides an interaction apparatus, and referring to fig. 5, the apparatus may include:
the interactive interface providing unit 501 is configured to provide a conference summary interactive interface in an audio and video conference, where the conference summary interactive interface includes an original text display area and a summary display area, the original text display area is used to display text content obtained by real-time conversion according to voice content acquired in the conference process, and the summary display area is used to display the conference summary content;
an interactive operation receiving unit 502, configured to respond to an interactive operation performed by a user through the conference summary interactive interface, so as to generate the conference summary content and display the content to the summary display area.
In a specific implementation, the interoperation receiving unit may specifically be configured to:
receiving a text input operation performed by a user through the summary presentation area to determine the received input content as the conference summary content.
Or, the interoperation receiving unit may specifically be configured to:
receiving operation, executed by a user, of selecting target text content and extracting the target text content as conference summary content through the original text display area so as to determine the target text content as the conference summary content and display the conference summary content in the summary display area.
Or, the interoperation receiving unit may specifically be configured to:
receiving the operation of selecting and marking the target text content executed by a user through the original text display area, and recording the position information of the marked target text content in the original text;
providing operation options for screening based on marks in the conference summary interactive interface;
and after the screening operation of the user is received through the operation options, positioning an original text position corresponding to the target text content in the original text display area for display so as to input the conference summary content according to the original text corresponding to the target text content.
In the above case, the apparatus may further include:
a mark position recording unit for recording the position information of the marked target text content on the corresponding audio time axis;
the mark display unit is used for providing an audio playing control bar in the conference summary interactive interface according to the audio time axis after receiving a request for displaying the conference summary interactive interface after the audio and video conference is finished, and displaying a mark at a position corresponding to the position information in the audio playing control bar;
and the first original text positioning display unit is used for positioning the original text position corresponding to the target text content in the original text display area for display after receiving the operation of a mark in the audio playing control bar.
In addition, the apparatus may further include:
and the intelligent summary generation unit is used for providing the conference summary content generated by an intelligent algorithm in the conference summary interaction interface after receiving a request for displaying the conference summary interaction interface after the audio and video conference is finished.
Specifically, the conference summary content generated by the intelligent algorithm includes: keywords for the meeting, agenda directory, focus content, and/or backlog information.
Wherein the agenda directory may be generated by:
performing natural language understanding on a plurality of text contents obtained by conversion, and dividing the plurality of text contents into a plurality of paragraphs after aggregating according to sentence relevance;
extracting text content representing the topic of the paragraph from each paragraph respectively to generate agenda topics of a plurality of agendas to be displayed in the agenda catalog.
In addition, the apparatus may further include:
a first agenda location recording unit, configured to record location information of a paragraph in a text corresponding to the agenda topic;
and the second original text positioning and displaying unit is used for positioning the paragraph corresponding to the agenda topic in the original text displaying area to display the paragraph at the position in the original text after receiving the operation on one of the agenda topics in the agenda directory.
In addition, the apparatus may further include:
the second agenda position recording unit is used for recording time period information corresponding to the paragraph corresponding to the agenda subject on an audio time axis;
the audio bar segmentation display unit is used for providing an audio playing control bar in the conference summary interactive interface according to the audio time axis and performing segmentation display on the audio playing control bar, wherein the position and the length of each segment are related to the time period information;
a target agenda determining unit, configured to determine a target agenda topic corresponding to a target segment after receiving an operation on one of the target segments on the audio playback control bar;
and the third original text positioning and displaying unit is used for positioning the paragraph corresponding to the target agenda subject in the original text displaying area to be displayed at the position of the original text, and positioning the corresponding audio data to the time section corresponding to the target agenda subject to be played.
In addition, the present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method described in any of the preceding method embodiments.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
Fig. 6 schematically shows an architecture of an electronic device, which may specifically include a processor 610, a video display adapter 611, a disk drive 612, an input/output interface 613, a network interface 614, and a memory 620. The processor 610, the video display adapter 611, the disk drive 612, the input/output interface 613, the network interface 614, and the memory 620 may be communicatively connected by a communication bus 630.
The processor 610 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the present application.
The memory 620 may be implemented in the form of ROM (read-only memory), RAM (random access memory), a static storage device, a dynamic storage device, or the like. The memory 620 may store an operating system 621 for controlling the operation of the electronic device 600, and a basic input/output system (BIOS) for controlling low-level operations of the electronic device 600. In addition, a web browser 623, a data storage management system 624, a summary interaction processing system 625, and the like may also be stored. The summary interaction processing system 625 may be an application program that implements the operations of the foregoing steps in the embodiment of the present application. In short, when the technical solution provided by the present application is implemented in software or firmware, the relevant program code is stored in the memory 620 and invoked for execution by the processor 610.
The input/output interface 613 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 614 is used for connecting a communication module (not shown in the figure) to realize the communication interaction between the device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 630 includes a path that transfers information between the various components of the device, such as processor 610, video display adapter 611, disk drive 612, input/output interface 613, network interface 614, and memory 620.
It should be noted that although the above device only shows the processor 610, the video display adapter 611, the disk drive 612, the input/output interface 613, the network interface 614, the memory 620, the bus 630, and so on, in a specific implementation the device may also include other components necessary for normal operation. Furthermore, it will be understood by those skilled in the art that the device described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments, which are substantially similar to the method embodiments, are described in a relatively simple manner, and reference may be made to some descriptions of the method embodiments for relevant points. The above-described system and system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement without inventive effort.
The interaction method, the interaction device, and the electronic device provided by the present application are introduced in detail, and specific examples are applied in the present application to explain the principles and embodiments of the present application, and the descriptions of the above embodiments are only used to help understand the method and the core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, the specific embodiments and the application range may be changed. In view of the above, the description should not be taken as limiting the application.

Claims (13)

1. An interaction method, comprising:
providing a conference summary interactive interface in the process of an audio and video conference, wherein the conference summary interactive interface comprises an original text display area and a summary display area, the original text display area is used for displaying text contents obtained by real-time conversion according to voice contents acquired in the process of the conference, and the summary display area is used for displaying the conference summary contents;
and responding to the interactive operation executed by the user through the conference summary interactive interface so as to generate the conference summary content and display the content to the summary display area.
2. The method of claim 1,
the responding to the interactive operation executed by the user through the conference summary interactive interface comprises the following steps:
receiving a text input operation performed by a user through the summary display area so as to determine the received input content as the conference summary content.
3. The method of claim 1,
the responding to the interactive operation performed by the user through the conference summary interactive interface comprises:
receiving, through the original text display area, an operation performed by the user of selecting target text content and extracting it as conference summary content, so as to determine the target text content as the conference summary content and display it in the summary display area.
4. The method of claim 1,
the responding to the interactive operation performed by the user through the conference summary interactive interface comprises:
receiving, through the original text display area, an operation performed by the user of selecting and marking target text content, and recording position information of the marked target text content in the original text;
providing, in the conference summary interactive interface, an operation option for filtering by marks;
and after receiving a filtering operation from the user through the operation option, locating and displaying, in the original text display area, the original text position corresponding to the target text content, so that the conference summary content can be entered according to the original text corresponding to the target text content.
5. The method of claim 4, further comprising:
recording position information of the marked target text content on the corresponding audio timeline;
after the audio/video conference ends, upon receiving a request to display the conference summary interactive interface, providing an audio playback control bar in the conference summary interactive interface according to the audio timeline, and displaying a mark in the audio playback control bar at the position corresponding to the position information;
and after receiving an operation on a mark in the audio playback control bar, locating and displaying, in the original text display area, the original text position corresponding to the target text content.
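Claims 4 and 5 above relate marked transcript spans both to character positions in the original text and to positions on the audio timeline, so that clicking a mark on the playback bar can scroll the transcript to the corresponding passage. Purely as an illustration (not the patented implementation), the bookkeeping could look like the following Python sketch; `Mark`, `MarkIndex`, and the one-second click tolerance are all hypothetical names and choices:

```python
from __future__ import annotations

from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mark:
    text_start: int    # character offset of the marked content in the original text
    text_end: int      # end offset of the marked span
    audio_time: float  # seconds on the audio timeline where the content was spoken


class MarkIndex:
    """Hypothetical index relating marked transcript spans to the audio timeline:
    a click near a mark on the playback bar resolves to the marked text span."""

    def __init__(self) -> None:
        self.marks: List[Mark] = []

    def add(self, mark: Mark) -> None:
        # Keep marks sorted by audio position so lookups can bisect.
        self.marks.append(mark)
        self.marks.sort(key=lambda m: m.audio_time)

    def locate(self, clicked_time: float, tolerance: float = 1.0) -> Optional[Mark]:
        """Return the mark nearest the clicked position, if within tolerance."""
        if not self.marks:
            return None
        times = [m.audio_time for m in self.marks]
        i = bisect_left(times, clicked_time)
        # Only the neighbors around the insertion point can be nearest.
        candidates = self.marks[max(0, i - 1):i + 1]
        best = min(candidates, key=lambda m: abs(m.audio_time - clicked_time))
        if abs(best.audio_time - clicked_time) <= tolerance:
            return best
        return None
```

A click handler would call `locate()` with the clicked timeline position and, on a hit, scroll the original text display area to `text_start`.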
6. The method of claim 1, further comprising:
and after the audio/video conference ends and a request to display the conference summary interactive interface is received, providing, in the conference summary interactive interface, conference summary content generated by an intelligent algorithm.
7. The method of claim 6,
the conference summary content generated by the intelligent algorithm comprises: conference keywords, an agenda directory, focus content, and/or to-do information.
8. The method of claim 7,
the agenda directory is generated by:
performing natural language understanding on the plurality of converted text contents, aggregating them by sentence relevance, and dividing them into a plurality of paragraphs;
and extracting, from each paragraph, text content representing the paragraph's topic, so as to generate agenda topics of a plurality of agenda items for display in the agenda directory.
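Claim 8 describes aggregating converted sentences by relevance into paragraphs and extracting a topic per paragraph. The patent does not specify a relevance measure; as a deliberately simplified sketch only, the following uses lexical Jaccard overlap between a sentence and the running paragraph, and takes the longest sentence as a stand-in topic (`build_agenda`, the 0.2 threshold, and the longest-sentence heuristic are all illustrative assumptions, not the claimed algorithm):

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-set similarity between two sentences/paragraphs."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def build_agenda(sentences: List[str], threshold: float = 0.2) -> List[Dict]:
    """Aggregate consecutive sentences into paragraphs by lexical relevance,
    then pick each paragraph's longest sentence as its agenda topic."""
    paragraphs: List[List[str]] = []
    for sent in sentences:
        words = set(sent.lower().split())
        if paragraphs:
            para_words = set(" ".join(paragraphs[-1]).lower().split())
            if jaccard(words, para_words) >= threshold:
                paragraphs[-1].append(sent)  # relevant: extend current paragraph
                continue
        paragraphs.append([sent])  # not relevant: start a new paragraph
    return [{"topic": max(p, key=len), "sentences": p} for p in paragraphs]
```

A production system would replace the word-overlap measure with a proper natural-language-understanding model (e.g. sentence embeddings) and the topic heuristic with abstractive or extractive summarization, but the aggregate-then-extract shape matches the two steps the claim recites.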
9. The method of claim 8, further comprising:
recording position information, in the original text, of the paragraph corresponding to each agenda topic;
and after receiving an operation on one of the agenda topics in the agenda directory, locating and displaying, in the original text display area, the position in the original text of the paragraph corresponding to that agenda topic.
10. The method of claim 8, further comprising:
recording time period information, on the audio timeline, of the paragraph corresponding to each agenda topic;
providing an audio playback control bar in the conference summary interactive interface according to the audio timeline, and displaying the audio playback control bar in segments, wherein the position and length of each segment are related to the time period information;
after receiving an operation on a target segment of the audio playback control bar, determining the target agenda topic corresponding to the target segment;
and locating and displaying, in the original text display area, the position in the original text of the paragraph corresponding to the target agenda topic, and locating the corresponding audio data at the time period corresponding to the target agenda topic for playback.
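Claim 10's segmented playback bar needs two small mappings: laying each agenda item's time period out as a proportional segment of the bar, and resolving a click back to the target agenda topic. A minimal sketch, assuming a simple `(start_sec, end_sec, topic)` tuple per segment (the names and pixel-based layout are illustrative, not from the patent):

```python
from typing import List, Optional, Tuple

# Each agenda segment: (start_sec, end_sec, agenda_topic). Names are illustrative.
Segment = Tuple[float, float, str]


def layout_segments(segments: List[Segment], bar_width_px: int) -> List[Tuple[int, int, str]]:
    """Map each agenda time period to pixel bounds on the playback bar, so a
    segment's on-screen position and length reflect its time period."""
    total = segments[-1][1]  # conference duration = end of the last segment
    return [(round(s / total * bar_width_px), round(e / total * bar_width_px), topic)
            for s, e, topic in segments]


def topic_at(segments: List[Segment], clicked_sec: float) -> Optional[str]:
    """Resolve a click on the bar (already converted to seconds) to the
    target agenda topic whose time period contains it."""
    for start, end, topic in segments:
        if start <= clicked_sec < end:
            return topic
    return None
```

On a hit, the interface would both scroll the original text display area to the paragraph recorded for that topic and seek the audio player to the segment's start time.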
11. An interactive apparatus, comprising:
the audio and video conference system comprises an interactive interface providing unit, a conference summary interactive interface providing unit and a conference summary interactive interface providing unit, wherein the conference summary interactive interface comprises an original text display area and a summary display area, the original text display area is used for displaying text contents obtained by real-time conversion according to voice contents acquired in the process of a conference, and the summary display area is used for displaying the conference summary contents;
and the interactive operation receiving unit is used for responding to the interactive operation executed by the user through the conference summary interactive interface so as to generate the conference summary content and display the content to the summary display area.
12. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 10.
13. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 10.
CN202210390594.6A 2022-04-14 2022-04-14 Interaction method and device and electronic equipment Pending CN114936001A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210390594.6A CN114936001A (en) 2022-04-14 2022-04-14 Interaction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114936001A true CN114936001A (en) 2022-08-23

Family

ID=82861658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210390594.6A Pending CN114936001A (en) 2022-04-14 2022-04-14 Interaction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114936001A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024093443A1 (en) * 2022-10-31 2024-05-10 北京字跳网络技术有限公司 Information display method and apparatus based on voice interaction, and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533382A (en) * 2019-07-24 2019-12-03 阿里巴巴集团控股有限公司 Processing method, device, server and the readable storage medium storing program for executing of meeting summary
CN111986677A (en) * 2020-09-02 2020-11-24 深圳壹账通智能科技有限公司 Conference summary generation method and device, computer equipment and storage medium
CN112463104A (en) * 2017-11-02 2021-03-09 谷歌有限责任公司 Automatic assistant with conference function
CN113011169A (en) * 2021-01-27 2021-06-22 北京字跳网络技术有限公司 Conference summary processing method, device, equipment and medium
CN113010704A (en) * 2020-11-18 2021-06-22 北京字跳网络技术有限公司 Interaction method, device, equipment and medium for conference summary
CN114254637A (en) * 2021-12-21 2022-03-29 科大讯飞股份有限公司 Summary generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US10911718B2 (en) Enhancing meeting participation by an interactive virtual assistant
KR101909807B1 (en) Method and apparatus for inputting information
US11483273B2 (en) Chat-based interaction with an in-meeting virtual assistant
US10613825B2 (en) Providing electronic text recommendations to a user based on what is discussed during a meeting
CN110708607B (en) Live broadcast interaction method and device, electronic equipment and storage medium
US20150066501A1 (en) Providing an electronic summary of source content
CN110602516A (en) Information interaction method and device based on live video and electronic equipment
KR102548365B1 (en) Method for generating conference record automatically and apparatus thereof
CN102984050A (en) Method, client and system for searching voices in instant messaging
US20220093103A1 (en) Method, system, and computer-readable recording medium for managing text transcript and memo for audio file
CN112468665A (en) Method, device, equipment and storage medium for generating conference summary
CN113949838A (en) Paperless conference system, method, equipment and storage medium
KR101618084B1 (en) Method and apparatus for managing minutes
CN114936001A (en) Interaction method and device and electronic equipment
KR20190065194A (en) METHOD AND APPARATUS FOR GENERATING READING DOCUMENT Of MINUTES
JPH11203295A (en) Information providing device and its method
CN110992958B (en) Content recording method, content recording apparatus, electronic device, and storage medium
WO2023235580A1 (en) Video-based chapter generation for a communication session
US20230230589A1 (en) Extracting engaging questions from a communication session
US11514913B2 (en) Collaborative content management
KR20170074015A (en) Method for editing video conference image and apparatus for executing the method
WO2023095629A1 (en) Conversation management device, conversation management system, and conversation management method
JP2008243079A (en) Information managing device, information managing method and information management program
US20230334427A1 (en) Information processing system
US20230230588A1 (en) Extracting filler words and phrases from a communication session

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination