CN113326387B - Intelligent conference information retrieval method - Google Patents
- Publication number
- CN113326387B (application CN202110603641.6A)
- Authority
- CN
- China
- Prior art keywords
- information
- conference
- text
- video
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/438—Presentation of query results
Abstract
The invention discloses an intelligent conference information retrieval method, which relates to the technical field of conference recording and comprises the following steps: recording the conference information in a multimedia mode in real time throughout the conference, extracting the audio stream of the conference video content, sending the audio stream to a voice recognition module that converts the speech into text information, storing the text information, and marking it according to the conference progress time; the user then inputs text or voice information as a query, which is matched against the previously stored conference information, and the corresponding audio or video information is returned. Because the conference information is recorded in a multimedia mode and the retrieval query is matched against the conference information through multi-level processing, when a conference record is matched, the information on the timeline where the corresponding record is located is displayed and the audio of the matched conference speech is played at the same time, so that later analysis and understanding of the conference are more convenient and the conference record retrieval experience is greatly improved.
Description
Technical Field
The invention relates to the technical field of conference recording, in particular to an intelligent conference information retrieval method.
Background
As technology advances, many products that automatically record conference content have been launched, from the earliest voice recorders to automated speech-to-text equipment. These methods capture a great deal of content, often spanning several hours, which makes reviewing or retrieving a meeting record time-consuming and laborious. Some advanced products label conference participants by biometric features such as voiceprints and fingerprints and then quickly locate conference recording content through these labels, or even label recordings with geographic information and administrative levels. However, these products remain inconvenient: conference records cannot be queried and retrieved by content, the query modes are limited, and recordings can only be reviewed and listened to manually, without quick positioning.
Disclosure of Invention
The invention aims to provide an intelligent conference information retrieval method to overcome the defects in the prior art.
In order to achieve the above purpose, the invention provides the following technical scheme. An intelligent conference information retrieval method comprises the following steps:
step one, recording conference information in a multimedia mode in real time throughout the conference, including archiving the whole conference in the forms of video, audio, text and the like;
step two, extracting the audio stream of the conference video content: the audio stream is copied out of the media file or streaming-file container by demultiplexing (demux), so that the audio is separated from the video stream while the original video file remains unchanged, and the audio stream is sent to a voice recognition module, which converts the speech into text information and stores it;
step three, marking the video, audio and text of the conference record according to the conference progress time, using speech detection or silence detection as the basis for judging the start and end of each segment, and further combining NLP (natural language processing) context-judgment techniques, including but not limited to finer-grained SBD (Sentence Boundary Detection) and WS (Word Segmentation), to process the speech content by sentence or by word; marks are added by sentence and by word respectively, and the processed conference record content is stored;
step four, the user retrieves the conference record by inputting text information or voice information as a query; if voice information is received, it is converted into text through the voice-to-text module; the text is then matched against the previously stored conference information, and the corresponding audio or video information is returned, together with the text information converted from the speech;
step five, when the user views the returned result, the recorded content of the surrounding context can be retrieved quickly, i.e., the user can simultaneously view the conference information before and after the retrieved time period; the matched content is highlighted and presented to the user as text, audio or video information, and the user can intuitively locate, select and modify the corresponding content.
Preferably, in step one, if the conference is a network video conference, the conference information is obtained directly through the network; if it is a non-network conference, the conference is recorded through multimedia devices such as audio and video recorders, and then extracted and converted.
Preferably, the text information converted from speech in step two can also be used to display and record real-time conference subtitles while it is being stored.
Preferably, the time intervals marked in step three are delimited by sentences of speech content or by pauses in the audio.
Preferably, the marked Video Segments, Audio Segments and Text Segments in step three are stored in one-to-one correspondence through time-sequence lists: the video segments are recorded in time order in a list VSRL (Video Segments Recording List), the audio segments are recorded in time order in a list SSRL (Speech Segments Recording List), and the text segment information is recorded in time order in a list TSRL (Text Segments Recording List).
Preferably, the matching process in the fourth step includes the following steps:
step a, first-level text matching: the text information generated from the user's search is matched against the text information stored in the TSRL; if a match is found, the audio information of the corresponding time period is returned, and if corresponding video information exists, the video information of that time period is returned directly;
step b, second-level text matching: if the first level fails to match, the text information is reduced to a smaller granularity through SBD and matched again; if a match is found, the corresponding audio or video information is returned;
step c, third-level processing: if the second level fails to match, the information is decomposed into an even smaller granularity through WS and matched again; if a match is found, the corresponding audio or video information is returned; otherwise the query information cannot be matched.
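The three-level matching in steps a to c can be sketched as follows. This is an illustrative Python sketch, not part of the patent: the TSRL is assumed to be a list of (sequence number, text) pairs, and simple punctuation and whitespace splitting stand in for real SBD and WS (Chinese text would need a proper word segmenter).

```python
# Illustrative sketch of the three-level query matching (steps a-c).
# Entry layout (seq_no, text) and helper names are assumptions, not
# part of the patent; real systems would use NLP-based SBD and WS.
import re

def split_sentences(query):
    # Stand-in for SBD: split on common sentence-ending punctuation.
    return [s for s in re.split(r"[.!?;\u3002\uff1b]+", query) if s.strip()]

def split_words(query):
    # Stand-in for WS: whitespace tokenization.
    return [w for w in query.split() if w]

def match_query(query, tsrl):
    """Return sequence numbers of TSRL entries matching the query text."""
    # Level 1: match the whole query text against stored text segments.
    hits = [seq for seq, text in tsrl if query in text]
    if hits:
        return hits
    # Level 2: fall back to sentence-granularity fragments (SBD).
    for sentence in split_sentences(query):
        hits = [seq for seq, text in tsrl if sentence.strip() in text]
        if hits:
            return hits
    # Level 3: fall back to word-granularity fragments (WS).
    for word in split_words(query):
        hits = [seq for seq, text in tsrl if word in text]
        if hits:
            return hits
    return []  # the query cannot be matched
```

The returned sequence numbers can then be used to fetch the corresponding SSRL and VSRL entries for the matched time periods.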
In the technical scheme, the invention provides the following technical effects and advantages:
the invention records the conference information in a multimedia mode, marks and stores the video, audio and text recorded by the conference according to the conference proceeding time, the user searches and matches the text information, matches and inquires the searched information and the conference information through multi-stage processing, when the conference record is matched, the information of the time axis where the corresponding record is located is displayed, the user can select the text information through the interactive equipment, the corresponding audio is highlighted, and simultaneously the audio information which enables the user to intuitively hear the speaking of the current conference is played, the user can randomly select any paragraph in the text module, the corresponding audio or video can be synchronously positioned and played, otherwise, the user can quickly search the audio or video content, the corresponding text information can be immediately displayed, thereby the later analysis and understanding of the conference are more convenient, and the conference record searching experience is greatly improved.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of a query matching process of the present invention.
FIG. 3 is an exemplary diagram of an interaction interface when the present invention returns a result.
FIG. 4 is a diagram of another example of the interaction interface for the case where only audio and text information are returned according to the present invention.
FIG. 5 is an exemplary diagram of an interface for a user to select a query message status in the state of FIG. 4 according to the present invention.
Description of reference numerals:
A. a video information display module; B. a video information clip display module of a time axis; C. a text information display module; D. an audio information display module; E. and a time position display module.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the present invention will be further described in detail with reference to the accompanying drawings.
The invention provides an intelligent conference information retrieval method, which comprises the following steps:
step one, recording conference information in a multimedia mode in real time throughout the conference, archiving the whole conference in the forms of video, audio, text and the like; if the conference is a network video conference, the conference information is obtained directly through the network, and if it is a non-network conference, the conference is recorded through multimedia devices such as audio and video recorders and then extracted and converted;
step two, extracting the audio stream of the conference video content: the audio stream is copied out of the media file or streaming-file container by demultiplexing (demux), so that the audio is separated from the video stream while the original video file remains unchanged; the audio stream is sent to a voice recognition module, which converts the speech into text information and stores it, and the text can also be used to display and record real-time conference subtitles;
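Assuming ffmpeg as the demultiplexing tool (the patent does not name one), step two's audio extraction can be sketched as follows; `-vn -acodec copy` copies the audio stream out of the container without re-encoding, so the original video file remains unchanged. File and function names are placeholders.

```python
# Illustrative demux sketch for step two, assuming ffmpeg is available.
import subprocess

def build_demux_command(video_path, audio_path):
    # -vn drops the video stream; -acodec copy extracts the audio
    # stream as-is (demultiplexing, no re-encoding).
    return ["ffmpeg", "-i", video_path, "-vn", "-acodec", "copy", audio_path]

def extract_audio(video_path, audio_path):
    # The extracted audio file can then be fed to the voice recognition module.
    subprocess.run(build_demux_command(video_path, audio_path), check=True)
```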
step three, marking the video, audio and text of the conference record according to the conference progress time, using speech detection or silence detection as the basis for judging the start and end of each segment, with the marked time intervals delimited by sentences of speech content or by pauses in the audio; NLP (natural language processing) context-judgment techniques, including but not limited to finer-grained SBD (Sentence Boundary Detection) and WS (Word Segmentation), are further combined to process the speech content by sentence or by word, marks are added by sentence and by word respectively, and the processed conference record content is stored;
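The silence-detection basis for step three can be illustrated with a simple frame-energy threshold; the frame size, threshold and function name below are our own assumptions, and a production system would use a proper voice-activity-detection model.

```python
# Illustrative energy-based silence detection for segment start/end marks.
def segment_by_silence(samples, frame_size=4, threshold=0.1):
    """Split a mono sample list into (start, end) index ranges of speech."""
    segments, start = [], None
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:           # speech frame: open a segment
            if start is None:
                start = i
        elif start is not None:          # silence frame closes a segment
            segments.append((start, i))
            start = None
    if start is not None:                # speech ran to the end of the audio
        segments.append((start, len(samples)))
    return segments
```

Each returned range would then receive a mark and become one entry in the time-sequence lists described below.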
the marked Video Segments, audio Segments and Text fields are respectively stored in a one-to-one correspondence manner by setting a time sequence table, wherein the Video Segments are recorded in a List VSRL (Video Segments Recording List) according to the time sequence, the audio Segments are recorded in a List SSRL (speed Segments Recording List) according to the time sequence, the Text segment information is recorded in a List TSRL (Text Segments Recording List) according to the time sequence, and the structures of the VSRL, the SSRL and the TSRL are respectively shown in a table 1, a table 2 and a table 3:
Table 1. VSRL example
Sequence No. | Time Offset | Duration | SegmentsURL |
0 | 00:00:00.000 | 1000 | VS001.mp4 |
1 | 00:00:01.000 | 1000 | VS002.mp4 |
2 | 00:00:02.000 | 1500 | VS003.mp4 |
… | … | … | … |
Wherein:
Sequence No. represents the mark serial number; it is the unique key of the mark relation table and corresponds to the entries in the SSRL and TSRL;
Time Offset represents the offset of the current segment from the beginning of the entire video;
Duration represents the time length of the current segment in milliseconds (ms);
SegmentsURL indicates the URL of the video file storing the current segment; a streaming media player can play the corresponding video directly from this URL. In actual use, the address should be further encrypted to improve data security.
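As a small illustration of the Time Offset field, a helper (our own, not from the patent) can convert the "HH:MM:SS.mmm" strings in Table 1 into milliseconds for locating segments on the timeline:

```python
# Convert a Table 1 "HH:MM:SS.mmm" Time Offset into milliseconds.
def offset_to_ms(time_offset):
    hours, minutes, seconds = time_offset.split(":")
    return int((int(hours) * 3600 + int(minutes) * 60 + float(seconds)) * 1000)
```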
Table 2. SSRL example
Sequence No. | Time Offset | Duration | SegmentsURL |
0 | 00:00:00.000 | 1000 | SS001.wav |
1 | 00:00:01.000 | 1000 | SS002.wav |
2 | 00:00:02.000 | 1500 | SS003.wav |
… | … | … | … |
Wherein:
Sequence No. indicates the mark serial number, the same as in the VSRL;
Time Offset represents the offset of the current segment from the beginning of the entire recording;
Duration represents the time length of the current segment in milliseconds (ms);
SegmentsURL indicates the URL of the audio file storing the current segment; a streaming media player can play the corresponding audio directly from this URL. In actual use, the address should be further encrypted to improve data security.
Wherein Sequence No.(VSRL) = Sequence No.(SSRL) = Sequence No.(TSRL).
Table 3. TSRL example
Wherein:
Sequence No. indicates the mark serial number, the same as in the VSRL;
Original Language Code represents the language of the original text, expressed according to the ISO 639-1 standard, e.g., en for English, zh for Chinese;
Code Page represents the character set of the text encoding, e.g., 1209 for UTF-8 Unicode;
Characters represents the URL of the file where the text is stored.
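The one-to-one correspondence of the three lists through a shared Sequence No. can be sketched as follows; the class and function names are illustrative, and the fields mirror Tables 1 to 3 (for the TSRL, the URL field plays the role of Characters):

```python
# Illustrative sketch of the VSRL/SSRL/TSRL one-to-one correspondence,
# i.e. Sequence No.(VSRL) = Sequence No.(SSRL) = Sequence No.(TSRL).
from dataclasses import dataclass

@dataclass
class Segment:
    seq_no: int        # Sequence No., shared across the three lists
    time_offset: str   # "HH:MM:SS.mmm" offset from the start
    duration_ms: int   # segment length in milliseconds
    url: str           # SegmentsURL (or Characters file URL for TSRL)

def lookup(seq_no, vsrl, ssrl, tsrl):
    """Return the video, audio and text segments sharing one sequence number."""
    by_seq = lambda lst: {s.seq_no: s for s in lst}
    return (by_seq(vsrl).get(seq_no),
            by_seq(ssrl).get(seq_no),
            by_seq(tsrl).get(seq_no))
```

With this layout, a match found in the TSRL immediately yields the audio and video segments of the same time period via the shared sequence number.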
step four, the user retrieves the conference record by inputting text information or voice information as a query; if voice information is received, it is converted into text through the voice-to-text module; the text is then matched against the previously stored conference information, and the corresponding audio or video information is returned, together with the text information converted from the speech;
the text matching process comprises the following steps:
step a, first-level text matching: the text information generated from the user's search is matched against the text information stored in the TSRL; if a match is found, the audio information of the corresponding time period is returned, and if corresponding video information exists, the video information of that time period is returned directly;
step b, second-level text matching: if the first level fails to match, the text information is reduced to a smaller granularity through SBD and matched again; if a match is found, the corresponding audio or video information is returned;
step c, third-level processing: if the second level fails to match, the information is decomposed into an even smaller granularity through WS and matched again; if a match is found, the corresponding audio or video information is returned; otherwise the query information cannot be matched.
step five, when the user views the returned result, the recorded content of the surrounding context can be retrieved quickly, i.e., the user can simultaneously view the conference information before and after the retrieved time period; the matched content is highlighted and presented to the user as text, audio or video information, and the user can intuitively locate, select and modify the corresponding content.
In summary, the invention records the conference in a multimedia mode, archiving the whole conference as video, audio, text and the like; the audio stream is sent to a voice recognition module that converts the speech into text information; the video, audio and text of the conference record are marked according to the conference progress time and stored in one-to-one correspondence on the basis of these time marks. The user queries by inputting text or voice information; if voice information is received, it is converted into text through the voice-to-text module and matched against the conference information through multi-level processing. When a conference record is matched, the information on the corresponding timeline is displayed, including the surrounding paragraphs. The user can select text information through an interactive device such as a mouse or a touch screen; the selected text is highlighted, the corresponding audio is indicated as well, and the audio of the matched conference speech is played at the same time; if corresponding video information exists, it is also shown. The user can select any paragraph in the text module at will, and the corresponding audio or video is located and played synchronously; conversely, when the user browses the audio or video content, the corresponding text information is displayed immediately. Later analysis and understanding of the conference are therefore more convenient, and the conference record retrieval experience is greatly improved.
While certain exemplary embodiments of the present invention have been described above by way of illustration only, it will be apparent to those of ordinary skill in the art that the described embodiments may be modified in various different ways without departing from the spirit and scope of the invention. Accordingly, the drawings and description are illustrative in nature and should not be construed as limiting the scope of the invention.
Claims (5)
1. An intelligent conference information retrieval method is characterized by comprising the following steps:
step one, recording conference information in a multimedia mode in real time throughout the conference, including archiving the whole conference in the forms of video, audio, text and the like;
step two, extracting the audio stream of the conference video content: the audio stream is copied out of the media file or streaming-file container by demultiplexing (demux), so that the audio is separated from the video stream while the original video file remains unchanged, and the audio stream is sent to a voice recognition module, which converts the speech into text information and stores it;
step three, marking the video, audio and text of the conference record according to the conference progress time, using speech detection or silence detection as the basis for judging the start and end of each segment, further combining NLP (natural language processing) context-judgment techniques, including but not limited to finer-grained SBD (Sentence Boundary Detection) and WS (Word Segmentation), to process the speech content by sentence or by word, and adding marks to the processed conference record content by sentence and by word and storing it;
step four, the user retrieves the conference record by inputting text information or voice information as a query; if voice information is received, it is converted into text through the voice-to-text module; the text is then matched against the previously stored conference information, and the corresponding audio or video information is returned, together with the text information converted from the speech, wherein the matching process comprises the following steps:
step a, first-level text matching: the text information generated from the user's search is matched against the text information stored in the TSRL; if a match is found, the audio information of the corresponding time period is returned, and if corresponding video information exists, the video information of that time period is returned directly;
step b, second-level text matching: if the first level fails to match, the text information is reduced to a smaller granularity through SBD and matched again; if a match is found, the corresponding audio or video information is returned;
step c, third-level processing: if the second level fails to match, the information is decomposed into an even smaller granularity through WS and matched again; if a match is found, the corresponding audio or video information is returned; otherwise the query information cannot be matched;
step five, when the user views the returned result, the recorded content of the surrounding context can be retrieved quickly, i.e., the user can simultaneously view the conference information before and after the retrieved time period; the matched content is highlighted and presented to the user as text, audio or video information, and the user can intuitively locate, select and modify the corresponding content.
2. The intelligent conference information retrieval method according to claim 1, wherein: in step one, if the conference is a network video conference, the conference information is obtained directly through the network, and if it is a non-network conference, the conference is recorded through multimedia devices such as audio and video recorders, and then extracted and converted.
3. The intelligent conference information retrieval method according to claim 1, wherein: the text information converted from speech in step two can also be used to display and record real-time conference subtitles while it is being stored.
4. The intelligent conference information retrieval method according to claim 1, wherein: the time intervals marked in step three are delimited by sentences of speech content or by pauses in the audio.
5. The intelligent conference information retrieval method according to claim 1, wherein: the marked Video Segments, Audio Segments and Text Segments in step three are stored in one-to-one correspondence through time-sequence lists, wherein the video segments are recorded in time order in a list VSRL (Video Segments Recording List), the audio segments are recorded in time order in a list SSRL (Speech Segments Recording List), and the text segment information is recorded in time order in a list TSRL (Text Segments Recording List).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110603641.6A CN113326387B (en) | 2021-05-31 | 2021-05-31 | Intelligent conference information retrieval method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113326387A CN113326387A (en) | 2021-08-31 |
CN113326387B (en) | 2022-12-13
Family
ID=77422786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110603641.6A Active CN113326387B (en) | 2021-05-31 | 2021-05-31 | Intelligent conference information retrieval method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113326387B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116193179A (en) * | 2021-11-26 | 2023-05-30 | 华为技术有限公司 | Conference recording method, terminal equipment and conference recording system |
CN114661943A (en) * | 2022-05-21 | 2022-06-24 | 中科云策(深圳)科技成果转化信息技术有限公司 | Conference information storage management system |
CN115828907B (en) * | 2023-02-16 | 2023-04-25 | 南昌航天广信科技有限责任公司 | Intelligent conference management method, system, readable storage medium and computer device |
CN116708055B (en) * | 2023-06-06 | 2024-02-20 | 深圳市艾姆诗电商股份有限公司 | Intelligent multimedia audiovisual image processing method, system and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105045828A (en) * | 2015-06-26 | 2015-11-11 | 徐信 | Retrieval system and method for accurate positioning of audio/video speech information |
CN108345679A (en) * | 2018-02-26 | 2018-07-31 | 科大讯飞股份有限公司 | A kind of audio and video search method, device, equipment and readable storage medium storing program for executing |
CN111814028A (en) * | 2020-09-14 | 2020-10-23 | 腾讯科技(深圳)有限公司 | Information searching method and device |
CN112765460A (en) * | 2021-01-08 | 2021-05-07 | 北京字跳网络技术有限公司 | Conference information query method, device, storage medium, terminal device and server |
CN112839195A (en) * | 2020-12-30 | 2021-05-25 | 深圳市皓丽智能科技有限公司 | Method and device for consulting meeting record, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||