CN112905829A - Cross-modal artificial intelligence information processing system and retrieval method - Google Patents

Cross-modal artificial intelligence information processing system and retrieval method

Info

Publication number
CN112905829A
Authority
CN
China
Prior art keywords
modality
information
module
data
artificial intelligence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110320317.3A
Other languages
Chinese (zh)
Inventor
王芳
连芷萱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202110320317.3A
Publication of CN112905829A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

A cross-modal artificial intelligence information processing system and a cross-modal information retrieval method are provided. The system comprises: a separation module configured to separate first-modality information into a plurality of continuous first-modality information fragments; a feature extraction module configured to perform feature extraction on the content expressed by each first-modality information fragment to form an event map; an identification module configured to identify elements in the event map with second-modality information to form second-modality identification information; a second encoding module configured to encode the second-modality identification information to form second-modality information data; an association module configured to associate the second-modality information data with each data frame in the corresponding first-modality information fragment to generate an association identifier; a first insertion module configured to insert the association identifier into the first-modality data frame; and a second insertion module configured to insert the association identifier into the second-modality data frame.

Description

Cross-modal artificial intelligence information processing system and retrieval method
Technical Field
The invention relates to a cross-modal artificial intelligence information processing system and a retrieval method, and belongs to the technical field of artificial intelligence.
Background
In the prior art, text information can be searched over its full text by keywords, but for audio/video information there has been no way to retrieve a segment of interest from within an audio or video stream of a given length.
Disclosure of Invention
The invention aims to provide a cross-modal artificial intelligence information processing system and a retrieval method, which can quickly retrieve and reproduce cross-modal information.
To achieve the above object, the present invention provides a cross-modal artificial intelligence information processing system, comprising: a separation module configured to separate first-modality information into a plurality of continuous first-modality information fragments; a feature extraction module configured to perform feature extraction on the content expressed by each first-modality information fragment to form an event map representing the events, and the relations among them, in the content expressed by each first-modality data fragment; an identification module configured to identify elements in the event map with second-modality information to form second-modality identification information; a second encoding module configured to encode the second-modality identification information to form second-modality information data; an association module configured to associate the second-modality information data with each data frame in the corresponding first-modality information fragment to generate an association identifier; a first insertion module configured to insert the association identifier into the first-modality data frame and then store it in the first-modality information database; and a second insertion module configured to insert the association identifier into the second-modality data frame and then store it in the second-modality information database.
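The module pipeline above can be illustrated with a minimal sketch. All names (Fragment, make_association_id, build_databases) and the use of a truncated SHA-1 digest as the association identifier are illustrative assumptions, not details fixed by the patent:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Fragment:
    """One continuous first-modality fragment and its event label."""
    frames: list       # first-modality data frames
    event_label: str   # second-modality (text) identification of the event

def make_association_id(fragment_index: int, event_label: str) -> str:
    # Derive a short, stable identifier linking the two modalities
    # (a truncated SHA-1 digest here, purely for illustration).
    digest = hashlib.sha1(f"{fragment_index}:{event_label}".encode()).hexdigest()
    return digest[:8]

def build_databases(fragments):
    """Insert the association identifier into every frame of each fragment
    (first-modality DB) and into the text record (second-modality DB)."""
    first_db, second_db = [], {}
    for i, frag in enumerate(fragments):
        assoc = make_association_id(i, frag.event_label)
        first_db.extend((assoc, frame) for frame in frag.frames)
        second_db[frag.event_label] = assoc
    return first_db, second_db

fragments = [Fragment(frames=[b"f0", b"f1"], event_label="event 1"),
             Fragment(frames=[b"f2"], event_label="event 2")]
first_db, second_db = build_databases(fragments)
```

Because the same identifier is written into both stores, either modality can later be reached from the other with a single lookup.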
Preferably, the first modality information includes voice and/or video; the second modality information includes text.
Preferably, the feature extraction module comprises an event map establishing module configured to establish an event map according to contents expressed by the first modality information source and an accumulation module configured to accumulate durations of consecutive identical event maps; the separation module is further configured to separate the first-modality information according to the duration to obtain a plurality of continuous first-modality information fragments.
Preferably, the cross-modal artificial intelligence information processing system further comprises a first encoding module, and the first encoding module is configured to encode the separated first-modal information segment to generate first-modal information data.
Preferably, the first modality information includes video data; the second modality information includes text.
Preferably, the feature extraction module comprises a conversion module, an artificial intelligence module, an event map building module and an accumulation module, wherein the conversion module converts the first-modality information data into two-dimensional images; the artificial intelligence module is configured to identify feature values of each two-dimensional image frame, the feature values comprising foreground image feature values and background image feature values; the event map building module is configured to build an event map according to the relation between the primitives represented by the foreground image feature values and the primitives represented by the background image feature values of each image frame; the accumulation module is configured to accumulate the durations of consecutive identical event maps; and the separation module is further configured to divide the first-modality information into a plurality of continuous first-modality information segments according to these durations.
In order to achieve the above object, the present invention further provides a method for cross-modal information retrieval using the above system, comprising the steps of: searching the second-modality information database for the second-modality data corresponding to input second-modality information; extracting the association head of the second-modality data; and retrieving the first-modality information data frames from the first-modality information database according to the association head, and reproducing the first-modality information from those data frames.
Compared with the prior art, the invention aims to provide a cross-modal artificial intelligence information processing system and a retrieval method, which can quickly perform cross-modal information retrieval.
Drawings
FIG. 1 is a block diagram of a cross-modal artificial intelligence information handling system provided in a first embodiment of the present invention;
FIG. 2 is a schematic diagram showing the separation of information in a first modality into a plurality of information fragments;
FIG. 3 is a block diagram of a first encoding module in a cross-modal artificial intelligence information handling system, according to an embodiment of the present invention;
FIG. 4 is a block diagram of an inter-frame prediction processing module according to an embodiment of the present invention;
FIG. 5 is a block diagram of a cross-modal artificial intelligence information handling system provided by a second embodiment of the present invention;
FIG. 6 is a flowchart of a cross-modal information retrieval method provided by the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
First embodiment
Fig. 1 is a block diagram of a cross-modality artificial intelligence information processing system according to the first embodiment of the present invention. As shown in Fig. 1, the system of the first embodiment includes: a first-modality information source 510, for example an audio information source acquired by an acoustic-electric converter or an image information source acquired by a photoelectric converter; a separating module 520 configured to separate the first-modality information into a plurality of consecutive first-modality information fragments; a feature extraction module configured to extract features from the content expressed by each first-modality information fragment to form an event map representing the events, and the relations among them, in the content expressed by each first-modality data fragment, wherein the event map is organized as a tree structure and each node in the tree is called an element; an identifying module 580 configured to identify elements in the event map with second-modality information to form second-modality identification information; a second encoding module 590 configured to encode the second-modality identification information to form second-modality information data, that is, to encode the second-modality information as a character string, which may be a binary string; an association module 570 configured to associate the second-modality information data with the corresponding first-modality information fragments to generate association identifiers (or association pointers); a first inserting module 540 configured to insert the association identifier into each data frame of the first-modality information data fragment and then either store it in the first-modality information database or send it to a channel encoder and, after channel encoding, to the communication unit; and a second insertion module 600 configured to insert the association identifier into the second-modality data frame and then either store it in the second-modality information database or send it to the channel encoder and, after channel encoding, to the communication unit.
In the first embodiment, the first-modality information includes voice and/or video, the voice including speech in multiple languages, dialects, and the like; the second-modality information includes text, the text including words in multiple languages.
In the first embodiment, each data frame of the first-modality information data has the following format:
[first-modality information data head | first-modality information data]
Each data frame of the second-modality information data has the following format:
[second-modality information data head | second-modality information data]
A first-modality information data frame with the association head inserted has the following format:
[association head | first-modality information data head | first-modality information data]
A second-modality information data frame with the association head inserted has the following format:
[association head | second-modality information data head | second-modality information data]
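The frame layouts above can be sketched as simple byte packing. The 4-byte field widths and big-endian byte order are illustrative assumptions; the patent does not fix them:

```python
import struct

def pack_frame(assoc_head: int, data_head: bytes, payload: bytes) -> bytes:
    # [4-byte association head][4-byte data head][payload]
    assert len(data_head) == 4
    return struct.pack(">I", assoc_head) + data_head + payload

def read_assoc_head(frame: bytes) -> int:
    # Retrieval only inspects the association head; the payload stays opaque.
    return struct.unpack(">I", frame[:4])[0]

frame = pack_frame(0x2A, b"VID0", b"\x00\x01\x02")
```

Prepending the association head, rather than embedding it in the payload, lets the retrieval step read the head without decoding the modality data.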
In the first embodiment, the feature extraction module comprises an event map establishing module 550 and an accumulating module 560. The event map establishing module 550 is configured to establish an event map according to the content expressed by the first-modality information source, and the accumulating module 560 is configured to accumulate the durations of consecutive identical event maps, that is, the time period during which the first-modality information source expresses the same event; the separating module 520 is further configured to separate the first-modality information according to these durations to obtain consecutive first-modality information fragments. As shown in Fig. 2, video information of a set duration T expresses four events (event 1, event 2, event 3 and event 4), and the separating module divides the video into four segments of durations T1, T2, T3 and T4, respectively. Preferably, each event can be further subdivided according to the different content it expresses.
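The accumulate-and-separate step (durations T1 to T4 in Fig. 2) can be sketched as follows, assuming each frame already carries an event label; the names and the per-frame duration are illustrative:

```python
from itertools import groupby

def segment_by_event(frame_events, frame_duration=1.0):
    """frame_events: per-frame event labels in time order.
    Returns (event, duration) pairs, one per continuous fragment."""
    return [(event, sum(1 for _ in run) * frame_duration)
            for event, run in groupby(frame_events)]

# Eight frames expressing four consecutive events, as in Fig. 2.
events = ["e1", "e1", "e2", "e2", "e2", "e3", "e4", "e4"]
fragments = segment_by_event(events)   # durations T1=2, T2=3, T3=1, T4=2
```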
In the first embodiment, the cross-modal artificial intelligence information processing system further includes a first encoding module 530, which is configured to encode the separated first-modal information segment to generate first-modal information data. In the present invention, when the first mode information is video information, the first encoding module adopts the structural form shown in fig. 3 to 4.
FIG. 3 is a block diagram of the first encoding module according to the present invention. As shown in Fig. 3, in the first encoding module, the prediction residual signal generation module 103 takes the difference between the input video signal and the prediction signal output by the inter-prediction processing module 102 and outputs it as a prediction residual signal. The transform module 104 applies an orthogonal transform, such as the discrete cosine transform, to the prediction residual signal, quantizes the transform coefficients, and outputs the quantized transform coefficients. The entropy coding module 105 entropy-codes the quantized transform coefficients and outputs them as a coded stream. The quantized transform coefficients are also fed to the inverse transform module 106, where inverse quantization and the inverse orthogonal transform are performed to recover a prediction residual signal. The decoded video signal generation module 107 adds this prediction residual signal to the prediction signal output by the inter-prediction processing module 102 to generate the decoded video signal of the block being encoded. The decoded video signal is output to the loop filter processing module 108 so that it can serve as a reference image in the inter-prediction processing module 102. The loop filter processing module 108 performs filtering to reduce coding distortion and outputs the filtered image to the inter-prediction processing module 102 as the decoded video signal.
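The closed coding loop of modules 103, 104, 106 and 107 can be illustrated numerically. This toy sketch keeps only prediction, quantization, inverse quantization and reconstruction, omitting the orthogonal transform and entropy coding; the quantization step is an assumption:

```python
QSTEP = 4  # illustrative quantization step

def encode_block(signal, prediction):
    # Module 103: prediction residual; module 104: quantization
    # (the orthogonal transform is omitted in this sketch).
    return [round((s - p) / QSTEP) for s, p in zip(signal, prediction)]

def reconstruct_block(quantized, prediction):
    # Module 106: inverse quantization; module 107: add back the prediction,
    # giving the same decoded signal a decoder would produce.
    return [p + q * QSTEP for p, q in zip(prediction, quantized)]

signal = [10, 22, 35, 41]
prediction = [8, 20, 30, 44]
quantized = encode_block(signal, prediction)
decoded = reconstruct_block(quantized, prediction)
```

Reconstructing inside the encoder keeps its reference frames identical to the decoder's, which is why the loop feeds module 107's output back to module 102.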
Fig. 4 is a block diagram of the inter-prediction processing module 102 according to the present invention. As shown in Fig. 4, the inter-prediction processing module 102 includes a reduced image generation unit 291, a pre-search processing unit 292, a first mode decision unit 293, an integer pixel search processing unit 294, a fractional image generation unit 295, a fractional pixel search processing unit 296, and a second mode decision unit 297. The reduced image generation unit 291 receives the current-frame image signal and the previous-frame image signal, performs reduction processing using, for example, a convolutional neural network (CNN), and outputs the reduced signals. The pre-search processing unit 292 takes the reduced current-frame and previous-frame image signals, performs motion search on the reduced current-frame image signal, and passes the resulting motion vector to the integer pixel search processing unit 294. The first mode decision unit 293 receives encoding mode information from the pre-search processing unit 292. The integer pixel search processing unit 294 performs integer-pixel search according to the motion vector and the encoding mode. The fractional image generation unit 295 generates a fractional-pixel interpolation image at the corresponding previous-frame image position and outputs it to the fractional pixel search processing unit 296. The second mode decision unit 297 receives encoding mode information from the integer pixel search processing unit 294 and passes it to the fractional pixel search processing unit 296. The fractional pixel search processing unit 296 performs fractional-pixel search using the motion vector and the encoding mode specified by the integer pixel search processing unit 294 and the second mode decision unit 297, respectively.
The fractional pixel search processing unit 296 outputs the searched prediction residual image and motion vector information, and a feature value is extracted from them. Through the above scheme, the first embodiment of the present invention improves coding efficiency.
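The coarse-to-fine search (pre-search on reduced images seeding the integer-pixel search) can be sketched in one dimension. The signals, the 2x reduction factor, and the refinement window are illustrative assumptions:

```python
def sad(a, b):
    # Sum of absolute differences between two equal-length blocks.
    return sum(abs(x - y) for x, y in zip(a, b))

def search(block, ref, candidates):
    # Return the candidate offset minimizing the SAD.
    n = len(block)
    return min(candidates, key=lambda d: sad(block, ref[d:d + n]))

ref = [0, 0, 1, 5, 9, 5, 1, 0, 0, 0, 0, 0]   # previous-frame row
cur = [5, 9, 5, 1]                            # current block (best match at offset 3)

# Pre-search (unit 292) on 2x-reduced signals...
coarse = search(cur[::2], ref[::2], range(len(ref[::2]) - 1))
# ...then integer-pixel refinement (unit 294) around the up-scaled vector.
lo = max(0, 2 * coarse - 1)
hi = min(len(ref) - len(cur), 2 * coarse + 1)
fine = search(cur, ref, range(lo, hi + 1))
```

Searching the reduced signal first shrinks the candidate set for the full-resolution pass, which is the source of the efficiency gain the paragraph describes.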
Second embodiment
FIG. 5 is a block diagram of a cross-modal artificial intelligence information processing system according to the second embodiment of the present invention. As shown in FIG. 5, the system of the second embodiment includes: a first-modality data source 310 configured to acquire first-modality information data from a plurality of information sources, such as audio and/or video data acquired through a channel decoder or over a network; the first-modality information data has a plurality of time-series data frames and, when reproduced, for example via a display component, expresses one or more events; a separating module 320 configured to separate the first-modality information data into a plurality of consecutive segments, each segment having a plurality of time-series data frames; a feature extraction module configured to extract features from the content expressed by each segment of first-modality information data to form an event map representing the events, and the relations among them, expressed when each segment is reproduced; an identification module 370 configured to identify the elements in the event map with second-modality information to form second-modality identification information; a second encoding module 390 configured to encode the second-modality identification information to form second-modality information data; an association module 380 configured to associate the second-modality information data with each data frame in the corresponding first-modality information data segment to generate an association identifier; a first inserting module 340 configured to insert the association identifier into the first-modality information data frame and then either store it in the first-modality information database or send it to the channel encoder and, after channel encoding, to the communication unit; and a second insertion module 400 configured to insert the association identifier into the second-modality data frame and then either store it in the second-modality information database or send it to the channel encoder and, after channel encoding, to the communication unit.
In the second embodiment, the first-modality information includes voice data and/or video data; the second-modality information includes text.
In the second embodiment, the feature extraction module includes a conversion module 330, an artificial intelligence module 340, an event map creation module 350 and an accumulation module 360. The conversion module 330 converts the first-modality information data into two-dimensional images in time series; the artificial intelligence module is configured to identify deep image feature values of each two-dimensional image frame, the deep image feature values comprising a background image feature value and a plurality of foreground image feature values; the event map creation module 350 is configured to create an event map according to the relations between the primitives represented by the plurality of foreground image feature values of each image frame and the primitives represented by the background image feature value; the accumulation module 360 is configured to accumulate the durations of consecutive identical event maps; and the separating module 320 is further configured to separate the first-modality information according to these durations to obtain a plurality of consecutive first-modality information fragments.
In the second embodiment, each data frame of the first-modality information data has the following format:
[first-modality information data head | first-modality information data]
Each data frame of the second-modality information data has the following format:
[second-modality information data head | second-modality information data]
A first-modality information data frame with the association head inserted has the following format:
[association head | first-modality information data head | first-modality information data]
A second-modality information data frame with the association head inserted has the following format:
[association head | second-modality information data head | second-modality information data]
In the second embodiment, the artificial intelligence module includes a convolutional neural network (CNN) configured to classify an input image into background image feature values and foreground image feature values, and to further classify the foreground image feature values into a plurality of foreground primitive feature values. The convolutional neural network is applied to image recognition, recognizing a predetermined shape or pattern from image data supplied as input, and has an intermediate layer and a fully-connected layer. The intermediate layer is formed by hierarchically connecting a plurality of feature extraction processing layers and includes convolutional layers and pooling layers.
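The two intermediate-layer operations named above can be sketched in pure Python: a valid-padding 2D convolution and 2x2 max pooling. The image and kernel values are illustrative; this shows the layer types only, not the patent's actual network:

```python
def conv2d(image, kernel):
    # Valid-padding 2D convolution (no flipping; cross-correlation form).
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]

def max_pool2x2(fmap):
    # 2x2 max pooling with stride 2.
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 2],
         [1, 0, 1, 3]]
kernel = [[1, -1], [-1, 1]]       # illustrative 2x2 feature detector
fmap = conv2d(image, kernel)      # 3x3 feature map
pooled = max_pool2x2(fmap)
```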
Fig. 6 is a flowchart of the artificial intelligence cross-modal information retrieval method provided by the present invention. As shown in Fig. 6, the method for cross-modal information retrieval using the system provided by the present invention includes the following steps: searching the second-modality information database for the second-modality information data corresponding to second-modality information (such as a text keyword) input by a user; extracting the association head of the second-modality information data; and retrieving first-modality information data (such as a video or audio data stream) from the first-modality information database according to the association head, and reproducing the first-modality information from the retrieved data, for example reproducing images on a display device and sounds through a loudspeaker.
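The three steps of Fig. 6 can be sketched with dictionary- and list-backed stand-ins for the two databases; all layouts and values are illustrative assumptions:

```python
# Stand-ins for the two databases: text record -> association head, and
# (association head, data frame) pairs.
second_db = {"event 2": "a1b2"}
first_db = [("a1b2", b"frame-0"), ("a1b2", b"frame-1"),
            ("ffff", b"other")]

def retrieve(keyword):
    # Step 1: look up the second-modality record for the input keyword.
    assoc = second_db.get(keyword)
    if assoc is None:
        return []
    # Steps 2-3: extract its association head and collect the first-modality
    # frames carrying the same head, ready for reproduction (display/speaker).
    return [frame for head, frame in first_db if head == assoc]

frames = retrieve("event 2")
```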
When the technical scheme provided by the invention is used to search for a text keyword, the associated audio/video data segments can be found quickly via the event map and reproduced directly, without processing the entire audio/video stream; cross-modal information retrieval is thus realized and retrieval efficiency is improved. At the same time, the user can watch just the video and/or listen to just the audio clips of interest, without attending to the parts they do not care about, which improves the user's time utilization.
The present invention can be realized by a computer that implements the embodiments described above, or by recording a program implementing the embodiments on a computer-readable recording medium and causing a computer system to read and execute the program recorded on that medium. The term "computer system" as used herein includes an OS and hardware such as peripheral devices. A "computer-readable recording medium" refers to a removable medium such as a flexible disk, a magneto-optical disk, a ROM or a CD-ROM, or a storage device such as a hard disk incorporated in a computer system.
Further, the "computer-readable recording medium" may include a medium that dynamically holds the program for a short time, for example, a communication line that transmits the program through a network such as the internet or a communication line such as a telephone line, or may include a medium that holds the program for a predetermined time, for example, a volatile memory in a computer system serving as a server or a client in this case. The program may be a program for realizing a part of the above-described functions, a program for realizing the above-described functions by combining with a program already recorded in a computer system, or a program realized by using hardware such as PLD or FPGA.
The above embodiments are only used for illustrating the present invention, and the structure, the arrangement position, the connection mode, and the like of each component can be changed, and all equivalent changes and improvements based on the technical scheme of the present invention should not be excluded from the protection scope of the present invention.

Claims (7)

1. A cross-modal artificial intelligence information processing system, comprising: a separation module configured to separate first-modality information into a plurality of continuous first-modality information fragments; a feature extraction module configured to perform feature extraction on the content expressed by each first-modality information fragment to form an event map representing the events, and the relations among them, in the content expressed by each first-modality data fragment; an identification module configured to identify elements in the event map with second-modality information to form second-modality identification information; a second encoding module configured to encode the second-modality identification information to form second-modality information data; an association module configured to associate the second-modality information data with each data frame in the corresponding first-modality information fragment to generate an association identifier; a first insertion module configured to insert the association identifier into the first-modality data frame and then store it in the first-modality information database; and a second insertion module configured to insert the association identifier into the second-modality data frame and then store it in the second-modality information database.
2. The cross-modality artificial intelligence information handling system of claim 1 wherein the first modality information includes voice and/or video; the second modality information includes text.
3. The cross-modality artificial intelligence information handling system of claim 2 wherein the feature extraction module includes an event map creation module configured to create an event map from content expressed by the first modality information source and an accumulation module configured to accumulate durations of consecutive identical event maps; the separation module is further configured to separate the first-modality information according to the duration to obtain a plurality of continuous first-modality information fragments.
4. A cross-modality artificial intelligence information handling system according to claim 3, further comprising a first encoding module for encoding the separated first-modality information fragments to generate first-modality information data.
5. The cross-modality artificial intelligence information handling system of claim 1 wherein the first modality information includes video data; the second modality information includes text.
6. The cross-modality artificial intelligence information processing system of claim 5, wherein the feature extraction module comprises a conversion module, an artificial intelligence module, an event map creation module, and an accumulation module, wherein the conversion module converts the first-modality information data into two-dimensional images; the artificial intelligence module is configured to identify feature values of each two-dimensional image frame, the feature values comprising foreground image feature values and background image feature values; the event map creation module is configured to create an event map according to the relation between the primitives represented by the foreground image feature values and the primitives represented by the background image feature values of each image frame; the accumulation module is configured to accumulate the durations of consecutive identical event maps; and the separation module is further configured to divide the first-modality information into a plurality of continuous first-modality information segments according to these durations.
7. A method for cross-modal information retrieval using the system of any one of claims 1-6, comprising the steps of:
searching a second-modality information database for second-modality data corresponding to the input second-modality information; extracting the association head of the second-modality data; retrieving first-modality information data frames from a first-modality information database according to the association head; and reproducing the first-modality information from the retrieved data frames.
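The retrieval flow of claim 7 can be sketched with dict-backed stand-ins for the two databases. The database shape, the `association_head` field, and all keys are hypothetical; the point is only the chain text query → second-modality record → association head → first-modality data frames.

```python
def retrieve_first_modality(query_text, second_db, first_db):
    """Claim-7 flow: look up second-modality data for the query, follow its
    association head into the first-modality database, return the data frames."""
    record = second_db.get(query_text)   # matching second-modality data, if any
    if record is None:
        return []
    head = record["association_head"]    # link into the first-modality database
    return first_db.get(head, [])        # data frames used to reproduce the video

second_db = {"goal scored": {"association_head": "seg-042"}}
first_db = {"seg-042": ["frame-1000", "frame-1001", "frame-1002"]}
print(retrieve_first_modality("goal scored", second_db, first_db))
```

In the patented system the association head would be stored alongside each encoded segment, so the final step is a direct keyed lookup rather than a content search.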
CN202110320317.3A 2021-03-25 2021-03-25 Cross-modal artificial intelligence information processing system and retrieval method Pending CN112905829A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110320317.3A CN112905829A (en) 2021-03-25 2021-03-25 Cross-modal artificial intelligence information processing system and retrieval method

Publications (1)

Publication Number Publication Date
CN112905829A 2021-06-04

Family

ID=76106449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110320317.3A Pending CN112905829A (en) 2021-03-25 2021-03-25 Cross-modal artificial intelligence information processing system and retrieval method

Country Status (1)

Country Link
CN (1) CN112905829A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110099195A1 (en) * 2009-10-22 2011-04-28 Chintamani Patwardhan Method and Apparatus for Video Search and Delivery
US20140328570A1 (en) * 2013-01-09 2014-11-06 Sri International Identifying, describing, and sharing salient events in images and videos
CN105430536A (en) * 2015-10-30 2016-03-23 北京奇艺世纪科技有限公司 Method and device for video push
CN108459785A (en) * 2018-01-17 2018-08-28 中国科学院软件研究所 A kind of video multi-scale visualization method and exchange method
CN109101558A (en) * 2018-07-12 2018-12-28 北京猫眼文化传媒有限公司 A kind of video retrieval method and device
WO2019176398A1 (en) * 2018-03-16 2019-09-19 ソニー株式会社 Information processing device, information processing method, and program
CA3068692A1 (en) * 2019-01-18 2020-07-18 James Carey Investigation generation in an observation and surveillance system
WO2020155423A1 (en) * 2019-01-31 2020-08-06 深圳市商汤科技有限公司 Cross-modal information retrieval method and apparatus, and storage medium
CN111680173A (en) * 2020-05-31 2020-09-18 西南电子技术研究所(中国电子科技集团公司第十研究所) CMR model for uniformly retrieving cross-media information
CN112001265A (en) * 2020-07-29 2020-11-27 北京百度网讯科技有限公司 Video event identification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111488489B (en) Video file classification method, device, medium and electronic equipment
JP2002541738A (en) Image compression
CN101539929A (en) Method for indexing TV news by utilizing computer system
WO2022188644A1 (en) Word weight generation method and apparatus, and device and medium
US9031852B2 (en) Data compression apparatus, computer-readable storage medium having stored therein data compression program, data compression system, data compression method, data decompression apparatus, data compression/decompression apparatus, and data structure of compressed data
CN113327603A (en) Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
KR20120090101A (en) Digital video fast matching system using key-frame index method
CN116233445A (en) Video encoding and decoding processing method and device, computer equipment and storage medium
CN114625918A (en) Video recommendation method, device, equipment, storage medium and program product
CN113409803B (en) Voice signal processing method, device, storage medium and equipment
CN112905829A (en) Cross-modal artificial intelligence information processing system and retrieval method
CN114333896A (en) Voice separation method, electronic device, chip and computer readable storage medium
US20220417540A1 (en) Encoding Device and Method for Utility-Driven Video Compression
CN115604475A (en) Multi-mode information source joint coding method
WO2005046213A1 (en) Document image encoding/decoding
CN102047662A (en) Encoder
CA2392644C (en) Coding and decoding apparatus of key data for graphic animation and method thereof
CN114827663A (en) Distributed live broadcast frame insertion system and method
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event
KR100348901B1 (en) Segmentation of acoustic scences in audio/video materials
CN105912615A (en) Human voice content index based audio and video file management method
JP4964114B2 (en) Encoding device, decoding device, encoding method, decoding method, encoding program, decoding program, and recording medium
CN113345446B (en) Audio processing method, device, electronic equipment and computer readable storage medium
JP4053251B2 (en) Image search system and image storage method
KR20090012927A (en) Method and apparatus for generating multimedia data with decoding level, and method and apparatus for reconstructing multimedia data with decoding level

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination