CN116978048A - Method, device, electronic equipment and storage medium for obtaining context content - Google Patents

Method, device, electronic equipment and storage medium for obtaining context content

Info

Publication number
CN116978048A
Authority
CN
China
Prior art keywords
text
context
image
image chart
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311242580.0A
Other languages
Chinese (zh)
Other versions
CN116978048B (en)
Inventor
李犇
张�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd filed Critical Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN202311242580.0A
Publication of CN116978048A
Application granted
Publication of CN116978048B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Input (AREA)

Abstract

The embodiments of the disclosure provide a method, a device, an electronic device and a storage medium for acquiring context content, relating to the technical field of chart image processing. The method comprises the following steps: dividing an image document into a plurality of layout blocks according to the layout format of the image document; acquiring the text paragraph blocks and image chart blocks in the layout blocks; converting the content in the text paragraph blocks into text characters; acquiring the context paragraph text of an image chart block according to the text characters; converting the context paragraph text into context sentence text; and acquiring the context content of the image chart block from the image chart block and the context sentence text through a multi-modal visual language model. With this scheme, the charts in an image document and the context content that describes them can be mined.

Description

Method, device, electronic equipment and storage medium for obtaining context content
Technical Field
The embodiments of the disclosure relate to the technical field of chart image processing, in particular to a method, a device, an electronic device and a storage medium for acquiring context content.
Background
In the financial field, a great deal of content is generated every day, such as financial news from securities markets, stock market bulletins, industry and company research reports, financial reports and information announcements of listed companies, and product publicity from fund, trust and insurance companies. This content often contains charts, images, text and the like, and exists in the form of rich-text-format image documents. Many financial institutions collect public financial image documents and extract the information in them for investment analysis and decision support. Charts carry rich information and present it intuitively, and the context around a chart often contains descriptions, interpretation and analysis of that chart.
Disclosure of Invention
The embodiments of the disclosure provide a method, a device, an electronic device and a storage medium for acquiring context content, which can mine the charts in an image document and the context content that describes them.
In a first aspect, an embodiment of the present disclosure provides a method for obtaining a context, the method including:
dividing an image document into a plurality of layout blocks according to the layout format of the image document;
acquiring a text paragraph block and an image chart block in the layout block;
converting the content in the text paragraph blocks into text characters;
acquiring a context paragraph text of the image chart block according to the text characters;
converting the text of the context paragraph into text of a context sentence;
and acquiring the context content of the image chart block according to the image chart block and the context sentence text through a multi-modal visual language model.
In a second aspect, the present disclosure provides a context acquisition apparatus, comprising:
the dividing module is used for dividing the image document into a plurality of layout blocks according to the layout format of the image document;
the first acquisition module is used for acquiring a text paragraph block and an image chart block in the layout block;
the character conversion module is used for converting the content in the text paragraph block into text characters;
the second acquisition module is used for acquiring the text of the context paragraph of the image chart block according to the text characters;
the text conversion module is used for converting the text of the context paragraph into text of a context sentence;
and the third acquisition module is used for acquiring the context content of the image chart block according to the image chart block and the context sentence text through the multi-modal visual language model.
In a third aspect, the present disclosure provides an electronic device, which may include:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores one or more computer programs executable by the at least one processor, and the one or more computer programs are executed by the at least one processor to enable the at least one processor to perform the context content acquisition method.
In a fourth aspect, the present disclosure provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the context content acquisition method.
According to the embodiments of the disclosure, an image document can be divided into a plurality of layout blocks according to its layout format, the text paragraph blocks and image chart blocks in those layout blocks can be obtained, and the content in the text paragraph blocks can be converted into text characters, so that multi-modal information about the image document is obtained with document layout analysis and character recognition techniques. In addition, the context paragraph text of an image chart block is obtained from the text characters, the context paragraph text is converted into context sentence text, and the context content of the image chart block is obtained from the image chart block and the context sentence text through the multi-modal visual language model. This realizes the mining and matching of the contextual semantic content of image charts in image documents based on multi-modal information and a multi-modal visual language model, and provides accurate data support for subsequent tasks (for example, investment analysis and decision support based on the chart information).
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. The above and other features and advantages will become more readily apparent to those skilled in the art by describing in detail exemplary embodiments with reference to the attached drawings, in which:
FIG. 1 is a flow chart of a method for obtaining contextual content provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a method for obtaining context according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of partitioning an image document into different layout blocks provided by an embodiment of the present disclosure;
FIG. 4 is a block diagram of a multimodal visual language model provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a method for inputting training data into a preset neural network structure and calculating a loss value of the neural network structure according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a similarity matrix provided by an embodiment of the present disclosure;
FIG. 7 is a flowchart of a method for obtaining context content of an image chart from an image chart block and context sentence text through a multimodal visual language model provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a method for obtaining context content of an image chart from an image chart block and context sentence text through a multimodal visual language model according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of a context acquisition device provided by an embodiment of the present disclosure;
FIG. 10 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For a better understanding of the technical solutions of the present disclosure, exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding, and they should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiments of the disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "connected," "coupled," and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the financial field, a great deal of content is generated every day, such as financial news from securities markets, stock market bulletins, industry and company research reports, financial reports and information announcements of listed companies, and product publicity from fund, trust and insurance companies. This content often contains charts, tables, text and the like, and exists in the form of rich-text-format image documents. Many financial institutions collect public financial image documents and extract the information in them for investment analysis and decision support. Charts carry rich information and present it intuitively, and the context around a chart often contains descriptions, interpretation and analysis of that chart.
Therefore, how to automatically extract charts from massive financial image documents and mine the descriptive and analytical content relevant to those charts for subsequent analysis and decision-making has become an important task in image-document information extraction.
Existing methods for extracting such image-document information generally identify the text, pictures, tables, charts and other elements in an image document based on document layout analysis (Document Layout Analysis, DLA) techniques, and then extract the characters and the textual content of tables and charts based on OCR (Optical Character Recognition) techniques; at present, they cannot mine and match the contextual semantic content of image charts.
In view of the above problems, the embodiments of the present disclosure provide a method for mining and matching the contextual semantic content of image charts in image documents, especially image documents in the financial field.
The scheme of the embodiments of the disclosure adds a multi-modal visual language model on top of DLA and OCR techniques, realizing the mining and matching of image chart semantic content in image documents.
The method for acquiring context content of the embodiments of the disclosure may be performed by an electronic device such as a terminal device or a server. The terminal device may include, but is not limited to: in-vehicle devices, user equipment (UE), mobile devices, etc.; the user equipment may include, but is not limited to: home computers, smart appliances, etc.; the mobile devices may include, but are not limited to: cellular phones, cordless phones, personal digital assistants (Personal Digital Assistant, PDA), handheld devices, portable computing devices, smart wearable devices, etc. The context content acquisition method may be implemented by a processor invoking computer-readable program instructions stored in a memory, or by a server executing a preset program.
The embodiments of the present disclosure are described in detail below.
The embodiments of the disclosure disclose a method for acquiring context content. As shown in fig. 1 and 2, the method may include steps S11-S16:
s11, dividing the image document into a plurality of layout blocks according to the layout format of the image document;
s12, acquiring a text paragraph block and an image chart block in the layout block;
s13, converting the content in the text paragraph blocks into text characters;
s14, acquiring a text of a context paragraph of the image chart block according to the text characters;
s15, converting the text of the context paragraph into text of the context sentence;
s16, acquiring the context content of the image chart block according to the image chart block and the context sentence text through the multi-mode visual language model.
In an embodiment of the present disclosure, dividing an image document into a plurality of layout blocks according to a layout format of the image document may include, but is not limited to:
the image document is input into a preset document layout analysis model, and the image document is divided into a plurality of layout blocks by the document layout analysis model.
In the embodiment of the present disclosure, the document layout analysis model may be a model existing at present, and the detailed model structure is not limited herein.
In an embodiment of the present disclosure, as shown in fig. 3 (fig. 3 is only a schematic diagram of one layout-block embodiment of the disclosure; the text and graphics inside the layout blocks are not limited or protected, so the details within the blocks need not be clearly displayed), the document layout analysis model may segment and classify the image document into different layout blocks, for example, including but not limited to: image chart blocks and text paragraph blocks, wherein the image chart blocks may include, but are not limited to: table blocks, chart blocks, image blocks, etc.
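The disclosure does not prescribe a particular layout analysis model. Purely as an illustration, an off-the-shelf detector such as the PubLayNet-trained model in the open-source layoutparser toolkit can produce this kind of block segmentation; the configuration below follows layoutparser's published quick-start example, and mapping its "Table"/"Figure" labels onto the image chart blocks of this disclosure is an assumption of the sketch.

```python
import cv2
import layoutparser as lp  # open-source DLA toolkit; one possible choice

page_image = cv2.imread("page.png")[..., ::-1]  # BGR -> RGB

# PubLayNet-trained detector, as in layoutparser's documentation examples.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

layout = model.detect(page_image)  # one block per detected layout region
text_paragraph_blocks = [b for b in layout if b.type == "Text"]
image_chart_blocks = [b for b in layout if b.type in ("Table", "Figure")]
```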
In an embodiment of the present disclosure, converting content in a text paragraph block into text characters includes:
inputting the text paragraph blocks into a preset character recognition model, and recognizing text characters in the text paragraph blocks through the character recognition model.
In the embodiment of the present disclosure, the character recognition model may be a currently existing model, and the detailed model structure is not limited herein. The character recognition model may include, but is not limited to: OCR models, i.e., optical character recognition models.
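As an illustrative sketch only (the disclosure merely requires "a preset character recognition model"), the open-source Tesseract engine can serve as the OCR back end; choosing the simplified-Chinese language pack here is an assumption, matching the financial documents discussed above.

```python
import pytesseract
from PIL import Image

def recognize_text(block_image: Image.Image) -> str:
    # OCR one cropped text paragraph block into plain text characters.
    return pytesseract.image_to_string(block_image, lang="chi_sim")
```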
In an embodiment of the present disclosure, obtaining a context paragraph text of an image chart block from text characters includes:
acquiring text paragraph blocks before and after the image chart block according to the position of the image chart block in the image document, and taking the text paragraph blocks as context paragraph blocks of the image chart block;
and obtaining text characters corresponding to the context paragraph blocks from the text characters, and generating the context paragraph text of the image chart blocks.
In the embodiments of the disclosure, when the image document is divided into layout blocks, the relative positions of the different layout blocks can be recorded. After an image chart block is extracted, the text paragraph blocks before and after it are acquired according to the recorded relative positions, giving the context paragraph blocks of the image chart block; the text characters corresponding to those context paragraph blocks can then be extracted from the text characters already obtained for the text paragraph blocks in the previous step, yielding the context paragraph text of the image chart block.
In the embodiment of the disclosure, after obtaining the context paragraph block of the image chart block, the context paragraph block may be input into a preset character recognition model to obtain text characters corresponding to the context paragraph block, and the context paragraph text of the image chart block may be obtained according to the text characters.
In the embodiment of the present disclosure, a plurality of text paragraph blocks before and after the image chart block may be sampled as the context paragraph blocks of the image chart block; the number of sampled text paragraph blocks is not limited herein.
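A minimal sketch of this position-based sampling, assuming each layout block carries the reading-order index recorded at division time and a block-type tag; the field names (`order`, `kind`) and the default sample count k are illustrative assumptions, since the disclosure leaves the sample count open.

```python
from dataclasses import dataclass

@dataclass
class LayoutBlock:
    order: int   # reading-order index recorded when dividing the document
    kind: str    # "paragraph" or "chart" (assumed tags)

def context_paragraph_blocks(chart: LayoutBlock, blocks: list, k: int = 2):
    """Sample up to k text paragraph blocks before and after the chart block."""
    paragraphs = sorted((b for b in blocks if b.kind == "paragraph"),
                        key=lambda b: b.order)
    before = [b for b in paragraphs if b.order < chart.order]
    after = [b for b in paragraphs if b.order > chart.order]
    return before[-k:] + after[:k]
```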
In an embodiment of the present disclosure, converting the context paragraph text into context sentence text includes:
and segmenting the text of the context paragraph according to punctuation marks to generate text of the context sentence.
In the embodiments of the disclosure, converting the context paragraph text into context sentence text makes it convenient to compare and match each individual sentence with the image chart block, providing a technical basis for improving the accuracy of content recognition.
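A minimal punctuation-based splitter; the disclosure says only "according to punctuation marks", so the exact punctuation set below (Chinese and Western sentence-ending marks) is an assumption.

```python
import re

def split_sentences(paragraph: str) -> list[str]:
    # Split after each sentence-ending mark while keeping the mark
    # attached to its sentence; the character class is illustrative.
    parts = re.split(r"(?<=[。！？!?；;])", paragraph)
    return [p.strip() for p in parts if p.strip()]
```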
In the embodiment of the disclosure, the image chart block and the context sentence text corresponding to the image chart block can be input into the multi-modal visual language model together, and the description text related to the content of the image chart block is mined and matched from the context paragraph text of the image chart block through the multi-modal visual language model.
In the embodiment of the disclosure, before the context content of the image chart block is acquired through the multi-modal visual language model, the neural network structure corresponding to the multi-modal visual language model may be trained first to obtain the multi-modal visual language model with the image chart context extraction function.
In an embodiment of the present disclosure, a method for acquiring a multimodal visual language model may include:
acquiring training data, which may include image sample data of an image chart and text sample data of the content of the image chart, the text sample data comprising sentence text sample data;
inputting training data into a preset neural network structure, and calculating a loss value of the neural network structure; the neural network structure includes: a visual encoder structure, a text encoder structure, a computation layer structure, and an output layer structure;
in case that the loss value satisfies a preset condition, parameters of a visual encoder structure, a text encoder structure, a calculation layer structure and an output layer structure are saved to obtain a visual encoder 101, a text encoder 102, a calculation layer 103 and an output layer 104, and the multi-modal visual language model 100 is obtained as shown in fig. 4.
In the embodiment of the present disclosure, as shown in fig. 5, inputting training data into a preset neural network structure, calculating a loss value of the neural network structure may include steps S21 to S24:
s21, extracting image chart features of image sample data of an image chart through a visual encoder structure to obtain image chart sample feature vectors;
s22, extracting text characteristics of text sample data of image chart content through a text encoder structure to obtain text sample characteristic vectors;
s23, calculating the similarity of the sample feature vector of the image chart and the sample feature vector of the text to obtain a similarity matrix; the similarity matrix comprises corresponding relations between each image chart feature in the image chart sample feature vector and each text feature in the text sample feature vector, and each corresponding relation indicates the similarity degree of the image chart feature and the text feature by adopting a preset sample mark; under the condition that the similarity degree is greater than or equal to a preset second similarity degree threshold value, the sample mark is a first mark, and under the condition that the similarity degree is less than the second similarity degree threshold value, the sample mark is a second mark;
s24, calculating a first cross entropy loss of the image chart sample feature vector and all sample identifiers and a second cross entropy loss of the text sample feature vector and all sample identifiers, and calculating a mean value of the first cross entropy loss and the second cross entropy loss as a loss value of the neural network structure.
In the disclosed embodiments, the training data of the multi-modal visual language model consists of pairs of an image chart (i.e., the image sample data) and its corresponding description text (i.e., the text sample data). During model training, the image chart is input into the visual encoder structure and the description text into the text encoder structure, yielding the embedded vector representation Ii of the image chart (i.e., the image chart sample feature vector) and the embedded vector representation Ti of the text description (i.e., the text sample feature vector). The cosine similarities between the vectors Ii and the vectors Ti are then calculated to obtain a similarity matrix, in which the positive samples (where the similarity between image chart features and text features is greater than or equal to the second similarity threshold) lie on the diagonal with Label=1, and the remaining entries are negative samples (similarity below the second similarity threshold) with Label=0. The cross-entropy loss of the vectors Ii against all Labels and of the vectors Ti against all Labels are then computed, and the neural network structure is trained by minimizing the mean of the two cross-entropy losses, yielding the multi-modal visual language model.
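A compact PyTorch sketch of this objective (cosine-similarity matrix, diagonal positives, cross-entropy in both directions, averaged). The encoders themselves are placeholders; CLIP-style implementations usually also add a learnable temperature on the logits, which the disclosure does not mention, so it is omitted here.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor) -> torch.Tensor:
    """image_emb, text_emb: (N, d) embeddings of N chart/description pairs.
    Row i of both tensors comes from the same pair, so the positives
    (Label=1) sit on the diagonal of the similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = image_emb @ text_emb.t()                # cosine similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    loss_i = F.cross_entropy(sim, labels)         # Ii against all Labels
    loss_t = F.cross_entropy(sim.t(), labels)     # Ti against all Labels
    return (loss_i + loss_t) / 2                  # mean of the two losses
```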
In an embodiment of the present disclosure, as shown in fig. 6, taking charts as an example, the charts in image format in fig. 6 (e.g., chart 1, chart 2, chart 3, chart 4 and chart 5) are input into the visual encoder (Visual Encoder) to obtain the image chart sample feature vectors (I1, I2, I3, I4, I5), and the texts corresponding to the charts (e.g., chart 1 text through chart 5 text) are input into the text encoder (Text Encoder) to obtain the text sample feature vectors (T1, T2, T3, T4, T5). Each image chart feature in the image chart sample feature vector and each text feature in the text sample feature vector correspond to a sample identification IiTj (i, j = 1, 2, 3, 4, 5), and these identifications form the similarity matrix I1T1, I1T2, ..., I1T5, I2T1, ..., I5T5, in which the diagonal entries I1T1, I2T2, I3T3, I4T4 and I5T5 are positive samples and all remaining entries are negative samples.
In embodiments of the present disclosure, the multimodal visual language model 100 may include, but is not limited to: a visual encoder 101, a text encoder 102, a computation layer 103 and an output layer 104.
In the embodiment of the present disclosure, as shown in fig. 7, obtaining, by the multimodal visual language model, the context content of the image chart block according to the image chart block and the context sentence text may include steps S31-S34:
s31, extracting image chart features of the image chart blocks through the visual encoder 101 to obtain image chart feature vectors;
s32, extracting text characteristics of each sentence in the context sentence text through the text encoder 102, and obtaining sentence text characteristic vectors corresponding to each sentence;
s33, respectively calculating the similarity between the image chart feature vector and the sentence text feature vector of each sentence through the calculation layer 103;
s34, sentences corresponding to sentence text feature vectors with the similarity of the image chart feature vectors being greater than or equal to a preset first similarity threshold are acquired through the output layer 104, and the acquired sentences are combined to obtain the context content of the image chart block.
In the embodiments of the disclosure, in the inference stage of the multi-modal visual language model, an image chart block in image format identified by the document layout analysis model can be input into the visual encoder to obtain the embedded vector of the image chart, i.e., the image chart feature vector. The context paragraph text of the image chart extracted by the OCR model can be segmented into context sentence texts, generating a sentence list (containing one or more context sentence texts), which is input into the text encoder to obtain the embedded vector of each sentence, i.e., the sentence text feature vectors. The similarity between the image chart feature vector and the sentence text feature vector of each sentence can then be calculated, for example (but not limited to) as cosine similarity. According to a preset first similarity threshold, the sentences whose similarity is greater than or equal to the first similarity threshold are screened out of the calculated similarities as the sentences contained in the context of the image chart, and these sentences can be combined to obtain the context content corresponding to the image chart block.
In the embodiments of the disclosure, as shown in fig. 8, still taking a chart as an example, fig. 8 includes a chart in image format (i.e., an image chart) and the texts corresponding to the chart (e.g., the texts of a plurality of sentences contained in the chart's context paragraph, such as sentence 1, sentence 2, sentence 3, sentence 4 and sentence 5). The image chart is input into the visual encoder (Visual Encoder) to obtain a chart vector (i.e., the image chart feature vector), and sentences 1-5 are input into the text encoder (Text Encoder) to obtain the corresponding sentence vectors (i.e., the sentence text feature vectors, e.g., sentence vector 1 through sentence vector 5). Cosine similarity is calculated between the chart vector and each sentence vector, giving the similarities it_1, it_2, it_3, it_4 and it_5. The similarities greater than or equal to the preset first similarity threshold can be screened out of it_1 through it_5, the sentences corresponding to the screened similarities are taken as the sentences contained in the context of the image chart block, and these sentences are then combined to obtain the context content of the image chart block.
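A sketch of this inference stage (steps S31-S34), assuming the trained encoders are callables returning embeddings; the threshold value 0.3 is only a placeholder for the "preset first similarity threshold".

```python
import torch
import torch.nn.functional as F

def match_context(chart_image, sentences, visual_encoder, text_encoder,
                  threshold: float = 0.3):
    with torch.no_grad():
        img_vec = F.normalize(visual_encoder(chart_image), dim=-1)  # (1, d)
        txt_vecs = F.normalize(text_encoder(sentences), dim=-1)     # (n, d)
        sims = (txt_vecs @ img_vec.t()).squeeze(-1)                 # it_1..it_n
    # Keep the sentences whose cosine similarity clears the threshold,
    # preserving their original order in the context paragraph.
    return [s for s, sim in zip(sentences, sims.tolist()) if sim >= threshold]
```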
In an embodiment of the present disclosure, combining the obtained sentences to obtain the context content of the image chart block includes:
and combining the sentences according to a preset template and/or splicing the sentences according to a preset sequence to obtain the context content of the image chart block.
In the embodiments of the present disclosure, a distribution form of the sentences may be set in the preset template; for example, each sentence may be treated as one content gist and displayed in gist form.
In embodiments of the present disclosure, the preset sequence may include, but is not limited to: splicing the sentences in the order in which they appear in the context paragraph text.
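A tiny illustration of the two combination rules just described; the bullet rendering of the template variant is an assumed format.

```python
def combine_sentences(sentences: list[str], mode: str = "sequence") -> str:
    if mode == "template":
        # Render each sentence as one content gist (an assumed template).
        return "\n".join(f"- {s}" for s in sentences)
    # Default: splice the sentences in the order they appeared in the paragraph.
    return "".join(sentences)
```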
In the embodiments of the disclosure, the image charts and their context paragraph texts are extracted from an image document by creatively combining document layout analysis and optical character recognition techniques; the context paragraph text is converted into context sentence text, and a multi-modal visual language model is used to automatically mine, from the context sentence text, the description text corresponding to the content reflected by the image chart. Image charts in image documents are thus extracted automatically, and the corresponding chart content description text is automatically mined and matched, providing accurate data support for subsequent tasks (for example, investment analysis and decision support based on the chart information).
In the embodiments of the disclosure, the mining and matching of chart description text in an image document can also be realized by regular-expression matching: the labels and titles of an image chart are identified, and the corresponding text sentences are screened from the text above and below the chart by regular matching to form the context content of the image chart.
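The disclosure sketches this regular-matching variant only at a high level; the caption patterns below ("图/表 N", "Figure/Table N") are illustrative assumptions.

```python
import re

# Illustrative chart label/title patterns; real documents may need more.
CAPTION = re.compile(r"(图|表|Figure|Table)\s*\d+")

def regex_context(sentences_above: list[str],
                  sentences_below: list[str]) -> list[str]:
    """Keep the sentences that explicitly reference a chart label or title."""
    return [s for s in sentences_above + sentences_below if CAPTION.search(s)]
```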
The embodiment of the present disclosure provides a context acquisition apparatus 200, as shown in fig. 9, may include:
a dividing module 201 for dividing the image document into a plurality of layout blocks according to the layout format of the image document;
a first obtaining module 202, configured to obtain a text paragraph block and an image chart block in the layout block;
the character conversion module 203 is configured to convert the content in the text paragraph block into text characters;
a second obtaining module 204, configured to obtain a context paragraph text of the image chart block according to the text characters;
a text conversion module 205, configured to convert the text of the context paragraph into text of a context sentence;
a third obtaining module 206, configured to obtain, through the multi-modal visual language model, the context content of the image chart block according to the image chart block and the context sentence text.
The embodiment of the present disclosure provides an electronic device 300, as shown in fig. 10, the electronic device 300 includes:
at least one processor 301; and
a memory 302 communicatively coupled to the at least one processor 301; wherein
the memory 302 stores one or more computer programs executable by the at least one processor 301, the one or more computer programs being executable by the at least one processor 301 to enable the at least one processor 301 to perform the context content acquisition method described above.
In an embodiment of the present disclosure, the electronic device 300 may further include: one or more I/O (input/output) interfaces 303 are connected between the processor 301 and the memory 302.
The disclosed embodiments also provide a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the context acquisition method described above. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when executed in a processor of an electronic device, performs the above-described context acquisition method.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable program instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, random Access Memory (RAM), read Only Memory (ROM), erasable Programmable Read Only Memory (EPROM), static Random Access Memory (SRAM), flash memory or other memory technology, portable compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and may include any information delivery media.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
The computer program product described herein may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (10)

1. A method of contextual content acquisition, the method comprising:
dividing an image document into a plurality of layout blocks according to the layout format of the image document;
acquiring a text paragraph block and an image chart block in the layout block;
converting the content in the text paragraph blocks into text characters;
acquiring a context paragraph text of the image chart block according to the text characters;
converting the text of the context paragraph into text of a context sentence;
and acquiring the context content of the image chart block according to the image chart block and the context sentence text through a multi-modal visual language model.
2. The method of claim 1, wherein the acquiring the context paragraph text of the image chart block according to the text characters comprises:
acquiring text paragraph blocks before and after the image chart block according to the position of the image chart block in the image document, and taking the text paragraph blocks as context paragraph blocks of the image chart block;
and acquiring text characters corresponding to the context paragraph blocks from the text characters, and generating the context paragraph text of the image chart blocks.
3. The method of claim 1, wherein said converting the context paragraph text into context sentence text comprises:
and segmenting the text of the context paragraph according to punctuation marks to generate the text of the context sentence.
4. The contextual content acquisition method according to claim 1, wherein the multimodal visual language model comprises: a visual encoder, a text encoder, a computation layer and an output layer;
the obtaining, by the multimodal visual language model, the context content of the image chart block according to the image chart block and the context sentence text includes:
extracting image chart features of the image chart blocks through the visual encoder to obtain image chart feature vectors;
extracting text characteristics of each sentence in the context sentence text through the text encoder, and obtaining sentence text characteristic vectors corresponding to each sentence;
calculating the similarity between the image chart feature vector and the sentence text feature vector of each sentence through the calculation layer;
and acquiring sentences corresponding to sentence text feature vectors with the similarity of the image chart feature vectors being greater than or equal to a preset first similarity threshold value through the output layer, and combining the acquired sentences to obtain the context content of the image chart block.
5. The method for obtaining contextual content according to claim 4, wherein the method for obtaining the multimodal visual language model comprises:
acquiring training data, wherein the training data comprises image sample data of an image chart and text sample data of the content of the image chart;
inputting the training data into a preset neural network structure, and calculating a loss value of the neural network structure; the neural network structure includes: a visual encoder structure, a text encoder structure, a computation layer structure, and an output layer structure;
and under the condition that the loss value meets the preset condition, parameters of the visual encoder structure, the text encoder structure, the calculation layer structure and the output layer structure are stored to obtain the visual encoder, the text encoder, the calculation layer and the output layer, and the multi-modal visual language model is obtained.
6. The method for obtaining context according to claim 5, wherein inputting the training data into a preset neural network structure, calculating a loss value of the neural network structure, comprises:
extracting image chart features of the image sample data of the image chart through the visual encoder structure to obtain image chart sample feature vectors;
extracting text characteristics of text sample data of the image chart content through the text encoder structure to obtain text sample characteristic vectors;
calculating the similarity of the image chart sample feature vector and the text sample feature vector to obtain a similarity matrix; the similarity matrix comprises corresponding relations between each image chart feature in the image chart sample feature vector and each text feature in the text sample feature vector, and each corresponding relation indicates the similarity degree of the image chart feature and the text feature by adopting a preset sample mark; the sample mark is a first mark when the similarity degree is larger than or equal to a preset second similarity degree threshold value, and is a second mark when the similarity degree is smaller than the second similarity degree threshold value;
and calculating a first cross entropy loss of the image chart sample feature vector and all the sample identifications and a second cross entropy loss of the text sample feature vector and all the sample identifications, and calculating an average value of the first cross entropy loss and the second cross entropy loss as a loss value of the neural network structure.
7. The method for obtaining context according to claim 4, wherein the combining the obtained sentences to obtain the context of the image chart block comprises:
and combining the sentences according to a preset template and/or splicing the sentences according to a preset sequence to obtain the context content of the image chart block.
8. A contextual content acquisition device, comprising:
the dividing module is used for dividing the image document into a plurality of layout blocks according to the layout format of the image document;
the first acquisition module is used for acquiring a text paragraph block and an image chart block in the layout block;
the character conversion module is used for converting the content in the text paragraph block into text characters;
the second acquisition module is used for acquiring the text of the context paragraph of the image chart block according to the text characters;
the text conversion module is used for converting the text of the context paragraph into text of a context sentence;
and the third acquisition module is used for acquiring the context content of the image chart block according to the image chart block and the context sentence text through the multi-modal visual language model.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the contextual content acquisition method according to any of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the context content retrieval method according to any of claims 1-7.
CN202311242580.0A 2023-09-25 2023-09-25 Method, device, electronic equipment and storage medium for obtaining context content Active CN116978048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311242580.0A CN116978048B (en) 2023-09-25 2023-09-25 Method, device, electronic equipment and storage medium for obtaining context content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311242580.0A CN116978048B (en) 2023-09-25 2023-09-25 Method, device, electronic equipment and storage medium for obtaining context content

Publications (2)

Publication Number Publication Date
CN116978048A 2023-10-31
CN116978048B CN116978048B (en) 2023-12-22

Family

ID=88480033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311242580.0A Active CN116978048B (en) 2023-09-25 2023-09-25 Method, device, electronic equipment and storage medium for obtaining context content

Country Status (1)

Country Link
CN (1) CN116978048B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312601A (en) * 2023-11-03 2023-12-29 杭州瑞成信息技术股份有限公司 Document content enhancement retrieval system and method based on multi-mode information fusion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495130A (en) * 2021-12-27 2022-05-13 北京百度网讯科技有限公司 Cross-modal information-based document reading understanding model training method and device
CN115713538A (en) * 2022-11-18 2023-02-24 电子科技大学 Cross-modal-state-pair-graph-alignment-based reference image segmentation method
WO2023093574A1 (en) * 2021-11-25 2023-06-01 北京邮电大学 News event search method and system based on multi-level image-text semantic alignment model
CN116775927A (en) * 2023-05-23 2023-09-19 北京交通大学 Cross-modal image-text retrieval method and system based on local context
CN116797848A (en) * 2023-07-12 2023-09-22 山西大学 Disease positioning method and system based on medical image text alignment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023093574A1 (en) * 2021-11-25 2023-06-01 北京邮电大学 News event search method and system based on multi-level image-text semantic alignment model
CN114495130A (en) * 2021-12-27 2022-05-13 北京百度网讯科技有限公司 Cross-modal information-based document reading understanding model training method and device
CN115713538A (en) * 2022-11-18 2023-02-24 电子科技大学 Cross-modal-state-pair-graph-alignment-based reference image segmentation method
CN116775927A (en) * 2023-05-23 2023-09-19 北京交通大学 Cross-modal image-text retrieval method and system based on local context
CN116797848A (en) * 2023-07-12 2023-09-22 山西大学 Disease positioning method and system based on medical image text alignment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LUO Huilan; YUE Liangliang: "Image captioning with cross-layer multi-model feature fusion and causal convolutional decoding" (跨层多模型特征融合与因果卷积解码的图像描述), Journal of Image and Graphics (中国图象图形学报), no. 08

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312601A (en) * 2023-11-03 2023-12-29 杭州瑞成信息技术股份有限公司 Document content enhancement retrieval system and method based on multi-mode information fusion

Also Published As

Publication number Publication date
CN116978048B (en) 2023-12-22

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN109902271B (en) Text data labeling method, device, terminal and medium based on transfer learning
Rigaud et al. Knowledge-driven understanding of images in comic books
CN108108342B (en) Structured text generation method, search method and device
CN113807098A (en) Model training method and device, electronic equipment and storage medium
CN108628830B (en) Semantic recognition method and device
CN116978048B (en) Method, device, electronic equipment and storage medium for obtaining context content
CN113360699B (en) Model training method and device, and image question-answering method and device
CN112800848A (en) Structured extraction method, device and equipment of information after bill identification
US9588952B2 (en) Collaboratively reconstituting tables
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN111950279A (en) Entity relationship processing method, device, equipment and computer readable storage medium
CN111695337A (en) Method, device, equipment and medium for extracting professional terms in intelligent interview
CN114022891A (en) Method, device and equipment for extracting key information of scanned text and storage medium
US10261987B1 (en) Pre-processing E-book in scanned format
CN111814481A (en) Shopping intention identification method and device, terminal equipment and storage medium
CN115730237B (en) Junk mail detection method, device, computer equipment and storage medium
CN116151238A (en) Information extraction method and device, electronic equipment and computer readable storage medium
CN115730603A (en) Information extraction method, device, equipment and storage medium based on artificial intelligence
CN115130437A (en) Intelligent document filling method and device and storage medium
CN114495138A (en) Intelligent document identification and feature extraction method, device platform and storage medium
CN113434695A (en) Financial event extraction method and device, electronic equipment and storage medium
CN113609833A (en) Dynamic generation method and device of file, computer equipment and storage medium
CN112801960A (en) Image processing method and device, storage medium and electronic equipment
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant