CN116029277B - Multi-mode knowledge analysis method, device, storage medium and equipment - Google Patents

Multi-mode knowledge analysis method, device, storage medium and equipment

Info

Publication number
CN116029277B
CN116029277B
Authority
CN
China
Prior art keywords
document
knowledge
metadata
storing
text
Prior art date
Legal status
Active
Application number
CN202211625685.XA
Other languages
Chinese (zh)
Other versions
CN116029277A (en)
Inventor
杨娟
翟士丹
林健
Current Assignee
Beijing Haizhi Xingtu Technology Co ltd
Original Assignee
Beijing Haizhi Xingtu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Haizhi Xingtu Technology Co ltd filed Critical Beijing Haizhi Xingtu Technology Co ltd
Priority to CN202211625685.XA priority Critical patent/CN116029277B/en
Publication of CN116029277A publication Critical patent/CN116029277A/en
Application granted Critical
Publication of CN116029277B publication Critical patent/CN116029277B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a multi-modal knowledge parsing method, apparatus, storage medium and device. The method includes: receiving knowledge to be parsed; parsing and storing metadata of the knowledge to be parsed by using a parsing engine, where the metadata includes at least the document type and the document types include document text, pictures and audio/video text; parsing and storing the document text content; and parsing and storing the pictures and audio/video text. The invention can parse knowledge of different types conveniently and with high parsing efficiency.

Description

Multi-mode knowledge analysis method, device, storage medium and equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a storage medium, and a device for multi-modal knowledge analysis.
Background
With the development of technology, people generate a large volume of documents, pictures and audio/video data in daily life and work. Such unstructured data is difficult to use directly: when users need to obtain specific information from this knowledge, the unstructured data must first be converted into structured data.
The prior art handles only a single class of knowledge at a time, which is inconvenient when multi-modal knowledge is involved.
Disclosure of Invention
In view of the above, the present invention provides a multi-modal knowledge parsing method, apparatus, storage medium and device, which can parse different types of knowledge conveniently and with high parsing efficiency.
In a first aspect, an embodiment of the present invention provides a method for multi-modal knowledge resolution, where the method includes:
receiving knowledge to be analyzed;
analyzing and storing metadata of the knowledge to be analyzed by utilizing an analysis engine, wherein the metadata at least comprises document types, and the document types comprise document texts, pictures and audio/video texts;
analyzing and storing the text content of the document;
and analyzing and storing the pictures and the audio/video texts.
Further, the metadata also includes a title, document size, author, version number, document classification and tags.
Further, the parsing the metadata of the knowledge to be parsed by using the parsing engine includes:
the analysis engine automatically establishes the association relation of each knowledge according to the author of the knowledge and the document type;
the analysis engine classifies and tags the documents of each knowledge according to the titles and authors of the knowledge;
and creating a metadata index and storing the association relation of the metadata and the knowledge in a relational database.
Further, parsing and storing the document text content includes:
identifying characters on pictures in the document text by using an optical character recognition method;
parsing, by multi-level titles, page numbers and paragraphs, the content of the document text and the document obtained after character recognition;
establishing a corresponding relation between the multi-level title index and the page index and between the multi-level title index and the paragraph index, and storing the corresponding relation in the full text index;
vectorizing the content and titles of the pictures and audio/video in the document text by using a vectorization algorithm;
and storing the vectorized content and titles of the pictures and audio/video in the document text in a vectorization database.
Further, a correspondence relationship between the multi-level title index and the page index, and between the paragraph index and the metadata index is established.
Further, establishing the multi-level title index includes:
the title is divided into at least two levels of sub-titles, and each sub-title is expanded;
the upper level title and the current subtitle of each subtitle are used as title indexes.
Further, parsing and storing the picture, the audio and video text includes:
vectorizing the contents and titles of the pictures and the audio/video texts by using a vectorization algorithm;
and storing the content and the title of the vectorized picture and the audio/video in a vectorized database.
In a second aspect, an embodiment of the present invention provides an apparatus for multi-modal knowledge resolution, where the apparatus includes:
the receiving module is used for receiving the knowledge to be analyzed;
the metadata analysis module is used for analyzing and storing metadata of the knowledge to be analyzed by utilizing an analysis engine, wherein the metadata at least comprises document types, and the document types comprise document texts, pictures and audio/video texts;
the document content analysis module is used for analyzing and storing the document text content;
and the picture and audio/video content analysis module is used for analyzing and storing the picture and audio/video text.
In a third aspect, an embodiment of the present invention provides a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of the first aspects when run.
In a fourth aspect, an embodiment of the invention provides an apparatus comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method of any of the first aspects.
According to the technical solution provided by the invention, the metadata of the knowledge to be parsed is parsed and stored by a parsing engine, and the knowledge content is then parsed in different ways according to the file type recorded in the metadata. The method and apparatus can therefore parse various types of knowledge with high parsing speed and efficiency.
The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be more clearly understood and implemented in accordance with the content of the specification, and so that the above and other objects, features and advantages of the invention become more readily apparent, specific embodiments of the invention are described below.
Drawings
FIG. 1 is a flowchart of a multi-modal knowledge parsing method provided by an embodiment of the present invention;
FIG. 2 is a timing diagram of a multi-modal knowledge parsing method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a multi-modal knowledge analysis device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart of a multi-modal knowledge parsing method according to an embodiment of the present invention, where the method includes the following steps:
Step 101, receiving the knowledge to be parsed.
In this step, the user uploads knowledge to be parsed, which may be various types of unstructured knowledge, such as txt, doc, pdf, pictures, audio, video, etc.
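As an illustration only, and not part of the claimed method, the following Python sketch shows one way the uploaded knowledge could be routed to a coarse document type by file extension; the category names and extension sets are assumptions made for this example.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the document types named above;
# the extension sets are illustrative assumptions, not an exhaustive specification.
DOCUMENT_TEXT_EXT = {".txt", ".doc", ".docx", ".pdf"}
PICTURE_EXT = {".jpg", ".jpeg", ".png", ".bmp"}
AUDIO_VIDEO_EXT = {".mp3", ".wav", ".mp4", ".avi"}


def classify_document_type(filename: str) -> str:
    """Return a coarse document type for a piece of knowledge to be parsed."""
    ext = Path(filename).suffix.lower()
    if ext in DOCUMENT_TEXT_EXT:
        return "document_text"
    if ext in PICTURE_EXT:
        return "picture"
    if ext in AUDIO_VIDEO_EXT:
        return "audio_video"
    return "unknown"


print(classify_document_type("report.pdf"))   # document_text
print(classify_document_type("diagram.png"))  # picture
```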
Step 102, parsing and storing metadata of the knowledge to be parsed by using a parsing engine, where the metadata includes at least the document type, and the document types include document text, pictures and audio/video text.
In this step, the parsing engine parses and stores the metadata of the knowledge to be parsed, where the metadata includes at least the document type and the document types include document text, pictures and audio/video text. "Document text" means that the knowledge to be parsed is a document (which may itself contain pictures or audio/video), while "pictures and audio/video text" means that the knowledge to be parsed is itself a picture or an audio/video file.
In some embodiments, parsing and storing metadata of knowledge to be parsed using a parsing engine may include:
step 121, the parsing engine automatically establishes the association relation of each knowledge according to the author of the knowledge and the document type;
step 122, the analysis engine classifies and labels the documents of each knowledge according to the title and author of the knowledge;
step 123, creating a metadata index and storing the metadata and the association relation of each knowledge in a relational database.
That is, after the parsing engine obtains the metadata of the knowledge to be parsed, it automatically establishes association relations among the pieces of knowledge according to the authors and document types in the metadata, and classifies and tags each document according to its title and author. The association relations between the knowledge and the metadata are then stored in a relational database, and the metadata index is stored in a full-text index database.
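For illustration, a minimal Python sketch of this metadata step under stated assumptions: the field names follow the metadata listed above, SQLite stands in for the relational database, and "same author" is taken as one plausible association rule; none of these choices are prescribed by the method itself.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class KnowledgeMetadata:
    # Fields mirror the metadata named in the text; defaults are assumptions.
    doc_id: str
    title: str
    author: str
    doc_type: str            # "document_text", "picture", or "audio_video"
    size_bytes: int = 0
    version: str = "1"
    classification: str = ""
    tags: list = field(default_factory=list)


def store_metadata_and_associations(records, db):
    """Persist metadata and author-based association relations in a relational store."""
    db.execute("CREATE TABLE IF NOT EXISTS metadata ("
               "doc_id TEXT PRIMARY KEY, title TEXT, author TEXT, doc_type TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS associations (doc_a TEXT, doc_b TEXT, reason TEXT)")
    for r in records:
        db.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, ?)",
                   (r.doc_id, r.title, r.author, r.doc_type))
    # Associate pieces of knowledge that share an author -- one plausible reading of
    # "establish the association relation according to the author and document type".
    for a in records:
        for b in records:
            if a.doc_id < b.doc_id and a.author == b.author:
                db.execute("INSERT INTO associations VALUES (?, ?, ?)",
                           (a.doc_id, b.doc_id, "same_author"))
    db.commit()


conn = sqlite3.connect(":memory:")
store_metadata_and_associations(
    [KnowledgeMetadata("d1", "Design spec", "Alice", "document_text"),
     KnowledgeMetadata("v1", "Design walkthrough", "Alice", "audio_video")],
    conn,
)
print(conn.execute("SELECT * FROM associations").fetchall())  # [('d1', 'v1', 'same_author')]
```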
Step 103, parsing and storing the document text content.
In this step, the document text falls into two cases: in the first case the document contains only text, and in the second case it contains text together with pictures and/or screenshots of audio/video.
When the document text contains only text, parsing the document text content may include parsing the content by multi-level titles, page numbers and paragraphs, then establishing the correspondences among the multi-level title index, page-number index and paragraph index and storing them in the full-text index.
When the document text also contains pictures and/or audio/video, the characters in the pictures are first recognized with an optical character recognition (OCR) method, or the audio/video in the document text is converted into text with a speech-to-text method. The document obtained after character recognition or speech-to-text conversion is then parsed by multi-level titles, page numbers and paragraphs, and the correspondences among the multi-level title index, page-number index and paragraph index are established and stored in the full-text index. Finally, the content and titles of the pictures and audio/video in the document text are vectorized with a vectorization algorithm, and the vectorized content and titles are stored in a vectorization database.
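A minimal sketch of this structuring step, assuming the page text has already been extracted (with any OCR or speech-to-text output merged in); the numbered-heading pattern and the index-entry schema are assumptions made only for this example.

```python
import re

HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")  # e.g. "2.1 Data model"


def build_fulltext_index(pages):
    """Split each page into paragraphs, track the current multi-level heading path,
    and emit entries linking heading path -> page number -> paragraph number."""
    entries, heading_path = [], []
    for page_no, page in enumerate(pages, start=1):
        paragraphs = [p.strip() for p in page.split("\n\n") if p.strip()]
        for para_no, para in enumerate(paragraphs, start=1):
            match = HEADING_RE.match(para)
            if match:
                level = match.group(1).count(".") + 1
                heading_path = heading_path[:level - 1] + [match.group(2)]
            entries.append({
                "heading_path": " / ".join(heading_path),
                "page": page_no,
                "paragraph": para_no,
                "text": para,
            })
    return entries


demo_pages = ["1 Overview\n\nThis document describes the system.",
              "1.1 Storage\n\nThree databases are used."]
for entry in build_fulltext_index(demo_pages):
    print(entry["heading_path"], "| page", entry["page"], "| paragraph", entry["paragraph"])
```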
In some embodiments, the method further includes establishing correspondences between the multi-level title index and the page-number index, and between the paragraph index and the metadata index, so that an association is established for each document.
In some embodiments, establishing the multi-level title index may be accomplished by the following steps (a sketch of these steps is given after the list):
step 1031, dividing the title into at least two levels of sub-titles, and expanding each sub-title;
step 1032, using the upper level title and the current subtitle of each subtitle as the title index.
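The sketch below illustrates the "upper-level title plus current subtitle" rule on a hierarchical heading path; the separator and the treatment of the top-level title are assumptions made for the example.

```python
def title_index_keys(heading_path):
    """For each subtitle, combine it with its immediate upper-level title to form a title index key."""
    keys = []
    for i, subtitle in enumerate(heading_path):
        parent = heading_path[i - 1] if i > 0 else ""
        keys.append(f"{parent} > {subtitle}" if parent else subtitle)
    return keys


print(title_index_keys(["Installation", "Requirements", "Python version"]))
# ['Installation', 'Installation > Requirements', 'Requirements > Python version']
```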
Step 104, parsing and storing the pictures and audio/video text.
In this step, parsing and storing the pictures and audio/video text may be implemented by the following steps (a sketch is given after the list):
step 1041, performing vectorization processing on the content and the title of the picture and the audio/video text by using a vectorization algorithm;
step 1042, store the content and title of the vectorized picture and audio/video in the vectorized database.
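A minimal sketch of steps 1041-1042, in which a deterministic hash-based embedding stands in for the real vectorization algorithm and an in-memory list stands in for the vectorization database; both are placeholders, not the actual components.

```python
import hashlib
import math


def embed(text, dim=8):
    """Placeholder for the vectorization algorithm: a deterministic hash-based vector.
    A real system would use an image/audio/text encoder; this only illustrates the flow."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class VectorStore:
    """Minimal in-memory stand-in for the vectorization database."""

    def __init__(self):
        self.rows = []  # (doc_id, field, vector)

    def add(self, doc_id, field, text):
        self.rows.append((doc_id, field, embed(text)))


store = VectorStore()
store.add("img-001", "title", "Quarterly revenue chart")
store.add("img-001", "content", "Bar chart comparing revenue by region")  # e.g. caption or OCR text
print(len(store.rows), len(store.rows[0][2]))  # 2 8
```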
According to the technical solution provided by the invention, the metadata of the knowledge to be parsed is parsed and stored by a parsing engine, and the knowledge content is then parsed in different ways according to the file type recorded in the metadata. The method and apparatus can therefore parse various types of knowledge with high parsing speed and efficiency.
Referring to fig. 2, fig. 2 is a timing chart of a multi-modal knowledge analysis method according to an embodiment of the invention.
First, the client uploads knowledge to be parsed.
Then, the parsing engine parses the metadata of the knowledge to be parsed to obtain the document type, title, document size, author, version number, document classification, tags and so on. It automatically establishes association relations among the pieces of knowledge according to their authors and document types, and classifies and tags each piece of knowledge according to its title and author. A metadata index is created and stored in the full-text retrieval database, and the association relations between the metadata and each piece of knowledge are stored in a relational database.
Next, for document text, characters on the pictures within the document are recognized with an optical character recognition method, and the content of the document text, together with the document obtained after character recognition, is parsed by multi-level titles, page numbers and paragraphs. The correspondences among the multi-level title index, page-number index and paragraph index are established and stored in the full-text index; the content and titles of the pictures and audio/video in the document text are vectorized with a vectorization algorithm and stored in a vectorization database; and correspondences between the multi-level title, page-number and paragraph indexes and the metadata index are established.
Then, for pictures and audio/video text, the content and titles are vectorized with a vectorization algorithm, and the vectorized content and titles are stored in the vectorization database.
Therefore, the technical solution provided by the embodiment of the invention abstracts the metadata that different types of knowledge have in common and extracts multi-modal data according to the differences between knowledge types. Document content is extracted into structured data using pages, paragraphs and multi-level titles; OCR technology is used to recognize text information on pictures within documents; and a vectorization algorithm is applied to the content and titles of pictures and audio/video. Documents are classified and tagged through the knowledge metadata, association relations are established among the pieces of knowledge, and information links are created between otherwise isolated items of knowledge. Meanwhile, a relational database, a full-text index database and a vectorization database are used to store the structured data, making full use of the strengths of each storage tool to maintain the parsed data.
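Putting the pieces together, the following sketch shows one possible dispatch of a parsed piece of knowledge to the three storage back-ends named above; the store interfaces (save_metadata, index, add) and the payload layout are assumptions for illustration rather than the patented implementation.

```python
def dispatch_to_stores(doc_id, doc_type, payload, relational_db, fulltext_index, vector_db):
    """Route parsed results to the relational, full-text index, and vectorization databases."""
    # Metadata and association relations always go to the relational database.
    relational_db.save_metadata(doc_id, payload["metadata"])
    if doc_type == "document_text":
        # Heading/page/paragraph structure goes to the full-text index.
        fulltext_index.index(doc_id, payload["sections"])
        # Pictures and audio/video embedded in the document go to the vector database.
        for media in payload.get("embedded_media", []):
            vector_db.add(doc_id, media["title"], media["content"])
    else:  # the knowledge itself is a picture or an audio/video file
        vector_db.add(doc_id, payload["title"], payload["content"])
```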
Referring to fig. 3, fig. 3 is a block diagram of a multi-modal knowledge analysis device according to an embodiment of the present invention, where the device includes:
a receiving module 31, configured to receive knowledge to be parsed;
the metadata parsing module 32 is configured to parse and store metadata of the knowledge to be parsed by using a parsing engine, where the metadata includes at least document types, and the document types include document text and pictures and audio/video text;
a document content parsing module 33, configured to parse and store the document text content;
the picture and audio/video content parsing module 34 is configured to parse and store the picture and audio/video text.
In some embodiments, the metadata further includes a title, a document size, an author, a version number, a document classification, a tag.
In some embodiments, metadata parsing module 32 may include:
the association relationship unit 321 is configured to automatically establish an association relationship of each knowledge by the parsing engine according to the author of the knowledge and the document type;
a classification unit 322, configured to classify and tag the documents of the knowledge according to the title and author of the knowledge by the parsing engine;
and a storage unit 323, configured to create a metadata index and store the metadata and the association relationship of each knowledge in a relational database.
In some embodiments, the document content parsing module 33 may include:
a character recognition unit 331, configured to recognize characters on a picture in the document text by using an optical character recognition method;
a content parsing unit 332, configured to parse, by multi-level titles, page numbers and paragraphs, the content of the document text and the document obtained after character recognition;
a correspondence unit 333, configured to establish a correspondence between the multi-level title index and the page index, and between the multi-level title index and the paragraph index, and store the correspondence in the full-text index;
the vectorization unit 334 is configured to vectorize the content and titles of the pictures and audio/video in the document text by using a vectorization algorithm;
the storage unit 335 is configured to store the vectorized content and titles of the pictures and audio/video in the document text in the vectorization database.
In some embodiments, the correspondence unit 333 is further configured to establish a correspondence between the multi-level title index and the page index, and between the paragraph index and the metadata index.
In some embodiments, picture and audiovisual content parsing module 34 may include:
a vectorization unit 341, configured to perform vectorization processing on the content and the title of the picture and the audio/video text by using a vectorization algorithm;
the vectorization storage unit 342 is configured to store the content and the title of the vectorized picture and audio/video in the vectorization database.
Therefore, the technical solution provided by the embodiment of the invention abstracts the metadata that different types of knowledge have in common and extracts multi-modal data according to the differences between knowledge types. Document content is extracted into structured data using pages, paragraphs and multi-level titles; OCR technology is used to recognize text information on pictures within documents; and a vectorization algorithm is applied to the content and titles of pictures and audio/video. Documents are classified and tagged through the knowledge metadata, association relations are established among the pieces of knowledge, and information links are created between otherwise isolated items of knowledge. Meanwhile, a relational database, a full-text index database and a vectorization database are used to store the structured data, making full use of the strengths of each storage tool to maintain the parsed data.
It should be noted that, the multi-modal knowledge analysis device in the embodiment of the present invention and the multi-modal knowledge analysis method in the above embodiment belong to the same inventive concept, and technical details not described in the device may be referred to the related description of the method, which is not repeated herein.
Furthermore, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a computer program, and the computer program is configured to execute the method when running.
Furthermore, embodiments of the present invention provide a computer program product comprising a computer program which, when executed by a processor, enables the implementation of the method as described above.
Fig. 4 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM12 and the RAM13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), specialized artificial intelligence (AI) computing chips, processors running machine learning model algorithms, digital signal processors (DSPs), and any other suitable processor, controller or microcontroller. The processor 11 performs the various methods and processes described above, such as the multi-modal knowledge parsing method.
In some embodiments, the multi-modal knowledge parsing method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the multi-modal knowledge parsing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the multi-modal knowledge parsing method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and VPS services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (8)

1. A method of multimodal knowledge resolution, the method comprising:
receiving knowledge to be analyzed;
analyzing and storing metadata of the knowledge to be analyzed by utilizing an analysis engine, wherein the metadata at least comprises document types, and the document types comprise document texts, pictures and audio/video texts;
analyzing and storing the document text content, wherein the document content uses pages, paragraphs and multi-level titles to extract structured data;
analyzing and storing the pictures and the audio/video texts, wherein text information on the pictures in the document is recognized by using an OCR (optical character recognition) technology, and the content and titles of the pictures and the audio/video content are vectorized by using a vectorization algorithm;
the metadata also includes title, document size, author, version number, document classification, tag;
the parsing the metadata of the knowledge to be parsed by using the parsing engine and storing the metadata comprises the following steps:
the analysis engine automatically establishes the association relation of each knowledge according to the author of the knowledge and the document type;
the analysis engine classifies and tags the documents of each knowledge according to the titles and authors of the knowledge;
and creating a metadata index and storing the association relation of the metadata and the knowledge in a relational database.
2. The method of claim 1, wherein parsing and storing the document text content comprises:
identifying characters on pictures in the document text by using an optical character identification method;
analyzing the content in the document text and the document obtained after the text recognition by using the multi-level title, the page number and the paragraph;
establishing a corresponding relation between the multi-level title index and the page index and between the multi-level title index and the paragraph index, and storing the corresponding relation in the full text index;
vectorizing the pictures in the document text and the contents and titles of the audios and videos by using a vectorizing algorithm;
and storing the pictures, the audio and video contents and the titles in the document text after vectorization in a vectorization database.
3. The method according to claim 2, wherein the method further comprises: and establishing a corresponding relation between the multi-level title index and the page index, and between the paragraph index and the metadata index.
4. The method of claim 2, wherein establishing a multi-level title index comprises:
the title is divided into at least two levels of sub-titles, and each sub-title is expanded;
the upper level title and the current subtitle of each subtitle are used as title indexes.
5. The method of claim 1, wherein parsing and storing the picture and audiovisual text comprises:
vectorizing the contents and titles of the pictures and the audio/video texts by using a vectorization algorithm;
and storing the content and the title of the vectorized picture and the audio/video in a vectorized database.
6. An apparatus for multimodal knowledge resolution, the apparatus comprising:
the receiving module is used for receiving the knowledge to be analyzed;
the metadata analysis module is used for analyzing and storing metadata of the knowledge to be analyzed by utilizing an analysis engine, wherein the metadata at least comprises document types, and the document types comprise document texts, pictures and audio/video texts;
the document content analysis module is used for analyzing and storing the document text content, wherein the document content uses pages, paragraphs and multi-level titles to extract structured data;
the picture and audio/video content analysis module is used for analyzing and storing the picture and audio/video text, wherein text information on the pictures in the document is recognized by using OCR technology, and the content and titles of the pictures and the audio/video content are vectorized by using a vectorization algorithm;
the metadata also includes title, document size, author, version number, document classification, tag;
the metadata parsing module is specifically configured to:
the analysis engine automatically establishes the association relation of each knowledge according to the author of the knowledge and the document type;
the analysis engine classifies and tags the documents of each knowledge according to the titles and authors of the knowledge;
and creating a metadata index and storing the association relation of the metadata and the knowledge in a relational database.
7. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, wherein the computer program is arranged to execute the method of any of the claims 1 to 5 when run.
8. A computer device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 5.
CN202211625685.XA 2022-12-16 2022-12-16 Multi-mode knowledge analysis method, device, storage medium and equipment Active CN116029277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211625685.XA CN116029277B (en) 2022-12-16 2022-12-16 Multi-mode knowledge analysis method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN116029277A (en) 2023-04-28
CN116029277B (en) 2024-04-05

Family

ID=86076912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211625685.XA Active CN116029277B (en) 2022-12-16 2022-12-16 Multi-mode knowledge analysis method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN116029277B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012018847A2 (en) * 2010-08-02 2012-02-09 Cognika Corporation Cross media knowledge storage, management and information discovery and retrieval
CN103559185A (en) * 2013-08-13 2014-02-05 西安航天动力试验技术研究所 Method for parsing and storing test data documents
KR20200144417A (en) * 2019-06-18 2020-12-29 빅펄 주식회사 Multimodal content analysis system and method
CN113204621A (en) * 2021-05-12 2021-08-03 北京百度网讯科技有限公司 Document storage method, document retrieval method, device, equipment and storage medium
CN114329132A (en) * 2022-03-14 2022-04-12 南京云档信息科技有限公司 Archive element supplement and acquisition system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166303A1 (en) * 2009-11-13 2013-06-27 Adobe Systems Incorporated Accessing media data using metadata repository

Also Published As

Publication number Publication date
CN116029277A (en) 2023-04-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant