CN117763206A - Knowledge tree generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117763206A
CN117763206A
Authority
CN
China
Prior art keywords: knowledge, title, data, processed, text
Prior art date
Legal status (assumed; not a legal conclusion): Pending
Application number
CN202410186566.1A
Other languages
Chinese (zh)
Inventor
罗歆昱 (Luo Xinyu)
陈崇雨 (Chen Chongyu)
Current Assignee (listed assignee may be inaccurate)
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by DMAI Guangzhou Co Ltd
Priority to CN202410186566.1A
Publication of CN117763206A


Landscapes

  • Information Retrieval; Database Structures and File System Structures Therefor (AREA)

Abstract

The application provides a knowledge tree generation method, a knowledge tree generation device, an electronic device and a storage medium, wherein the knowledge tree generation method comprises the following steps: acquiring a data source to be processed, wherein the data source to be processed comprises data files in various file formats; performing file parsing processing on the data source to be processed to obtain a data text to be processed in a data exchange format; determining a primary title, a secondary title and the position information of the secondary title from the data text to be processed; performing title extraction on a data list to be processed, and generating a knowledge subtree list from the extracted titles, wherein the knowledge subtree list comprises a plurality of knowledge subtrees, and the title level of any title in each knowledge subtree is lower than the title level of the secondary title; and combining the primary title, the secondary title and the knowledge subtree list according to the position information of the secondary title to obtain a knowledge tree. The method and device are widely applicable and require no manual operation: they not only improve the user's operating efficiency but also integrate data files in various file formats into the knowledge tree accurately and rapidly.

Description

Knowledge tree generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of information technologies, and in particular, to a method and an apparatus for generating a knowledge tree, an electronic device, and a storage medium.
Background
A knowledge tree is an ordered, clear, hierarchical representation of knowledge. It is generally used to organize and manage large amounts of knowledge, making it convenient for users to browse and query for the information they need.
Traditional knowledge tree generation techniques mainly target text data: knowledge generally has to be classified and summarized manually, and a hierarchical structure is then built by hand to represent the relationships between different pieces of knowledge. However, with the growing number of multi-modal data sources (text, image, audio, video, etc.), traditional techniques cannot handle multi-modal data well, and how to integrate multi-modal data sources into a knowledge tree accurately and rapidly has become a problem to be solved.
Disclosure of Invention
In view of the above, an object of the present application is to provide a knowledge tree generation method, apparatus, electronic device, and storage medium that can generate a knowledge tree from to-be-processed data sources containing data files in multiple file formats. The approach is widely applicable and requires no manual operation, so it not only improves the user's operating efficiency but also integrates data files in multiple file formats into the knowledge tree accurately and rapidly.
In a first aspect, an embodiment of the present application provides a method for generating a knowledge tree, where the method includes:
acquiring a data source to be processed; wherein the data source to be processed comprises data files in a plurality of file formats;
carrying out file analysis processing on the data source to be processed to obtain a data text to be processed with a data exchange format;
determining a primary title, a secondary title and the position information of the secondary title from the data text to be processed; wherein the position information of the secondary title is determined from a to-be-processed data list obtained after the to-be-processed data text is segmented using the secondary title;
performing title extraction on the data list to be processed, and generating a knowledge subtree list from the extracted titles, wherein the knowledge subtree list comprises a plurality of knowledge subtrees, and the title level of any title in each knowledge subtree is lower than the title level of the secondary title;
and merging the primary title, the secondary title and the knowledge subtree list according to the position information of the secondary title to obtain a knowledge tree.
In an optional embodiment of the present application, the plurality of file formats includes a document format and/or a video format, and the performing file parsing on the to-be-processed data source to obtain to-be-processed data text with a data exchange format includes:
Analyzing the data files in the document format by utilizing resolvers corresponding to different document formats to obtain a first data text with a data exchange format, wherein the first data text comprises characters, font sizes of the characters and position information of the characters; or converting the data file in the document format into the data file in the picture format, and converting the data file in the picture format into a second data text in the data exchange format by utilizing a character recognition technology, wherein the second data text comprises text segments corresponding to a plurality of pictures, position information of the text segments, color information of the text segments and font size of the text segments;
extracting a picture to be processed from a data file aiming at the data file in the video format, and converting the picture to be processed into a third data text in the data exchange format by utilizing a character recognition technology; the third data text comprises text segments corresponding to the pictures to be processed, position information of the text segments, color information of the text segments and font sizes of the text segments.
In an optional embodiment of the present application, the extracting the to-be-processed picture from the data file for the data file in the video format, and converting the to-be-processed picture into the third data text in the data exchange format by using the character recognition technology includes:
Drawing frames of the data file in the video format according to a preset time interval to obtain a plurality of first pictures to be processed;
calculating first similarity between adjacent first pictures to be processed, and performing de-duplication on the first pictures to be processed, of which the first similarity is larger than a first preset similarity threshold value, so as to obtain second pictures to be processed, of which repeated pictures are not present;
performing character recognition on the second to-be-processed picture by using a character recognition technology, acquiring a character recognition result of the second to-be-processed picture, and filtering the second to-be-processed picture without characters to acquire a character recognition result of the remaining second to-be-processed picture;
and calculating second similarity between character recognition results of adjacent remaining second to-be-processed pictures, and performing de-duplication on the character recognition results of the remaining second to-be-processed pictures with the second similarity being greater than a second preset similarity threshold value to obtain a third data text with a data exchange format.
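The frame de-duplication described in the steps above can be sketched as follows. This is an illustrative assumption, not the patented implementation: frames are represented as flat lists of grayscale pixel values, the similarity measure is mean per-pixel agreement, each frame is compared against the most recently kept frame, and the 0.95 threshold stands in for the first preset similarity threshold.

```python
def frame_similarity(a, b):
    """Mean per-pixel agreement between two equal-length grayscale
    frames (pixel values 0..255); returns 1.0 for identical frames."""
    diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 - diff / 255.0

def dedup_adjacent(frames, threshold=0.95):
    """Drop each frame whose similarity to the previously kept frame
    exceeds the threshold (the de-duplication of step S102b2)."""
    kept = []
    for f in frames:
        if not kept or frame_similarity(kept[-1], f) <= threshold:
            kept.append(f)
    return kept
```

The same pattern applies to the second de-duplication pass over character recognition results, with a text-similarity function substituted for `frame_similarity`.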
In an optional embodiment of the present application, the data text to be processed includes at least one of a first data text, a second data text, and a third data text, and the determining a primary title, a secondary title, and the position information of the secondary title from the data text to be processed includes:
Determining a primary title and a secondary title according to the font size of characters in the first data text and the position information of the characters, and carrying out segmentation processing on the data text to be processed by utilizing the secondary title to obtain a data list to be processed; determining the position information of the secondary title according to the page number corresponding to the data list to be processed;
and/or determining a primary title and a secondary title according to color information of text segments corresponding to the plurality of pictures in the second data text, font size of the text segments and position information of the text segments, and carrying out segmentation processing on the data text to be processed to obtain a data list to be processed corresponding to the plurality of pictures, wherein the number of pages of the data list to be processed is the same as the number of the plurality of pictures; determining the position information of a secondary title according to a data list to be processed corresponding to a plurality of pictures; the second-level title determined from the second data text is obtained by performing duplicate removal processing on text segments corresponding to a plurality of pictures;
and/or determining a first-level title and a second-level title according to color information of text segments corresponding to the plurality of pictures to be processed in the third data text, font sizes of the text segments and position information of the text segments, and carrying out segmentation processing on the data text to be processed to obtain a data list to be processed corresponding to the plurality of pictures to be processed, wherein the number of pages of the data list to be processed is the same as the number of the pictures to be processed; and determining the position information of a secondary title according to a to-be-processed data list corresponding to the plurality of to-be-processed pictures, wherein the secondary title determined from the third data text is obtained by performing duplicate removal processing on text segments corresponding to the plurality of to-be-processed pictures.
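A minimal heuristic for the title determination described above might look like the following sketch. The field names (`font_size`, `x`, `width`), the centring tolerance, and the rule "largest centred span is the primary title, next size down gives secondary titles" are all assumptions for illustration; the patent also uses colour information, which this sketch omits.

```python
def classify_titles(spans, page_width=960):
    """spans: list of dicts with 'text', 'font_size', 'x', 'width'.
    Treat the largest, horizontally centred span as the primary title
    and spans of the next font size down as secondary titles."""
    sizes = sorted({s["font_size"] for s in spans}, reverse=True)
    primary, secondary = [], []
    for s in spans:
        # Centred if the span's midpoint is within 10% of the page centre.
        centred = abs((s["x"] + s["width"] / 2) - page_width / 2) < page_width * 0.1
        if s["font_size"] == sizes[0] and centred:
            primary.append(s["text"])
        elif len(sizes) > 1 and s["font_size"] == sizes[1]:
            secondary.append(s["text"])
    return primary, secondary
```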
In an optional embodiment of the present application, the extracting a title from the to-be-processed data list, generating a knowledge subtree list by using the extracted title includes:
inputting the first data text into a subtree generation model, and outputting a plurality of knowledge subtrees to form a knowledge subtree list;
and/or determining each level of title according to the position information of the text segment corresponding to the plurality of pictures included in the second data text, the color information of the text segment and the font size of the text segment, and generating a plurality of knowledge subtrees according to each level of title to form a knowledge subtree list; wherein each picture corresponds to a knowledge sub-tree;
and/or determining each level of title according to the position information of the text segment corresponding to the plurality of pictures to be processed, the color information of the text segment and the font size of the text segment, which are included in the third data text, and generating a plurality of knowledge subtrees according to each level of title to form a knowledge subtree list; wherein each picture to be processed corresponds to a knowledge sub-tree.
In an optional embodiment of the present application, the merging the primary title, the secondary title and the knowledge subtree list according to the location information of the secondary title to obtain a knowledge tree includes:
Inserting the secondary title into the knowledge subtree list according to the position information of the secondary title to obtain a knowledge subtree expansion list; wherein each secondary topic in the knowledge sub-tree expansion list contains all knowledge sub-trees ordered after the secondary topic and ordered before the next secondary topic;
calculating first semantic similarity between any two secondary titles, de-duplicating the secondary titles with the first semantic similarity being larger than a first preset semantic similarity threshold according to second semantic similarity between the secondary titles and a knowledge subtree contained in the secondary titles, and updating the knowledge subtree expansion list to obtain a first knowledge subtree expansion list;
traversing all knowledge subtrees contained in the second-level title aiming at each second-level title in the first knowledge subtree expansion list, and calculating a first association degree between the second-level title and a root node of each knowledge subtree contained in the second-level title;
calculating third semantic similarity between any two knowledge subtrees contained in each secondary title in the first knowledge subtree expansion list, de-duplicating the knowledge subtrees with the third semantic similarity larger than a second preset semantic similarity threshold according to the first association degree, and updating the first knowledge subtree expansion list to obtain a second knowledge subtree expansion list;
Traversing all knowledge subtrees contained in the secondary titles aiming at each secondary title in the second knowledge subtree expansion list, and calculating fourth semantic similarity between the root node of the target knowledge subtree and leaf nodes of other knowledge subtrees; the target knowledge subtree is any knowledge subtree contained in a secondary title in the second knowledge subtree expansion list, and the other knowledge subtrees are residual knowledge subtrees except the target knowledge subtree of all the knowledge subtrees contained in the secondary title;
adjusting the hierarchical relationship between the target knowledge subtree with the fourth semantic similarity larger than a third preset semantic similarity threshold and other knowledge subtrees to enable the target knowledge subtree to be combined with the other knowledge subtrees, and updating the second knowledge subtree expansion list to obtain a third knowledge subtree expansion list;
and generating a knowledge tree according to the first-level title and the third knowledge subtree expansion list.
In an optional embodiment of the present application, the first semantic similarity, the second semantic similarity, the third semantic similarity, and the fourth semantic similarity are all determined according to a weighted result between an editing distance and a cosine similarity;
The first association degree is determined according to the weighted result of the editing distance, cosine similarity and relative position compactness between the second-level title in the first knowledge subtree expansion list and the root node of each knowledge subtree contained in the second-level title, and the relative position compactness is determined according to the distance between each knowledge subtree contained in the second-level title in the first knowledge subtree expansion list and the second-level title.
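A weighted result between an edit distance and a cosine similarity, as used for the semantic similarities above, can be sketched as follows. The equal weighting `w=0.5`, the character-frequency vectors for the cosine term, and the normalisation of the edit distance are illustrative choices; the patent does not disclose concrete weights, and the relative position compactness term of the first association degree is not included here.

```python
from collections import Counter
from math import sqrt

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cosine_similarity(a, b):
    """Cosine similarity over character-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def semantic_similarity(a, b, w=0.5):
    """Weighted mix of normalised edit similarity and cosine similarity."""
    edit_sim = 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
    return w * edit_sim + (1 - w) * cosine_similarity(a, b)
```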
In a second aspect, an embodiment of the present application further provides a device for generating a knowledge tree, where the device includes:
the data acquisition module acquires a data source to be processed; wherein the data source to be processed comprises data files in a plurality of file formats;
the file analysis module is used for carrying out file analysis processing on the data source to be processed to obtain a data text to be processed with a data exchange format;
the title determining module is used for determining a primary title, a secondary title and position information of the secondary title from the data text to be processed; the position information of the secondary title is determined by a to-be-processed data list obtained after the secondary title is used for carrying out segmentation processing on the to-be-processed data text;
the subtree generation module is used for performing title extraction on the data list to be processed and generating a knowledge subtree list from the extracted titles, wherein the knowledge subtree list comprises a plurality of knowledge subtrees, and the title level of any title in each knowledge subtree is lower than the title level of the secondary title;
and the knowledge tree generation module is used for merging the primary title, the secondary title and the knowledge subtree list according to the position information of the secondary title to obtain a knowledge tree.
In a third aspect, embodiments of the present application further provide an electronic device, including: a processor, a memory and a bus, said memory storing machine readable instructions executable by said processor, said processor and said memory communicating via the bus when the electronic device is running, said machine readable instructions when executed by said processor performing the steps of the method of generating a knowledge tree as described above.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the knowledge tree generation method as described above.
The embodiment of the application provides a knowledge tree generation method, a knowledge tree generation device, an electronic device and a storage medium. A data source to be processed, containing data files in various file formats, is acquired first, and file parsing processing is performed on it to obtain a data text to be processed in a data exchange format; a primary title, a secondary title and the position information of the secondary title are then determined from the data text to be processed, the position information of the secondary title being determined from a to-be-processed data list obtained after the to-be-processed data text is segmented using the secondary title; titles are then extracted from the data list to be processed, and a knowledge subtree list comprising a plurality of knowledge subtrees is generated from the extracted titles, wherein the title level of any title in each knowledge subtree in the list is lower than the title level of the secondary title; finally, the primary title, the secondary title and the knowledge subtree list are merged according to the position information of the secondary title to obtain a knowledge tree.
Compared with prior-art knowledge tree generation methods, which mainly target text data and generally require knowledge to be classified and summarized manually before a hierarchical structure is built by hand to express the relations between different pieces of knowledge, the method of the present application can generate a knowledge tree from to-be-processed data sources containing data files in various file formats. It is widely applicable and requires no manual operation, so it can improve the user's operating efficiency and integrate data files in various file formats into the knowledge tree accurately and rapidly.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a knowledge tree generation method according to an embodiment of the present application;
Fig. 2 is a first schematic diagram of a video key frame in a PPT video according to an embodiment of the present application;
Fig. 3 is a second schematic diagram of a video key frame in a PPT video according to an embodiment of the present application;
Fig. 4 is a third schematic diagram of a video key frame in a PPT video according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a knowledge tree generation device according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. Based on the embodiments of the present application, every other embodiment that a person skilled in the art would obtain without making any inventive effort is within the scope of protection of the present application.
The knowledge tree is an orderly and clear knowledge representation mode with a hierarchical structure, is generally used for organizing and managing a large amount of knowledge, and is convenient for users to browse and inquire the required information. Traditional knowledge tree generation techniques are mainly directed to text data, and generally require manual classification and generalization of knowledge, and then manually build a hierarchical structure to represent relationships between different knowledge.
For example, in practical applications, because online courses are of a wide variety, not all courses have structured course schemas, and therefore, the outline of all courses cannot be obtained only by a web crawler, which poses a challenge to the construction of a knowledge tree; in addition, the knowledge points are combined only by using an incomplete character string matching technology, only character features are considered, semantic features are ignored, and the finally constructed knowledge tree has the problems of redundancy and inaccuracy.
Furthermore, with the increasing number of multi-modal data sources (such as text, image, audio, video, etc.), the conventional knowledge tree generation technology cannot well process multi-modal data, and how to accurately and rapidly integrate the multi-modal data sources into the knowledge tree becomes a problem to be solved.
Based on the above, the embodiment of the application provides a knowledge tree generation method, which not only can improve the operation efficiency of a user, but also can accurately and rapidly integrate data files with various file formats into the knowledge tree.
Referring to Fig. 1, which is a flowchart of a knowledge tree generation method according to an embodiment of the present application, the generation method provided in the embodiment includes:
S101, acquiring a data source to be processed; the data source to be processed comprises data files in various file formats;
S102, performing file parsing processing on the data source to be processed to obtain a data text to be processed in a data exchange format;
S103, determining a primary title, a secondary title and the position information of the secondary title from the data text to be processed; the position information of the secondary title is determined from a to-be-processed data list obtained after the to-be-processed data text is segmented using the secondary title;
S104, performing title extraction on the data list to be processed, and generating a knowledge subtree list from the extracted titles, wherein the knowledge subtree list comprises a plurality of knowledge subtrees, and the title level of any title in each knowledge subtree is lower than the title level of the secondary title;
S105, combining the primary title, the secondary title and the knowledge subtree list according to the position information of the secondary title to obtain a knowledge tree.
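The overall flow of S101 to S105 can be sketched as follows, under the simplifying assumption that parsing and title extraction have already reduced the data source to a flat list of `(level, title)` pairs, where level 1 is the primary title, level 2 a secondary title, and deeper levels belong to knowledge subtrees. The nested-dict tree representation is an illustrative choice, not the patented data structure.

```python
def generate_knowledge_tree(parsed_text):
    """Build a nested dict keyed by title from (level, title) pairs,
    attaching each title under the most recent title of a higher level."""
    tree = {}
    stack = []  # (level, node) path from the root to the current title
    for level, title in parsed_text:
        node = {}
        # Pop until the top of the stack is a strictly higher-level title.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1] if stack else tree
        parent[title] = node
        stack.append((level, node))
    return tree
```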
Through steps S101 to S105, file parsing processing can be performed on a to-be-processed data source containing data files in multiple file formats to obtain a to-be-processed data text in a data exchange format; a primary title, a secondary title and the position information of the secondary title can then be rapidly determined from the to-be-processed data text, the position information of the secondary title being determined from a to-be-processed data list obtained after the to-be-processed data text is segmented using the secondary title. On the basis of the data lists to be processed, titles are extracted from each data list to be processed, and a knowledge subtree list comprising a plurality of knowledge subtrees is generated from the extracted titles, wherein the title level of any title in each knowledge subtree in the list is lower than the title level of the secondary title. Finally, the primary title, the secondary title and the knowledge subtree list are merged according to the position information of the secondary title to obtain a knowledge tree. In this way, a knowledge tree can be generated from to-be-processed data sources containing data files in various file formats; the approach is widely applicable, requires no manual operation, improves the user's operating efficiency, and integrates data files in various file formats into the knowledge tree accurately and rapidly.
The following is an exemplary explanation of step S101 to step S105:
in step S101, a data source to be processed is acquired; wherein the data source to be processed comprises data files in a plurality of file formats.
Here, the plurality of file formats includes a document format, a picture format, and a video format. Exemplary data sources to be processed include, but are not limited to, Word documents, PDF documents, and PPT video recordings. The data sources to be processed can be obtained from different websites and platforms, and may relate, for example, to teaching-related or travel-related content; the specific application can be determined according to user requirements.
In step S102, file parsing processing is performed on the data source to be processed, so as to obtain a text of the data to be processed with a data exchange format.
Here, the purpose of the file parsing process is to convert the data source to be processed into a data text to be processed in a data exchange format. The data exchange format in the embodiment of the application is preferably JSON (JavaScript Object Notation): JSON is a lightweight data interchange format that is easy to parse and does not require writing large amounts of code, making it simple, efficient, and suitable for most data transmission needs.
Specifically, for the data sources to be processed with different formats, different file parsing modes can be selected to convert the data sources to be processed into the data text to be processed with the data exchange format. When the data text to be processed comprises a first data text or a second data text corresponding to the data file in the document format, a parser can be used for parsing the data file, for example, a word parser is adopted for a word document, and a pdf parser is selected for a pdf document; or converting the data file in the document format into the data file in the picture format, then analyzing the data file in the picture format by utilizing a character recognition technology, for example, converting the data file in the picture format into a picture for a PPT document, and then analyzing the data file in the picture format by utilizing an OCR technology (Optical Character Recognition); when the data text to be processed includes a third data text corresponding to the data file in the video format, the picture to be processed may be extracted from the data file first, and then the picture to be processed may be parsed by using a character recognition technique.
Illustratively, step S102 specifically includes:
step S102a, aiming at a data file in a document format, analyzing the data file in the document format by utilizing analyzers corresponding to different document formats to obtain a first data text in a data exchange format, wherein the first data text comprises characters, font sizes of the characters and position information of the characters; or converting the data file in the document format into the data file in the picture format, and converting the data file in the picture format into a second data text in the data exchange format by utilizing a character recognition technology, wherein the second data text comprises text segments corresponding to a plurality of pictures, position information of the text segments, color information of the text segments and font size of the text segments.
By way of example, the data file in the document format may include a Word document, a PDF document, and a PPT document, a Word parser may be used for the Word document, the Word document may be parsed into a first data text in the JSON format by the Word parser, or a PDF parser may be used for the PDF document, the PDF document may be parsed into a first data text in the JSON format by the PDF parser, and for the PPT document, the PPT document may be converted into a picture and then the picture may be parsed into a second data text in the JSON format by using OCR technology.
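The dispatch of a file to the parser matching its format, with the result serialised into the JSON data exchange format, can be sketched as follows. The function name, the extension-keyed registry, and the record layout are hypothetical illustrations; real parsers (a Word parser, a PDF parser, OCR over converted pictures) would be registered in place of the stubs.

```python
import json
import os

def parse_to_json(path, parsers):
    """Route a file to the parser registered for its extension and
    return the result serialised as a JSON data-exchange string."""
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    if ext not in parsers:
        raise ValueError(f"no parser registered for .{ext}")
    # ensure_ascii=False keeps non-ASCII characters (e.g. Chinese text)
    # readable in the serialised output.
    return json.dumps(parsers[ext](path), ensure_ascii=False)
```

A caller would register one parser per supported format, each returning characters (or text segments) together with font size, position, and colour information as described above.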
The first data text comprises characters, font sizes of the characters and position information of the characters, and the second data text comprises text segments corresponding to the pictures, position information of the text segments, color information of the text segments and font sizes of the text segments. Here, the primary title and the secondary title are extracted from the first data text according to the font size of the characters and the position information of the characters; the second data text comprises text segments corresponding to the plurality of pictures, position information of the text segments, color information of the text segments and font sizes of the text segments, and primary titles and secondary titles are extracted from the second data text according to the position information of the text segments, the color information of the text segments and the font sizes of the text segments.
Step S102b, extracting a picture to be processed from a data file aiming at the data file in the video format, and converting the picture to be processed into a third data text in the data exchange format by utilizing a character recognition technology; the third data text comprises text segments corresponding to the pictures to be processed, position information of the text segments, color information of the text segments and font sizes of the text segments.
Illustratively, the data files in the video format may include MP4 videos, AVI videos, and MOV videos, for example a recorded PPT presentation. For a data file in the video format, the pictures to be processed corresponding to the PPT key frames are extracted through frame extraction, filtering, de-duplication, and the like; the pictures to be processed are then parsed into a third data text in the JSON format by using OCR technology, and the position information, color information, and font size of each text segment are retained in the third data text.
The third data text comprises text segments corresponding to the plurality of pictures to be processed, position information of the text segments, color information of the text segments and font sizes of the text segments, and primary titles and secondary titles are extracted from the third data text according to the position information of the text segments, the color information of the text segments and the font sizes of the text segments.
In an alternative embodiment, step S102b specifically includes:
step S102b1, frame extraction is carried out on a data file in a video format according to a preset time interval, and a plurality of first pictures to be processed are obtained;
step S102b2, calculating first similarity between adjacent first pictures to be processed, and performing de-duplication on the first pictures to be processed, of which the first similarity is greater than a first preset similarity threshold value, to obtain second pictures to be processed, of which repeated pictures are not present;
step S102b3, performing character recognition on the second to-be-processed picture by utilizing a character recognition technology, obtaining a character recognition result of the second to-be-processed picture, and filtering the second to-be-processed picture without characters to obtain a character recognition result of the remaining second to-be-processed picture;
step S102b4, calculating a second similarity between character recognition results of adjacent remaining second to-be-processed pictures, and performing de-duplication on the character recognition results of the remaining second to-be-processed pictures with the second similarity being greater than a second preset similarity threshold value to obtain a third data text with a data exchange format.
In the steps S102b1 to S102b4, a plurality of first pictures to be processed are extracted from the data file in the video format, and the extracted first pictures to be processed are subjected to a first filtering process, which mainly de-duplicates similar pictures, to obtain the second pictures to be processed. The second pictures to be processed are then recognized by using a character recognition technology, and the character recognition results are subjected to a second filtering process, which mainly deletes pictures without characters, to obtain the character recognition results of the remaining second pictures to be processed. Finally, the character recognition results of the remaining second pictures to be processed are subjected to a third filtering process, which mainly deletes contents with similar character recognition results, to obtain the third data text in the data exchange format. It should be noted that the result obtained when the character recognition technology is applied to a second picture to be processed is already a data text in the data exchange format, and the third data text in the data exchange format is obtained through the three filtering processes. The third data text is thus guaranteed to contain neither near-duplicate pictures nor near-duplicate text contents.
Illustratively, in step S102b2, a hash algorithm may be used to calculate the first similarity between adjacent first pictures to be processed; in step S102b4, a cosine similarity method may be adopted to calculate the second similarity between the character recognition results of adjacent remaining second pictures to be processed. Here, the calculation methods of the first similarity and the second similarity are not particularly limited.
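The pipeline of steps S102b1 to S102b4 can be sketched as follows; the frame representation (a small grayscale grid plus an OCR stub) and the average-hash are illustrative assumptions, since the application does not fix a particular hash algorithm:

```python
from collections import Counter
import math

def ahash(pixels):
    # average hash: one bit per pixel, above/below the mean brightness
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hash_similarity(f1, f2):
    h1, h2 = ahash(f1["pixels"]), ahash(f2["pixels"])
    return sum(1 for x, y in zip(h1, h2) if x == y) / len(h1)

def cosine_similarity(t1, t2):
    c1, c2 = Counter(t1), Counter(t2)
    dot = sum(c1[ch] * c2[ch] for ch in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def dedup_frames(frames, ocr, sim1=0.9, sim2=0.85):
    # step S102b2: drop frames visually near-identical to the last kept frame
    kept = []
    for f in frames:
        if not kept or hash_similarity(f, kept[-1]) <= sim1:
            kept.append(f)
    # step S102b3: OCR each kept frame, drop frames without characters
    texts = [t for t in (ocr(f) for f in kept) if t.strip()]
    # step S102b4: drop OCR results near-identical to the last kept text
    out = []
    for t in texts:
        if not out or cosine_similarity(t, out[-1]) <= sim2:
            out.append(t)
    return out

frames = [
    {"pixels": [[0, 255], [0, 255]], "text": "Chapter 1 Intro"},
    {"pixels": [[0, 255], [0, 255]], "text": "Chapter 1 Intro"},  # duplicate frame
    {"pixels": [[255, 0], [255, 0]], "text": ""},                 # no characters
    {"pixels": [[255, 255], [0, 0]], "text": "Chapter 2 Detail"},
]
result = dedup_frames(frames, lambda f: f["text"])
# result == ["Chapter 1 Intro", "Chapter 2 Detail"]
```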
It should be added that when the data source to be processed includes a data file in a picture format, the data file in the picture format can be directly converted into a fourth data text in a data exchange format by utilizing a character recognition technology; the fourth data text includes text segments corresponding to the plurality of pictures, location information of the text segments, color information of the text segments, and font sizes of the text segments. And further, the primary title and the secondary title can be extracted from the fourth data text according to the position information of the text segment, the color information of the text segment and the font size of the text segment. By way of example, the data file in picture format may include JPG pictures, PNG pictures, and the like. Aiming at the JPG picture or the PNG picture, the JPG picture or the PNG picture can be directly analyzed into a data text in a JSON format by utilizing a character recognition technology.
In step S103, determining a primary title, a secondary title, and position information of the secondary title from the data text to be processed; the position information of the secondary title is determined by a to-be-processed data list obtained after the secondary title is used for carrying out segmentation processing on the to-be-processed data text.
Here, the primary title refers to the main title of the data text to be processed. Typically, the main title is located at the beginning of the data text to be processed, and its font size is larger than that of the body; it represents a summary of the theme of the data text to be processed and has the highest hierarchy in the data text. The secondary title is an explanatory decomposition of the primary title and belongs to the second level of the data text to be processed; it can generally be found near the beginning of the data text to be processed in the form of a directory, its font size is smaller than that of the primary title, it introduces the specific contents under the primary title, and its hierarchy in the data text is lower than that of the primary title.
Further, the position information of the secondary title is determined by using a to-be-processed data list obtained after the secondary title performs segmentation processing on the to-be-processed data text. Specifically, a secondary title is determined from the data text to be processed, segmentation is carried out on the data text to be processed according to the secondary title to obtain a data list to be processed under a plurality of secondary titles, and meanwhile position information of the secondary title is reserved.
In an alternative embodiment, step S103 specifically includes:
step S103a, determining a primary title and a secondary title according to the font size of characters in the first data text and the position information of the characters, and carrying out segmentation processing on the data text to be processed by utilizing the secondary title to obtain a data list to be processed; and determining the position information of the secondary title according to the page number corresponding to the data list to be processed.
Here, the higher the hierarchy of a title, the larger the font of its characters and the earlier the characters appear; the primary title and the secondary title can therefore be extracted from the first data text according to the font size of the characters and the position information of the characters in the first data text. The first data text is segmented by using the extracted secondary titles to obtain the data list to be processed corresponding to each secondary title; that is, the level of any title extracted from the data list to be processed corresponding to a secondary title is smaller than the level of that secondary title. In addition, the position information of a secondary title may be represented by the page number corresponding to its data list to be processed.
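A minimal sketch of the segmentation in step S103a, assuming the pages are plain text and the secondary titles have already been extracted; the matching rule (a secondary title appears verbatim in its page) is an assumption:

```python
def split_by_secondary_titles(pages, secondary_titles):
    """Split page texts into one pending-data list per secondary title,
    keeping the page number where each title first appears (position info)."""
    lists, positions, current = [], [], None
    for page_no, text in enumerate(pages, start=1):
        title = next((t for t in secondary_titles if t in text), None)
        if title is not None:
            current = []          # a secondary title opens a new data list
            lists.append(current)
            positions.append(page_no)
        if current is not None:
            current.append(text)
    return lists, positions

pages = ["1. Architecture ...", "layers ...", "2. Deployment ...", "steps ..."]
lists, positions = split_by_secondary_titles(
    pages, ["1. Architecture", "2. Deployment"])
# positions == [1, 3]; lists[0] == ["1. Architecture ...", "layers ..."]
```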
Step S103b, determining a primary title and a secondary title according to color information of text segments corresponding to a plurality of pictures in a second data text, font size of the text segments and position information of the text segments, and carrying out segmentation processing on the data text to be processed to obtain a data list to be processed corresponding to the plurality of pictures, wherein the number of pages of the data list to be processed is the same as the number of the pictures; determining the position information of a secondary title according to a data list to be processed corresponding to a plurality of pictures; the second-level title determined from the second data text is obtained by performing de-duplication processing on text segments corresponding to the plurality of pictures.
Here, the higher the hierarchy of a title, the larger the font of its text segment, the earlier the text segment appears, and the rarer the color of the text segment; the primary title and the secondary title may therefore be extracted from the second data text based on the color information, font size, and position information of the text segments in the second data text. The second data text is segmented by using the extracted secondary titles to obtain the data list to be processed corresponding to each secondary title; that is, the level of any title extracted from the data list to be processed corresponding to a secondary title is smaller than the level of that secondary title. In addition, the position information of a secondary title may be represented by the page number corresponding to its data list to be processed, where the number of pages of the data list to be processed corresponds to the number of pictures.
For the PPT document, the main title (primary title) is extracted from the text segments on the first few pages according to the position information, font size, and color information of the text segments; whether the text segments of each page contain keywords such as "chapter" or "contents" is then judged, and if so, the page is regarded as a directory page; the text segments of each directory page are gathered and de-duplicated to obtain the secondary titles, while the position information of the directory pages is retained. Here, both the primary title and the secondary titles act as the main branches of the finally generated knowledge tree.
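The directory-page detection and de-duplication described above can be sketched as follows; the keyword list and the per-page list-of-segments representation are assumptions:

```python
def extract_secondary_titles(ocr_pages, keywords=("Contents", "Chapter")):
    """Collect entries from pages whose text contains a directory/chapter
    keyword, de-duplicate them in order, and keep the directory page numbers."""
    titles, directory_pages, seen = [], [], set()
    for page_no, segments in enumerate(ocr_pages, start=1):
        if any(k in seg for seg in segments for k in keywords):
            directory_pages.append(page_no)
            for seg in segments:
                if any(k in seg for k in keywords):
                    continue  # skip the "Contents"/"Chapter" marker itself
                if seg not in seen:
                    seen.add(seg)
                    titles.append(seg)
    return titles, directory_pages

pages = [
    ["ABCDEFGXXXXX"],                        # title page, no keyword
    ["Contents", "01aBcDxxx", "02DEfxxx"],
    ["Contents", "02DEfxxx", "03AFhexx"],    # overlapping directory page
]
titles, directory_pages = extract_secondary_titles(pages)
# titles == ["01aBcDxxx", "02DEfxxx", "03AFhexx"]; directory_pages == [2, 3]
```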
Step S103c, determining a primary title and a secondary title according to color information of text segments corresponding to the multiple pictures to be processed in the third data text, font size of the text segments and position information of the text segments, and carrying out segmentation processing on the data text to be processed to obtain a data list to be processed corresponding to the multiple pictures to be processed, wherein the number of pages of the data list to be processed is the same as the number of the pictures to be processed; and determining the position information of the secondary title according to the to-be-processed data list corresponding to the plurality of to-be-processed pictures, wherein the secondary title determined from the third data text is obtained by performing duplicate removal processing on text segments corresponding to the plurality of to-be-processed pictures.
Here, the description of step S103c may refer to step S103b, and will not be repeated here.
In this embodiment, when the data text includes at least one of the first data text, the second data text and the third data text, the step S103a, the step S103b and the step S103c may be arbitrarily combined, so as to ensure that the position information of the primary title, the secondary title and the secondary title is accurately extracted from the data text to be processed converted from the data files with multiple formats, thereby ensuring that the title extraction is not omitted.
In step S104, the header extraction is performed on the data list to be processed, and a knowledge sub-tree list is generated using the extracted header, the knowledge sub-tree list includes a plurality of knowledge sub-trees, and the header level of any header in each knowledge sub-tree is smaller than the header level of the secondary header.
Here, a knowledge subtree list is generated using the titles extracted from the data lists to be processed; the knowledge subtrees are arranged in order to compose the knowledge subtree list. For example, if the first group of knowledge subtrees includes tree_1 and tree_2, the second group includes tree_3, tree_4, tree_5, and tree_6, and the third group includes tree_7, tree_8, and tree_9, then the knowledge subtree list is [tree_1, tree_2, tree_3, tree_4, tree_5, tree_6, tree_7, tree_8, tree_9].
In an alternative embodiment, step S104 specifically includes:
step S104a, inputting the first data text into the sub-tree generation model, and outputting a plurality of knowledge sub-trees to form a knowledge sub-tree list.
In an alternative embodiment, the subtree generation model may be a pre-trained neural network model, where the input of the subtree generation model is the first data text and the output is a plurality of knowledge subtrees; the first data text comprises a plurality of data lists to be processed, each data list to be processed correspondingly generates one knowledge subtree, and the plurality of knowledge subtrees form the knowledge subtree list. Preferably, the subtree generation model may be a GPT model (Generative Pre-trained Transformer model).
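A hedged sketch of how such a subtree generation model might be invoked; the prompt wording and the JSON reply shape are assumptions, since the application only states that the model maps a data list entry to a two-layer knowledge subtree:

```python
import json

def build_subtree_prompt(data_x):
    # assumed prompt; the application does not specify the model interface
    return (
        "Extract a two-layer outline from the following text. "
        'Reply as JSON: {"root": "...", "leaves": ["...", ...]}\n\n' + data_x
    )

def parse_subtree_reply(reply):
    """Turn the model's assumed JSON reply into a two-layer knowledge subtree."""
    obj = json.loads(reply)
    return {"root": obj["root"],
            "children": [{"root": leaf, "children": []} for leaf in obj["leaves"]]}

# a hypothetical model reply for one data list entry
reply = '{"root": "Architecture", "leaves": ["Frontend", "Backend"]}'
tree = parse_subtree_reply(reply)
# tree["root"] == "Architecture"; the two leaves become child nodes
```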
Step S104b, determining each level of titles according to the position information of text segments corresponding to the pictures included in the second data text, the color information of the text segments and the font size of the text segments, and generating a plurality of knowledge subtrees according to each level of titles to form a knowledge subtree list; wherein each picture corresponds to a knowledge sub-tree.
Each level of title is extracted from the second data text in a similar manner to step S103b, and all the extracted titles are formed into a plurality of knowledge sub-trees. Wherein each picture corresponds to a knowledge sub-tree.
Step S104c, determining each level of titles according to the position information of text segments corresponding to the plurality of pictures to be processed, the color information of the text segments and the font size of the text segments, which are included in the third data text, and generating a plurality of knowledge subtrees according to each level of titles to form a knowledge subtree list; wherein each picture to be processed corresponds to a knowledge sub-tree.
Here, the description of step S104c may refer to step S104b, and will not be repeated here.
In this embodiment of the present application, when the data text includes at least one of the first data text, the second data text and the third data text, the step S104a, the step S104b and the step S104c may be arbitrarily combined, so as to ensure that a plurality of knowledge subtrees are generated from the data text to be processed converted from the data files in multiple formats, and further ensure that the generated knowledge subtrees are comprehensive.
The purpose of the above step S104 is to convert the unstructured data in the data list to be processed into clearly structured knowledge subtrees. Illustratively, the clearly structured knowledge subtrees may be represented by a list, such as [tree_1, tree_2, tree_3, ..., tree_n].
For example, the data list to be processed may be represented as [data_1, data_2, data_3, ..., data_n]. For each data_x in the data list to be processed: for a Word document or PDF document, the text segments in data_x are input into the subtree generation model, and the subtree generation model outputs a knowledge subtree tree_x with a two-layer structure; for a PPT document or PPT video, a knowledge subtree tree_x with a two-layer structure is extracted according to the position information, color information, and font size of the text segments in data_x. In the above manner, a clearly structured knowledge subtree list represented as [tree_1, tree_2, tree_3, ..., tree_n] can be obtained.
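The rule-based branch (PPT document or PPT video) of the example above can be sketched as follows; the segment fields and the largest-font-below-the-root heuristic are assumptions:

```python
def subtree_from_segments(segments):
    """segments: dicts with 'text', 'x', 'y', 'font_size', 'color'.
    The top-left segment becomes the subtree root; among the segments below
    it, those with the largest font become the leaves (color could be
    checked similarly), giving a two-layer knowledge subtree."""
    root = min(segments, key=lambda s: (s["y"], s["x"]))
    below = [s for s in segments if s["y"] > root["y"]]
    heading_size = max(s["font_size"] for s in below)
    leaves = [s["text"] for s in below if s["font_size"] == heading_size]
    return {"root": root["text"],
            "children": [{"root": t, "children": []} for t in leaves]}

# assumed OCR segments for one PPT page
segments = [
    {"text": "axcbxx", "x": 10, "y": 5,  "font_size": 24, "color": "#000"},
    {"text": "afhxxx", "x": 10, "y": 40, "font_size": 18, "color": "#c00"},
    {"text": "body",   "x": 10, "y": 50, "font_size": 12, "color": "#333"},
    {"text": "bfxxk",  "x": 10, "y": 80, "font_size": 18, "color": "#c00"},
]
tree = subtree_from_segments(segments)
# tree["root"] == "axcbxx"; leaves are the two large colored headings
```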
In step S105, the primary title, the secondary title and the knowledge subtree list are combined according to the location information of the secondary title to obtain a knowledge tree.
Specifically, in the embodiment of the present application, the second-level header may be inserted into the knowledge subtree list to obtain the knowledge subtree expansion list, and then the first-level header and the knowledge subtree expansion list are combined to obtain the knowledge tree.
In an alternative embodiment, step S105 specifically includes:
step S105a, inserting the secondary title into a knowledge subtree list according to the position information of the secondary title to obtain a knowledge subtree expansion list; wherein each secondary topic in the knowledge sub-tree expansion list contains all knowledge sub-trees ordered after the secondary topic and ordered before the next secondary topic;
step S105b, calculating first semantic similarity between any two secondary titles, de-duplicating the secondary titles with the first semantic similarity larger than a first preset semantic similarity threshold according to second semantic similarity between the secondary titles and knowledge subtrees contained in the secondary titles, and updating the knowledge subtrees expansion list to obtain a first knowledge subtrees expansion list;
step S105c, traversing all knowledge subtrees contained in the second-level title aiming at each second-level title in the first knowledge subtree expansion list, and calculating a first association degree between the second-level title and a root node of each knowledge subtree contained in the second-level title;
step S105d, calculating third semantic similarity between any two knowledge subtrees contained in each secondary title in the first knowledge subtree expansion list, de-duplicating the knowledge subtrees with the third semantic similarity being greater than a second preset semantic similarity threshold according to the first association degree, and updating the first knowledge subtree expansion list to obtain a second knowledge subtree expansion list;
Step S105e, traversing all knowledge subtrees contained in each secondary title in the second knowledge subtree expansion list, and calculating fourth semantic similarity between the root node of the target knowledge subtree and leaf nodes of other knowledge subtrees; the target knowledge subtree is any knowledge subtree contained in the secondary title in the second knowledge subtree expansion list, and other knowledge subtrees are the rest knowledge subtrees except the target knowledge subtree of all the knowledge subtrees contained in the secondary title;
step S105f, adjusting the hierarchical relationship between the target knowledge subtree with the fourth semantic similarity larger than the third preset semantic similarity threshold and other knowledge subtrees to enable the target knowledge subtree to be combined with the other knowledge subtrees, and updating the second knowledge subtree expansion list to obtain a third knowledge subtree expansion list;
step S105g, generating a knowledge tree according to the first-level title and the third knowledge sub-tree expansion list.
In the above steps S105a to S105g, a plurality of knowledge subtrees, primary titles and secondary titles are combined into one knowledge tree representing the data source to be processed. Specifically, each node in the main branch of the knowledge tree may be represented by a secondary header table_n. Table_n may be inserted into the knowledge subtree list according to the location information of the secondary title. Thus, the knowledge sub-tree list may be expanded into a knowledge sub-tree expanded list, such as [ table_1, tree_1, tree_2, table_2, tree_3, tree_4 ], where tree_1, tree_2 indicates that the two knowledge sub-trees are contained under the secondary header table_1.
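Step S105a can be sketched as follows, assuming each secondary title and each knowledge subtree carries the page number it came from; the (page_no, label) pair representation is an assumption:

```python
def expand_with_secondary_titles(subtrees, titles):
    """subtrees and titles are lists of (page_no, label) pairs; each
    secondary title table_n is placed before every knowledge subtree whose
    source page is at or after the title's directory position (step S105a)."""
    merged = sorted([(p, 0, lbl) for p, lbl in titles] +     # titles sort first
                    [(p, 1, lbl) for p, lbl in subtrees])    # on page ties
    return [lbl for _, _, lbl in merged]

expansion = expand_with_secondary_titles(
    [(1, "tree_1"), (2, "tree_2"), (4, "tree_3"), (10, "tree_4")],
    [(3, "table_1"), (9, "table_2")])
# expansion == ["tree_1", "tree_2", "table_1", "tree_3", "table_2", "tree_4"]
```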
Specifically, step S105b de-duplicates repeated secondary titles through the first semantic similarity between any two secondary titles. Steps S105c and S105d de-duplicate repeated knowledge subtrees through the first degree of association between a secondary title and the root node of each knowledge subtree contained under it, together with the third semantic similarity between any two knowledge subtrees contained under each secondary title in the first knowledge subtree expansion list. Steps S105e and S105f adjust the hierarchical relationship between the target knowledge subtree and the other knowledge subtrees through the fourth semantic similarity between the root node of the target knowledge subtree and the leaf nodes of the other knowledge subtrees. Through the above steps, all knowledge subtrees can be hung on the main branches of the knowledge tree, so that the knowledge tree is combined more accurately and compactly.
In an alternative embodiment, the first semantic similarity, the second semantic similarity, the third semantic similarity, and the fourth semantic similarity are all determined according to a weighted result between an edit distance and a cosine similarity. The first degree of association is determined according to a weighted result among the edit distance, the cosine similarity, and a relative position closeness between a secondary title in the first knowledge subtree expansion list and the root node of each knowledge subtree contained under that secondary title; the relative position closeness is determined according to the distance between the secondary title and each knowledge subtree contained under it.
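A sketch of the weighted similarity, assuming equal weights and a length-normalized edit similarity; the application states only that a weighted result of edit distance and cosine similarity is used, without fixing the weights:

```python
from collections import Counter
import math

def edit_distance(a, b):
    # one-row Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def cosine(a, b):
    # cosine similarity over character counts
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_similarity(a, b, w=0.5):
    # assumed equal weighting of normalized edit similarity and cosine
    edit_sim = 1 - edit_distance(a, b) / max(len(a), len(b), 1)
    return w * edit_sim + (1 - w) * cosine(a, b)
```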
For example, the relative position closeness may be represented by the ratio between the distance from a knowledge subtree to its secondary title and the total distance corresponding to all knowledge subtrees contained under that secondary title; in other words, the relative position closeness is negatively correlated with the ratio, and the knowledge subtree with the larger relative position closeness is retained. Assuming the first knowledge subtree expansion list obtained after step S105b is [table_1, tree_1, tree_2, table_2, tree_3, tree_4, tree_5, tree_6, table_3, tree_7, tree_8, tree_9, ..., table_n, ..., tree_m], the ratio corresponding to the relative position closeness between tree_1 and the secondary title table_1 may be expressed as 0.5 and that between tree_2 and table_1 as 1, so the relative position closeness between tree_1 and table_1 is larger than that between tree_2 and table_1; likewise, the ratio between tree_5 and table_2 may be expressed as 0.6 and that between tree_6 and table_2 as 0.8, so the relative position closeness between tree_5 and table_2 is larger than that between tree_6 and table_2.
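A sketch of the relative position closeness following the prose definition above (distance of a subtree from its secondary title divided by the total distance of all subtrees under that title); the worked ratios in the example do not pin down a single normalization, so the exact formula here is an assumption:

```python
def relative_position_ratios(expansion_list):
    """For each secondary title table_n in the expansion list, give every
    knowledge subtree under it the ratio distance / total-distance; a
    smaller ratio means a tighter relative position (negative correlation)."""
    groups, current = {}, None
    for label in expansion_list:
        if label.startswith("table_"):
            current = label
            groups[current] = []
        elif current is not None:
            groups[current].append(label)
    ratios = {}
    for table, trees in groups.items():
        total = sum(range(1, len(trees) + 1))  # total distance under this title
        for distance, tree in enumerate(trees, start=1):
            ratios[(table, tree)] = distance / total
    return ratios

ratios = relative_position_ratios(
    ["table_1", "tree_1", "tree_2",
     "table_2", "tree_3", "tree_4", "tree_5", "tree_6"])
# smaller ratio -> tighter: tree_1 is closer to table_1 than tree_2 is
```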
For example, the construction of a knowledge tree from a product-introduction PPT video recording in a certain field is described in detail below; some of the video key frames intercepted from the PPT video recording are shown in fig. 2 to fig. 4:
step A, a preset time interval of 3s is set, and frame extraction is performed on the PPT video recording to obtain 300 pictures in total; the first similarity between adjacent pictures is calculated by a hash method to de-duplicate the pictures, leaving 50 pictures in total; OCR recognition is performed on the de-duplicated pictures to obtain OCR recognition results; the OCR recognition results corresponding to pictures without characters are filtered out, the second similarity between OCR recognition results of adjacent pictures is calculated for de-duplication, and the OCR recognition results corresponding to 23 pictures are retained in order, yielding the data list to be processed [data_1, data_2, ..., data_23];
step B, the first 5 pictures of the video key frames intercepted from the PPT video recording are analyzed; the text segment with the largest font size is "ABCDEFGXXXXX", which is selected as the main title of the knowledge tree, namely the knowledge tree root node; the OCR recognition results corresponding to the 23 pictures are traversed, the 3 pictures whose text segments contain a directory or chapter are selected, and their text segments are extracted and de-duplicated to obtain the directory structure ["01aBcDxxx", "02 DEfxxx", "03 AFhexx", "04yYZkxxx"], with position information [3rd, 9th, 14th, and 18th];
Step C, for each data_x in the data list to be processed [data_1, data_2, data_3, ..., data_23], a knowledge subtree with a two-layer structure is generated, taking data_5 as an example: firstly, the text segment located at the upper-left corner, "axcbxx", is acquired as the primary title of the knowledge subtree; then the text segments below the primary title position are acquired, and the secondary titles ["afhxxx", "bfxxk", "efgxxx", "xyzxxx"] are selected according to the color information characteristics and font size characteristics of the text segments; the primary title "axcbxx" is taken as the root node of the knowledge subtree and the secondary titles ["afhxxx", "bfxxk", "efgxxx", "xyzxxx"] are taken as its leaf nodes to obtain tree_5; finally, the knowledge subtree list [tree_1, tree_2, ..., tree_23] is obtained according to the position order of the pictures;
step D, table_1, table_2, table_3, and table_4 are directory trees each containing only one root node (i.e., they are the secondary titles of the finally generated knowledge tree); specifically, table_1 is a directory tree with "01aBcDxxx" as its root node, table_2 with "02 DEfxxx" as its root node, table_3 with "03 AFhexx" as its root node, and table_4 with "04yYZkxxx" as its root node; according to the position relation of each table_n, table_1, table_2, table_3, and table_4 can be inserted into the knowledge subtree list as follows:
[tree_1, tree_2, table_1, tree_3, ..., tree_8, table_2, tree_9, ..., tree_13, table_3, tree_14, ..., tree_17, table_4, tree_18, ..., tree_23]; wherein tree_3 to tree_8 are contained under directory tree table_1; tree_9 to tree_13 are contained under directory tree table_2; tree_14 to tree_17 are contained under directory tree table_3; and tree_18 to tree_23 are contained under directory tree table_4;
for example, in step D, de-duplication may be performed by calculating the first semantic similarity between table_1, table_2, table_3, and table_4 pairwise; then, for table_1, tree_3 to tree_8 are traversed and the first degree of association between each root node and table_1 is calculated; for table_2, tree_9 to tree_13 are traversed and the first degree of association between each root node and table_2 is calculated; for table_3, tree_14 to tree_17 are traversed and the first degree of association between each root node and table_3 is calculated; for table_4, tree_18 to tree_23 are traversed and the first degree of association between each root node and table_4 is calculated. For repeated knowledge subtrees, the knowledge subtree with the highest first degree of association with its table_1, table_2, table_3, or table_4 is retained; that is, if tree_5 and tree_11 are repeated, but the first degree of association between tree_5 and table_1 is higher than the first degree of association between tree_11 and table_2, then tree_5 is retained and tree_11 is deleted. For table_1, all knowledge subtrees comprise tree_3 to tree_8; the fourth semantic similarity between the root node of tree_a and a leaf node of tree_b is calculated, and if the fourth semantic similarity value exceeds the threshold of 0.85, tree_a is deleted and the tree structure of tree_a is hung on the leaf node at the corresponding position of tree_b, where tree_a represents any one of tree_3 to tree_8, and tree_b represents another one of tree_3 to tree_8 different from tree_a. Thus, the combination of the knowledge subtrees is completed, and the knowledge tree is obtained.
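The de-duplication and hanging steps of this worked example can be sketched as follows; the tree representation and the callback signatures for the first degree of association and the fourth semantic similarity are assumptions:

```python
def try_hang(a, trees, similarity, threshold):
    # steps S105e/S105f: hang subtree a under a leaf of another subtree
    # whose leaf title matches a's root title closely enough
    for b in trees:
        if b is a:
            continue
        for leaf in b["children"]:
            if similarity(a["root"], leaf["root"]) > threshold:
                leaf["children"].extend(a["children"])
                return True
    return False

def merge_subtrees(groups, association, similarity, threshold=0.85):
    # steps S105c/S105d: among subtrees with the same root title, keep only
    # the copy with the highest first degree of association to its table_n
    best = {}
    for table, trees in groups.items():
        for tree in trees:
            score = association(table, tree)
            if tree["root"] not in best or score > best[tree["root"]][0]:
                best[tree["root"]] = (score, table, tree)
    pruned = {table: [] for table in groups}
    for _, table, tree in best.values():
        pruned[table].append(tree)
    for trees in pruned.values():
        for a in list(trees):
            if try_hang(a, trees, similarity, threshold):
                trees.remove(a)
    return pruned

groups = {
    "table_1": [
        {"root": "Intro", "children": [{"root": "Goals", "children": []}]},
        {"root": "Goals", "children": [{"root": "Scope", "children": []}]},
    ],
    "table_2": [
        {"root": "Intro", "children": [{"root": "Overview", "children": []}]},
    ],
}
assoc = lambda table, tree: {"table_1": 0.9, "table_2": 0.4}[table]
sim = lambda a, b: 1.0 if a == b else 0.0
merged = merge_subtrees(groups, assoc, sim)
# table_2's duplicate "Intro" subtree is deleted; the "Goals" subtree is
# hung under the "Goals" leaf of the surviving "Intro" subtree
```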
Compared with prior-art knowledge tree generation methods, which mainly target text data and generally require knowledge to be classified and generalized manually before a hierarchical structure expressing the relations among different pieces of knowledge is established by hand, the knowledge tree generation method provided by the embodiments of the application can more effectively mine and construct a knowledge tree for a data source to be processed that contains data files in various file formats, improving both the construction effect and the accuracy of knowledge tree generation for such data sources. In addition, the knowledge tree in the application organizes the data source to be processed into clear hierarchical levels, so that the specific position of each level of title of the data source to be processed within the knowledge tree can be traced; the knowledge tree can thus be presented more comprehensively, and a user can conveniently trace and search the source and position of a specific title.
Based on the same inventive concept, the embodiment of the present application further provides a knowledge tree generating device corresponding to the knowledge tree generating method, and since the principle of solving the problem by the device in the embodiment of the present application is similar to that of the generating method in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and the repetition is omitted.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a generating device according to an embodiment of the present application. As shown in fig. 5, the generating device 500 includes:
the data acquisition module 501 is configured to acquire a data source to be processed; wherein the data source to be processed comprises data files in a plurality of file formats;
the file analysis module 502 is configured to perform file analysis processing on the data source to be processed to obtain a text of the data to be processed with a data exchange format;
a title determining module 503, configured to determine a primary title, a secondary title, and position information of the secondary title from the data text to be processed; wherein the position information of the secondary title is determined from a to-be-processed data list obtained after segmenting the to-be-processed data text using the secondary title;
a subtree generation module 504, configured to perform title extraction on a data list to be processed, and generate a knowledge subtree list using the extracted titles, where the knowledge subtree list includes a plurality of knowledge subtrees, and a title level of any title in each knowledge subtree is smaller than a title level of the secondary title;
the knowledge tree generating module 505 is configured to combine the primary title, the secondary title, and the knowledge subtree list according to the location information of the secondary title, to obtain a knowledge tree.
In an alternative embodiment, the plurality of file formats includes a document format and/or a video format, and the file parsing module 502 is specifically configured to:
parsing, for the data files in the document format, the data files using parsers corresponding to the different document formats to obtain a first data text in the data exchange format, wherein the first data text comprises characters and the font size and position information of the characters; or converting the data file in the document format into a data file in a picture format, and converting the data file in the picture format into a second data text in the data exchange format using a character recognition technology, wherein the second data text comprises text segments corresponding to a plurality of pictures and the position information, color information, and font size of the text segments;
extracting, for the data file in the video format, pictures to be processed from the data file, and converting the pictures to be processed into a third data text in the data exchange format using a character recognition technology; the third data text comprises text segments corresponding to the pictures to be processed and the position information, color information, and font size of the text segments.
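As an illustrative sketch only (not the patent's actual implementation), the format-dispatch step described above can be expressed as a small parser registry that emits a JSON data-exchange text. The parser stubs, field names, and sample values below are hypothetical assumptions; a real system would plug in genuine DOCX/PDF parsers.

```python
import json
from pathlib import Path

def parse_docx(path):
    # Hypothetical stub: a real parser would read the document's runs
    # and report each character's font size and position.
    return [{"text": "Chapter 1 Trees", "font_size": 22, "position": [72, 90]}]

def parse_pdf(path):
    # Hypothetical stub standing in for a real PDF parser.
    return [{"text": "1.1 Overview", "font_size": 16, "position": [72, 140]}]

# One parser per document format, as described above.
PARSERS = {".docx": parse_docx, ".pdf": parse_pdf}

def to_exchange_text(path):
    """Route a data file to the parser for its format and return a
    JSON data-exchange text with characters, font sizes, positions."""
    parser = PARSERS.get(Path(path).suffix.lower())
    if parser is None:
        raise ValueError(f"unsupported file format: {path}")
    return json.dumps(parser(path), ensure_ascii=False)

records = json.loads(to_exchange_text("lecture.docx"))
```

Because every format is normalized into the same JSON shape, the downstream title-determination and subtree-generation steps can stay format-agnostic.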
In an alternative embodiment, the file parsing module 502 is specifically further configured to:
extracting frames from the data file in the video format at a preset time interval to obtain a plurality of first pictures to be processed;
calculating a first similarity between adjacent first pictures to be processed, and de-duplicating the first pictures to be processed whose first similarity is larger than a first preset similarity threshold, to obtain second pictures to be processed that contain no duplicates;
performing character recognition on the second pictures to be processed using a character recognition technology to obtain their character recognition results, and filtering out the second pictures to be processed that contain no characters, to obtain the character recognition results of the remaining second pictures to be processed;
and calculating a second similarity between the character recognition results of adjacent remaining second pictures to be processed, and de-duplicating the character recognition results whose second similarity is greater than a second preset similarity threshold, to obtain a third data text in the data exchange format.
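The two-stage de-duplication above (first on frames, then on OCR text) can be sketched as follows. The pixel-difference frame similarity and the `SequenceMatcher` text similarity are placeholder assumptions; the patent does not specify its similarity measures, and a production system might use perceptual hashing for frames instead.

```python
from difflib import SequenceMatcher

def frame_similarity(a, b):
    # Toy similarity on equal-length grayscale pixel vectors (0-255);
    # a real system might use perceptual hashes or histograms instead.
    diff = sum(abs(x - y) for x, y in zip(a, b)) / (255 * len(a))
    return 1.0 - diff

def dedup_adjacent(items, sim, threshold):
    """Keep an item only when its similarity to the last kept item
    does not exceed the threshold (adjacent de-duplication)."""
    kept = [items[0]]
    for item in items[1:]:
        if sim(kept[-1], item) <= threshold:
            kept.append(item)
    return kept

# Frames sampled from the video at a preset time interval (toy data):
# two near-identical dark frames, then one very different bright frame.
frames = [[10] * 4, [11] * 4, [200] * 4]
unique_frames = dedup_adjacent(frames, frame_similarity, 0.95)

# Hypothetical OCR results for the remaining frames; empty results
# (frames without characters) are filtered out, then the surviving
# texts are de-duplicated by string similarity.
ocr_results = ["Chapter 2 Binary Trees", ""]
texts = [t for t in ocr_results if t.strip()]
text_sim = lambda a, b: SequenceMatcher(None, a, b).ratio()
texts = dedup_adjacent(texts, text_sim, 0.9)
```

With the toy data, the second dark frame is dropped as a near-duplicate, the empty OCR result is filtered, and one text segment survives into the third data text.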
In an alternative embodiment, the data text includes at least one of a first data text, a second data text, and a third data text, and the title determining module 503 is specifically configured to:
determining a primary title and a secondary title according to the font size and position information of the characters in the first data text, and segmenting the data text to be processed using the secondary title to obtain a data list to be processed; and determining the position information of the secondary title according to the page numbers corresponding to the data list to be processed;
and/or determining a primary title and a secondary title according to the color information, font size, and position information of the text segments corresponding to the plurality of pictures in the second data text, and segmenting the data text to be processed to obtain a data list to be processed corresponding to the plurality of pictures, wherein the number of pages of the data list to be processed is the same as the number of pictures; and determining the position information of the secondary title according to the data list to be processed corresponding to the plurality of pictures; wherein the secondary title determined from the second data text is obtained by de-duplicating the text segments corresponding to the plurality of pictures;
and/or determining a primary title and a secondary title according to the color information, font size, and position information of the text segments corresponding to the plurality of pictures to be processed in the third data text, and segmenting the data text to be processed to obtain a data list to be processed corresponding to the plurality of pictures to be processed, wherein the number of pages of the data list to be processed is the same as the number of pictures to be processed; and determining the position information of the secondary title according to the data list to be processed corresponding to the plurality of pictures to be processed, wherein the secondary title determined from the third data text is obtained by de-duplicating the text segments corresponding to the plurality of pictures to be processed.
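A minimal sketch of title determination and segmentation follows. The font-size-only heuristic is an assumption for illustration; the patent also weighs position information (and, for picture-derived text, color), and the sample records are hypothetical.

```python
def classify_titles(records):
    """Toy heuristic: the largest font size marks the primary title and
    the second-largest marks secondary titles. This is an assumption;
    the patent's rule also uses position (and color) cues."""
    sizes = sorted({r["font_size"] for r in records}, reverse=True)
    primary = [r["text"] for r in records if r["font_size"] == sizes[0]]
    secondary = [r["text"] for r in records
                 if len(sizes) > 1 and r["font_size"] == sizes[1]]
    return primary, secondary

def segment_by_secondary(records, secondary):
    """Split the flat record stream at each secondary title; the
    segment index serves as that title's position information."""
    segments, positions = [], {}
    for r in records:
        if r["text"] in secondary:
            positions[r["text"]] = len(segments)
            segments.append([])
        elif segments:
            segments[-1].append(r["text"])
    return segments, positions

# Hypothetical records from a parsed data-exchange text.
records = [
    {"text": "Data Structures", "font_size": 26},
    {"text": "Binary Trees", "font_size": 20},
    {"text": "A binary tree has at most two children per node.", "font_size": 12},
    {"text": "Hash Tables", "font_size": 20},
]
primary, secondary = classify_titles(records)
segments, positions = segment_by_secondary(records, secondary)
```

Each segment in the resulting to-be-processed data list belongs to exactly one secondary title, which is what makes the later per-segment subtree generation and title tracing possible.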
In an alternative embodiment, the subtree generation module 504 is specifically configured to:
inputting the first data text into a subtree generation model, and outputting a plurality of knowledge subtrees to form a knowledge subtree list;
and/or determining each level of title according to the position information of the text segment corresponding to the plurality of pictures included in the second data text, the color information of the text segment and the font size of the text segment, and generating a plurality of knowledge subtrees according to each level of title to form a knowledge subtree list; wherein each picture corresponds to a knowledge sub-tree;
and/or determining each level of title according to the position information of the text segment corresponding to the plurality of pictures to be processed, the color information of the text segment and the font size of the text segment, which are included in the third data text, and generating a plurality of knowledge subtrees according to each level of title to form a knowledge subtree list; wherein each picture to be processed corresponds to a knowledge sub-tree.
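For the picture-derived paths above (one knowledge subtree per picture), the nesting of extracted titles into a subtree can be sketched as a stack walk over heading levels. The explicit integer levels below are a hypothetical stand-in for levels inferred from font size, position, and color; the model-based path for the first data text is not shown.

```python
def build_subtree(headings):
    """Build one knowledge subtree from a picture's ordered
    (level, title) pairs, nesting each title under the most recent
    title with a smaller (more senior) level."""
    level0, title0 = headings[0]
    root = {"title": title0, "children": []}
    stack = [(level0, root)]
    for level, title in headings[1:]:
        node = {"title": title, "children": []}
        # Pop back to the nearest ancestor with a smaller level.
        while len(stack) > 1 and stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root

# One picture's extracted titles; levels 3 and 4 are below the
# secondary-title level, as required of subtree titles.
slide = [(3, "Sorting"), (4, "Quick sort"), (4, "Merge sort")]
subtree = build_subtree(slide)
```

Running this per picture and collecting the results yields the knowledge subtree list consumed by the merging step.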
In an alternative embodiment, the knowledge tree generation module 505 is specifically configured to:
inserting the secondary title into the knowledge subtree list according to the position information of the secondary title to obtain a knowledge subtree expansion list; wherein each secondary title in the knowledge subtree expansion list contains all knowledge subtrees ordered after that secondary title and before the next secondary title;
calculating a first semantic similarity between any two secondary titles, de-duplicating the secondary titles whose first semantic similarity is larger than a first preset semantic similarity threshold according to a second semantic similarity between each such secondary title and the knowledge subtrees it contains, and updating the knowledge subtree expansion list to obtain a first knowledge subtree expansion list;
traversing all knowledge subtrees contained in the second-level title aiming at each second-level title in the first knowledge subtree expansion list, and calculating a first association degree between the second-level title and a root node of each knowledge subtree contained in the second-level title;
calculating third semantic similarity between any two knowledge subtrees contained in each secondary title in the first knowledge subtree expansion list, de-duplicating the knowledge subtrees with the third semantic similarity larger than a second preset semantic similarity threshold according to the first association degree, and updating the first knowledge subtree expansion list to obtain a second knowledge subtree expansion list;
traversing all knowledge subtrees contained in the secondary titles aiming at each secondary title in the second knowledge subtree expansion list, and calculating fourth semantic similarity between the root node of the target knowledge subtree and leaf nodes of other knowledge subtrees; the target knowledge subtree is any knowledge subtree contained in a secondary title in the second knowledge subtree expansion list, and the other knowledge subtrees are residual knowledge subtrees except the target knowledge subtree of all the knowledge subtrees contained in the secondary title;
adjusting the hierarchical relationship between a target knowledge subtree whose fourth semantic similarity is larger than a third preset semantic similarity threshold and the other knowledge subtrees, so that the target knowledge subtree is merged with the other knowledge subtrees, and updating the second knowledge subtree expansion list to obtain a third knowledge subtree expansion list;
and generating a knowledge tree according to the first-level title and the third knowledge subtree expansion list.
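The first merging step above, building the expansion list that assigns each knowledge subtree to the secondary title it follows, can be sketched as below. The later de-duplication and re-parenting steps, which depend on semantic similarity, are omitted; the position values and subtree names are hypothetical.

```python
def expand(titles_with_pos, subtrees_with_pos):
    """Build the knowledge subtree expansion list: each secondary
    title owns every subtree ordered after it and before the next
    secondary title, per the titles' position information."""
    titles = sorted(titles_with_pos)
    expansion = []
    for i, (pos, title) in enumerate(titles):
        end = titles[i + 1][0] if i + 1 < len(titles) else float("inf")
        owned = [tree for p, tree in subtrees_with_pos if pos < p < end]
        expansion.append((title, owned))
    return expansion

# (position, secondary title) and (position, subtree) pairs taken
# from the earlier segmentation step (toy data).
secondary_titles = [(0, "Trees"), (3, "Graphs")]
subtrees = [(1, "BST subtree"), (2, "AVL subtree"), (4, "BFS subtree")]
expansion = expand(secondary_titles, subtrees)
```

The expansion list is then refined in place by the similarity-driven de-duplication and hierarchy adjustments before being attached under the primary title.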
In an alternative embodiment, the first semantic similarity, the second semantic similarity, the third semantic similarity, and the fourth semantic similarity are determined according to weighted results between edit distance and cosine similarity;
the first association degree is determined according to the weighted result of the editing distance, cosine similarity and relative position compactness between the second-level title in the first knowledge subtree expansion list and the root node of each knowledge subtree contained in the second-level title, and the relative position compactness is determined according to the distance between each knowledge subtree contained in the second-level title in the first knowledge subtree expansion list and the second-level title.
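The weighted combination of edit distance and cosine similarity can be sketched as follows. The character-count cosine vectors and the 0.5 weight are assumptions for illustration; the patent discloses neither its weights nor the exact vectorization, and an embedding-based cosine would work the same way.

```python
from collections import Counter
from math import sqrt

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cosine(a, b):
    """Cosine similarity over character-count vectors (an assumption;
    embedding vectors could be substituted without other changes)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_similarity(a, b, w=0.5):
    """Weighted blend of a normalized edit-distance score and cosine
    similarity; the weight w=0.5 is a placeholder."""
    ed_score = 1 - edit_distance(a, b) / max(len(a), len(b), 1)
    return w * ed_score + (1 - w) * cosine(a, b)
```

The first association degree would add a third weighted term for relative position compactness, i.e. a score that decays with the distance between a subtree and its secondary title.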
Compared with prior-art knowledge tree generation methods, which mainly target text data and generally require knowledge to be classified and summarized manually before a hierarchical structure is built by hand to express the relations among different pieces of knowledge, the knowledge tree generation device provided by the embodiments of the present application can mine and construct a knowledge tree more effectively for a to-be-processed data source containing data files in multiple file formats, improving both the construction effect and the accuracy of knowledge tree generation for such a data source. The device is widely applicable and requires no manual operation; it can improve the operation efficiency of users and integrate data files in multiple file formats into a knowledge tree accurately and rapidly. In addition, the knowledge tree in the present application organizes the to-be-processed data source into a clear hierarchy, so that the specific position information of each level of title within the to-be-processed data source can be traced in the knowledge tree; the knowledge tree can thus be presented more comprehensively, and a user can conveniently trace and search for the source and position of a specific title.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device 600 includes a processor 601, a memory 602, and a bus 603.
The memory 602 stores machine-readable instructions executable by the processor 601. When the electronic device 600 is running, the processor 601 communicates with the memory 602 through the bus 603, and when the machine-readable instructions are executed by the processor 601, the steps of the knowledge tree generation method in the method embodiment shown in fig. 1 can be performed. For the specific implementation, refer to the method embodiment; details are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium having a computer program stored thereon. When the computer program is executed by a processor, the steps of the knowledge tree generation method in the method embodiment shown in fig. 1 can be performed. For the specific implementation, refer to the method embodiment; details are not repeated here.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present application, intended to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, or easily conceive of changes to them, or make equivalent substitutions for some of their technical features within the technical scope disclosed in the present application; such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and shall be included in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating a knowledge tree, the method comprising:
acquiring a data source to be processed; wherein the data source to be processed comprises data files in a plurality of file formats;
carrying out file analysis processing on the data source to be processed to obtain a data text to be processed with a data exchange format;
determining the position information of a primary title, a secondary title and the secondary title from the data text to be processed; the position information of the secondary title is determined by a to-be-processed data list obtained after the secondary title is used for carrying out segmentation processing on the to-be-processed data text;
performing title extraction on a data list to be processed, and generating a knowledge subtree list by using the extracted titles, wherein the knowledge subtree list comprises a plurality of knowledge subtrees, and the title level of any title in each knowledge subtree is smaller than the title level of the secondary title;
and merging the primary title, the secondary title and the knowledge subtree list according to the position information of the secondary title to obtain a knowledge tree.
2. The generating method according to claim 1, wherein the plurality of file formats include a document format and/or a video format, the performing file parsing processing on the to-be-processed data source to obtain to-be-processed data text with a data exchange format includes:
parsing, for the data files in the document format, the data files using parsers corresponding to the different document formats to obtain a first data text in the data exchange format, wherein the first data text comprises characters and the font size and position information of the characters; or converting the data file in the document format into a data file in a picture format, and converting the data file in the picture format into a second data text in the data exchange format using a character recognition technology, wherein the second data text comprises text segments corresponding to a plurality of pictures and the position information, color information, and font size of the text segments;
extracting, for the data file in the video format, pictures to be processed from the data file, and converting the pictures to be processed into a third data text in the data exchange format using a character recognition technology; wherein the third data text comprises text segments corresponding to the pictures to be processed and the position information, color information, and font size of the text segments.
3. The method according to claim 2, wherein the extracting the picture to be processed from the data file for the data file in the video format and converting the picture to be processed into the third data text in the data exchange format using the character recognition technology includes:
extracting frames from the data file in the video format at a preset time interval to obtain a plurality of first pictures to be processed;
calculating a first similarity between adjacent first pictures to be processed, and de-duplicating the first pictures to be processed whose first similarity is larger than a first preset similarity threshold, to obtain second pictures to be processed that contain no duplicates;
performing character recognition on the second pictures to be processed using a character recognition technology to obtain their character recognition results, and filtering out the second pictures to be processed that contain no characters, to obtain the character recognition results of the remaining second pictures to be processed;
and calculating a second similarity between the character recognition results of adjacent remaining second pictures to be processed, and de-duplicating the character recognition results whose second similarity is greater than a second preset similarity threshold, to obtain a third data text in the data exchange format.
4. The generating method according to claim 2, wherein the data text includes at least one of a first data text, a second data text, and a third data text, and the determining the position information of the primary title, the secondary title, and the secondary title from the data text to be processed includes:
Determining a primary title and a secondary title according to the font size of characters in the first data text and the position information of the characters, and carrying out segmentation processing on the data text to be processed by utilizing the secondary title to obtain a data list to be processed; determining the position information of the secondary title according to the page number corresponding to the data list to be processed;
and/or determining a primary title and a secondary title according to color information of text segments corresponding to the plurality of pictures in the second data text, font size of the text segments and position information of the text segments, and carrying out segmentation processing on the data text to be processed to obtain a data list to be processed corresponding to the plurality of pictures, wherein the number of pages of the data list to be processed is the same as the number of the plurality of pictures; determining the position information of a secondary title according to a data list to be processed corresponding to a plurality of pictures; the second-level title determined from the second data text is obtained by performing duplicate removal processing on text segments corresponding to a plurality of pictures;
and/or determining a first-level title and a second-level title according to color information of text segments corresponding to the plurality of pictures to be processed in the third data text, font sizes of the text segments and position information of the text segments, and carrying out segmentation processing on the data text to be processed to obtain a data list to be processed corresponding to the plurality of pictures to be processed, wherein the number of pages of the data list to be processed is the same as the number of the pictures to be processed; and determining the position information of a secondary title according to a to-be-processed data list corresponding to the plurality of to-be-processed pictures, wherein the secondary title determined from the third data text is obtained by performing duplicate removal processing on text segments corresponding to the plurality of to-be-processed pictures.
5. The method of generating as claimed in claim 4, wherein the performing title extraction on the list of data to be processed, generating the list of knowledge sub-trees using the extracted title, comprises:
inputting the first data text into a subtree generation model, and outputting a plurality of knowledge subtrees to form a knowledge subtree list;
and/or determining each level of title according to the position information of the text segment corresponding to the plurality of pictures included in the second data text, the color information of the text segment and the font size of the text segment, and generating a plurality of knowledge subtrees according to each level of title to form a knowledge subtree list; wherein each picture corresponds to a knowledge sub-tree;
and/or determining each level of title according to the position information of the text segment corresponding to the plurality of pictures to be processed, the color information of the text segment and the font size of the text segment, which are included in the third data text, and generating a plurality of knowledge subtrees according to each level of title to form a knowledge subtree list; wherein each picture to be processed corresponds to a knowledge sub-tree.
6. The method of generating as claimed in claim 5, wherein said merging the primary title, the secondary title and the knowledge sub-tree list according to the location information of the secondary title to obtain a knowledge tree comprises:
inserting the secondary title into the knowledge subtree list according to the position information of the secondary title to obtain a knowledge subtree expansion list; wherein each secondary title in the knowledge subtree expansion list contains all knowledge subtrees ordered after that secondary title and before the next secondary title;
calculating first semantic similarity between any two secondary titles, de-duplicating the secondary titles with the first semantic similarity being larger than a first preset semantic similarity threshold according to second semantic similarity between the secondary titles and a knowledge subtree contained in the secondary titles, and updating the knowledge subtree expansion list to obtain a first knowledge subtree expansion list;
traversing all knowledge subtrees contained in the second-level title aiming at each second-level title in the first knowledge subtree expansion list, and calculating a first association degree between the second-level title and a root node of each knowledge subtree contained in the second-level title;
calculating third semantic similarity between any two knowledge subtrees contained in each secondary title in the first knowledge subtree expansion list, de-duplicating the knowledge subtrees with the third semantic similarity larger than a second preset semantic similarity threshold according to the first association degree, and updating the first knowledge subtree expansion list to obtain a second knowledge subtree expansion list;
Traversing all knowledge subtrees contained in the secondary titles aiming at each secondary title in the second knowledge subtree expansion list, and calculating fourth semantic similarity between the root node of the target knowledge subtree and leaf nodes of other knowledge subtrees; the target knowledge subtree is any knowledge subtree contained in a secondary title in the second knowledge subtree expansion list, and the other knowledge subtrees are residual knowledge subtrees except the target knowledge subtree of all the knowledge subtrees contained in the secondary title;
adjusting the hierarchical relationship between the target knowledge subtree with the fourth semantic similarity larger than a third preset semantic similarity threshold and other knowledge subtrees to enable the target knowledge subtree to be combined with the other knowledge subtrees, and updating the second knowledge subtree expansion list to obtain a third knowledge subtree expansion list;
and generating a knowledge tree according to the first-level title and the third knowledge subtree expansion list.
7. The method of generating of claim 6, wherein the first semantic similarity, the second semantic similarity, the third semantic similarity, and the fourth semantic similarity are each determined according to a weighted result between an edit distance and a cosine similarity;
The first association degree is determined according to the weighted result of the editing distance, cosine similarity and relative position compactness between the second-level title in the first knowledge subtree expansion list and the root node of each knowledge subtree contained in the second-level title, and the relative position compactness is determined according to the distance between each knowledge subtree contained in the second-level title in the first knowledge subtree expansion list and the second-level title.
8. A knowledge tree generation apparatus, wherein the generation apparatus comprises:
a data acquisition module, configured to acquire a data source to be processed; wherein the data source to be processed comprises data files in a plurality of file formats;
the file analysis module is used for carrying out file analysis processing on the data source to be processed to obtain a data text to be processed with a data exchange format;
the title determining module is used for determining a primary title, a secondary title and position information of the secondary title from the data text to be processed; the position information of the secondary title is determined by a to-be-processed data list obtained after the secondary title is used for carrying out segmentation processing on the to-be-processed data text;
the subtree generation module is used for extracting the title of the data list to be processed, and generating a knowledge subtree list by using the extracted title, wherein the knowledge subtree list comprises a plurality of knowledge subtrees, and the title level of any title in each knowledge subtree is smaller than the title level of the secondary title;
And the knowledge tree generation module is used for merging the primary title, the secondary title and the knowledge subtree list according to the position information of the secondary title to obtain a knowledge tree.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1 to 7.
CN202410186566.1A 2024-02-20 2024-02-20 Knowledge tree generation method and device, electronic equipment and storage medium Pending CN117763206A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410186566.1A CN117763206A (en) 2024-02-20 2024-02-20 Knowledge tree generation method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN117763206A 2024-03-26

Family

ID=90310735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410186566.1A Pending CN117763206A (en) 2024-02-20 2024-02-20 Knowledge tree generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117763206A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20040254904A1 (en) * | 2001-01-03 | 2004-12-16 | Yoram Nelken | System and method for electronic communication management
EP1690198A1 (en) * | 2003-12-05 | 2006-08-16 | Edgenet, Inc. | A method and apparatus for database induction for creating frame based knowledge tree
CN107357765A (en) * | 2017-07-14 | 2017-11-17 | 北京神州泰岳软件股份有限公司 | Word document flaking method and device
CN111460083A (en) * | 2020-03-31 | 2020-07-28 | 北京百度网讯科技有限公司 | Document title tree construction method and device, electronic equipment and storage medium
CN112016273A (en) * | 2020-09-03 | 2020-12-01 | 平安科技(深圳)有限公司 | Document directory generation method and device, electronic equipment and readable storage medium
CN112231522A (en) * | 2020-09-24 | 2021-01-15 | 北京奥鹏远程教育中心有限公司 | Online course knowledge tree generation association method
CN113779235A (en) * | 2021-09-13 | 2021-12-10 | 北京市律典通科技有限公司 | Word document outline recognition processing method and device
CN114390331A (en) * | 2022-01-13 | 2022-04-22 | 徐州工业职业技术学院 | Video teaching method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination