CN113298914B - Knowledge chunk extraction method and device, electronic equipment and storage medium - Google Patents

Publication number
CN113298914B
Authority
CN
China
Prior art keywords
slide
page
title
knowledge
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110859647.XA
Other languages
Chinese (zh)
Other versions
CN113298914A (en)
Inventor
曹梦娣
刘俊辰
陈奇宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202110859647.XA
Publication of CN113298914A
Application granted
Publication of CN113298914B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Abstract

The invention discloses a knowledge chunk extraction method and device, electronic equipment and a storage medium. The method comprises: obtaining text information of each slide in a PPTX document, the text information comprising the text content of the text boxes in the slide, the positions of the text boxes and the font sizes in the text boxes; determining the knowledge category of each slide according to the text information; and performing element extraction on each slide based on the knowledge category to obtain the knowledge chunk of each slide. With this scheme, knowledge chunks can be extracted from the document using information such as the font format in the PPTX document, which makes the extraction more accurate.

Description

Knowledge chunk extraction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of file processing, in particular to a knowledge chunk extracting method and device, electronic equipment and a storage medium.
Background
An enterprise holds a large number of unstructured PPTX documents, such as product introductions, white papers, solutions, instruction documents and operation manuals. These documents contain a large amount of high-quality information; extracting it in the form of knowledge chunks makes it available for subsequent use.
However, most document mining methods extract knowledge chunks using only the plain text in PPTX documents and do not consider other document properties such as format, font size and text position, so the extracted knowledge chunks are not accurate enough.
Disclosure of Invention
In order to solve the related technical problems, embodiments of the present invention provide a knowledge block extraction method, apparatus, electronic device, and storage medium.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a knowledge chunk extracting method, which comprises the following steps:
acquiring text information of each page of slide in the PPTX document; the text information comprises the text content in a text box in the slide, the position of the text box and the font size in the text box;
determining the knowledge category of each page of slide according to the text information;
and performing element extraction on each page of slide based on the knowledge category to obtain a knowledge block of each page of slide.
In the above scheme, determining the knowledge category of each page of slide according to the text information includes:
determining titles and key points in each page of slide according to the text information;
judging whether each page of slide is a directory page or not by using the title and the key point;
in the case where the slide is not a catalog page, the title and key points are used to determine the knowledge category of the slide.
In the above scheme, determining the title and the key point in each page of the slide according to the text information includes:
taking the text boxes whose positions, as given in the text information, fall within a preset range as a first candidate set;
selecting, from the topmost text box in the first candidate set, the text content with the largest font whose word count is within a preset range as the title;
judging whether a preset phrase exists in the title or not;
when the title has a preset phrase, taking the text content with the largest font in other text boxes except the title in the slide as a key point;
and when the title does not have a preset phrase, if the font sizes of the text contents except the title in the slide are different, taking the text content with the largest font in the text contents except the title as a key point.
In the above scheme, determining whether each slide page is a directory page by using the title and the key point includes:
judging whether a preset phrase exists in the title or the key points of each slide; or judging whether each slide has a title, whether the number of key points is less than a preset first threshold, whether the word count of the key points is less than a preset second threshold, whether the word count of the text other than the title and the key points in the slide is less than a preset third threshold, and whether a second preset phrase exists in the key points and the text;
determining the slide to be a directory page when a preset phrase exists in the title or the key points, or when the slide has no title, the number of key points is less than the preset first threshold, the word count of the key points is less than the preset second threshold, the word count of the text other than the title and the key points is less than the preset third threshold, and the second preset phrase exists in the key points and the text.
In the above scheme, based on the knowledge category, performing element extraction on each page of slides, and acquiring the knowledge chunks of each page of slides includes:
judging whether each page of slide is a non-category slide or not according to the knowledge category;
when the slide is not a non-category slide, the slide is subjected to element extraction to acquire a knowledge block of the slide.
In the above scheme, the extracting elements from the slide and acquiring the knowledge block of the slide includes:
acquiring the element in the title of the slide, and taking the element in the title as the knowledge chunk of the slide;
when no element exists in the title of the slide, acquiring the element in the key points of the slide, and taking the element in the key points as the knowledge chunk of the slide;
when no element exists in the key points of the slide, acquiring the time element of a slide adjacent to the slide, and taking that time element as the knowledge chunk of the slide;
when the adjacent slide has no time element, acquiring the element in the slide file name and taking it as the knowledge chunk of the slide.
In the above scheme, acquiring the time element of the slide adjacent to the slide includes:
when the adjacent slide is a directory page, taking the element in the title of the adjacent slide as the time element of the adjacent slide;
when there is no element in the title of the adjacent slide, taking the element in the key points of the adjacent slide as the time element of the adjacent slide;
when there is no element in the key points of the adjacent slide, determining that the adjacent slide has no time element.
In the above scheme, acquiring the time element of the slide adjacent to the slide includes:
when the adjacent slide is not a directory page, taking the element in the title of the adjacent slide as the time element of the adjacent slide;
when there is no element in the title of the adjacent slide, determining that the adjacent slide has no time element.
In the above scheme, the method further comprises:
merging identical knowledge chunks on consecutive pages.
In the above scheme, the method further comprises:
and expanding the knowledge blocks by using the knowledge graph.
The embodiment of the invention also provides a knowledge chunk extracting device, which comprises:
the acquisition module is used for acquiring text information of each page of slide in the PPTX document; the text information comprises the text content in a text box in the slide, the position of the text box and the font size in the text box;
the determining module is used for determining the knowledge category of each page of slide according to the text information;
and the extraction module is used for extracting elements of each page of slide based on the knowledge category and acquiring the knowledge blocks of each page of slide.
An embodiment of the present invention further provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor is adapted to perform the steps of any of the methods described above when running the computer program.
The embodiment of the invention also provides a storage medium, wherein a computer program is stored in the storage medium, and when the computer program is executed by a processor, the steps of any one of the methods are realized.
The knowledge chunk extracting method, the knowledge chunk extracting device, the electronic equipment and the storage medium provided by the embodiment of the invention are used for acquiring text information of each page of slide in a PPTX document; the text information comprises the text content in a text box in the slide, the position of the text box and the font size in the text box; determining the knowledge category of each page of slide according to the text information; and performing element extraction on each page of slide based on the knowledge category to obtain a knowledge block of each page of slide. By adopting the scheme provided by the invention, the knowledge chunks in the document can be extracted by utilizing the information such as the font format and the like in the PPTX document, and the extraction is more accurate.
Drawings
FIG. 1 is a schematic flow chart of a knowledge block extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an extraction process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an extraction process according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a knowledge block extraction apparatus according to an embodiment of the present invention;
FIG. 5 is an internal structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The invention aims to mine knowledge chunks from unstructured PPTX documents. A piece of knowledge generally comprises two parts, an element and a category, for example: financial industry pain points (element: finance, category: industry pain points), or product cases of household appliances (element: household appliances, category: product cases). The embodiment of the invention uses not only the text information in the PPTX document but also format and font information, thereby extracting the knowledge chunks in the PPTX document more accurately. After extraction, a knowledge graph is used for expansion, so that the extracted knowledge chunks can be used in subsequent scenarios such as knowledge recommendation and knowledge search.
Specifically, an embodiment of the present invention provides a knowledge chunk extracting method, as shown in fig. 1, where the method includes:
step 101: acquiring text information of each page of slide in the PPTX document; the text information comprises the text content in a text box in the slide, the position of the text box and the font size in the text box;
step 102: determining the knowledge category of each page of slide according to the text information;
step 103: and performing element extraction on each page of slide based on the knowledge category to obtain a knowledge block of each page of slide.
Here, a PPTX parsing library, such as python-pptx, can be used to obtain the text information of each slide in the PPTX document.
Information such as the format, font size and text box position in the document helps locate important knowledge more accurately, so the knowledge chunks extracted by this embodiment are more accurate.
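As a minimal sketch of step 101, assuming python-pptx is installed, the text information of a slide could be collected roughly as follows. The names `TextBox`, `collect_text_boxes` and `load_text_info` are illustrative and do not come from the patent; text boxes whose runs carry no explicit font size are recorded with `None` here, which is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional

EMU_PER_CM = 360000  # English Metric Units per centimetre
EMU_PER_PT = 12700   # English Metric Units per point


@dataclass
class TextBox:
    text: str
    top_cm: float              # distance from the top of the slide
    left_cm: float             # distance from the left edge of the slide
    font_pt: Optional[float]   # largest explicit font size, None if unset


def collect_text_boxes(slide) -> List[TextBox]:
    """Record text, position and font size for every text box on a slide."""
    boxes = []
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        if shape.top is None or shape.left is None:
            continue  # position inherited from the layout; skipped in this sketch
        paragraphs = shape.text_frame.paragraphs
        text = "\n".join(
            "".join(run.text for run in para.runs) for para in paragraphs
        ).strip()
        if not text:
            continue
        sizes = [
            run.font.size / EMU_PER_PT
            for para in paragraphs
            for run in para.runs
            if run.font.size is not None
        ]
        boxes.append(TextBox(
            text=text,
            top_cm=shape.top / EMU_PER_CM,
            left_cm=shape.left / EMU_PER_CM,
            font_pt=max(sizes) if sizes else None,
        ))
    return boxes


def load_text_info(path: str) -> List[List[TextBox]]:
    """Return one list of TextBox records per slide of a PPTX file."""
    from pptx import Presentation  # third-party package: python-pptx
    return [collect_text_boxes(slide) for slide in Presentation(path).slides]
```

Calling `load_text_info("deck.pptx")` (a hypothetical file name) would yield one list of records per slide, which the later steps can consume.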
Further, in one embodiment, determining the knowledge category of each slide page based on the textual information includes:
determining titles and key points in each page of slide according to the text information;
judging whether each page of slide is a directory page or not by using the title and the key point;
in the case where the slide is not a catalog page, the title and key points are used to determine the knowledge category of the slide.
Since no knowledge category can be judged for a directory page, the present embodiment first determines whether each slide is a directory page by using the title and the key points, and determines the knowledge category of the slide by using the title and the key points only when the slide is not a directory page.
Further, in one embodiment, determining the title and the key point in each page of the slide according to the text information comprises:
taking the text boxes whose positions, as given in the text information, fall within a preset range as a first candidate set;
selecting, from the topmost text box in the first candidate set, the text content with the largest font whose word count is within a preset range as the title;
judging whether a preset phrase exists in the title or not;
when the title has a preset phrase, taking the text content with the largest font in other text boxes except the title in the slide as a key point;
and when the title does not have a preset phrase, if the font sizes of the text contents except the title in the slide are different, taking the text content with the largest font in the text contents except the title as a key point.
Specifically, a certain region at the upper left of the slide may be selected as the preset range. For example, the preset range may cover text boxes located within 2.2 cm of the top of the slide and within 5 cm of its left edge.
Since users habitually keep titles short, the text content with the largest font whose word count is within the preset range may be selected from the topmost text box in the first candidate set as the title. Here, the preset word-count range may be set to fewer than 50 words.
Specifically, in practice, the topmost text box may be selected from the first candidate set, and then the text content with the largest font in that text box may be selected as the title. The word count of the selected text content must be within the preset range (for example, within 50 words); when it falls outside the preset range, the selected text content is not taken as the title.
In practical application, the preset phrases may be set as required; for example, words such as "directory" and "outline" may be set as preset phrases.
Further, after the title and the key point are determined, the content of the text box in the slide besides the title and the key point can be determined as the body. Specifically, when the body is determined, the text contents may be sequentially spliced from top to bottom and from left to right of other text boxes in the slide, and the spliced contents are used as the body of the slide.
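The title, key point and body heuristics above could be sketched as follows. This is only an illustration under simplifying assumptions: each text box is treated as a single unit with one font size, words are counted by splitting on whitespace (for Chinese text a character count would be the natural choice), and the `Box` record, the `split_slide` function and the English preset phrases are hypothetical names rather than part of the patent.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PRESET_PHRASES = ("directory", "contents", "outline")  # assumed stand-ins


@dataclass
class Box:
    text: str
    top_cm: float
    left_cm: float
    font_pt: float


def split_slide(boxes: List[Box], max_top: float = 2.2, max_left: float = 5.0,
                max_title_words: int = 50) -> Tuple[Optional[str], List[str], str]:
    """Return (title, key_points, body) for one slide."""
    # First candidate set: boxes inside the upper-left preset range.
    candidates = [b for b in boxes
                  if b.top_cm <= max_top and b.left_cm <= max_left]
    title = None
    if candidates:
        topmost = min(candidates, key=lambda b: b.top_cm)
        if len(topmost.text.split()) < max_title_words:
            title = topmost
    rest = [b for b in boxes if b is not title]

    key_points: List[Box] = []
    if rest:
        has_phrase = title is not None and any(
            p in title.text.lower() for p in PRESET_PHRASES)
        sizes_differ = len({b.font_pt for b in rest}) > 1
        # Largest-font text becomes the key point if the title holds a
        # preset phrase, or if the remaining font sizes are not all equal.
        if has_phrase or sizes_differ:
            largest = max(b.font_pt for b in rest)
            key_points = [b for b in rest if b.font_pt == largest]

    # Body: everything else, spliced top to bottom, then left to right.
    key_ids = {id(b) for b in key_points}
    body_boxes = sorted((b for b in rest if id(b) not in key_ids),
                        key=lambda b: (b.top_cm, b.left_cm))
    return (title.text if title else None,
            [b.text for b in key_points],
            "\n".join(b.text for b in body_boxes))
```

The thresholds default to the example values given in the text (2.2 cm, 5 cm, 50 words) and would be tuned to the data in practice.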
After the title and the key point are determined, it is possible to determine whether each page of the slide is a directory page using the title and the key point.
Specifically, in one embodiment, determining whether each slide page is a directory page using the title and the key points comprises:
judging whether a preset phrase exists in the title or the key points of each slide; or judging whether each slide has a title, whether the number of key points is less than a preset first threshold, whether the word count of the key points is less than a preset second threshold, whether the word count of the text other than the title and the key points in the slide is less than a preset third threshold, and whether a second preset phrase exists in the key points and the text;
determining the slide to be a directory page when a preset phrase exists in the title or the key points, or when the slide has no title, the number of key points is less than the preset first threshold, the word count of the key points is less than the preset second threshold, the word count of the text other than the title and the key points is less than the preset third threshold, and the second preset phrase exists in the key points and the text.
Here, the preset phrase may be set as required; for example, words such as "directory" and "outline" may be set as the preset phrase. The words set here may be the same as or different from those set in the above embodiment. In addition, the second preset phrase may be set to words such as part/1st/2nd/3rd.
Specifically, when determining whether a slide is a directory page, the determination may be made according to the following rules:
1) if the title of the slide contains a preset phrase, the slide is a directory page;
2) if the slide has no title but a preset phrase exists in the key points, the slide is a directory page;
3) if the slide has no title, the number of key points is small (e.g., fewer than 3), the word counts of the key points and the text are small (e.g., fewer than 10 words in total for the key points and fewer than 25 words in total for the text), and the second preset phrase appears in the key points or the text, the slide is a directory page.
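The three rules above can be written down as a small predicate. The sketch below is illustrative only: the function name `is_directory_page`, the English phrase lists and the whitespace-based word count are assumptions, and the plain substring matching is deliberately crude (for instance, "part" would also match "department").

```python
from typing import List, Optional

PRESET_PHRASES = ("directory", "contents", "outline")  # assumed stand-ins
SECOND_PHRASES = ("part", "1st", "2nd", "3rd")


def _contains(phrases, text: str) -> bool:
    low = text.lower()
    return any(p in low for p in phrases)


def _n_words(text: str) -> int:
    # Whitespace word count; a character count may suit Chinese text better.
    return len(text.split())


def is_directory_page(title: Optional[str], key_points: List[str], body: str,
                      max_points: int = 3, max_point_words: int = 10,
                      max_body_words: int = 25) -> bool:
    # Rule 1: a title exists and contains a preset phrase.
    if title:
        return _contains(PRESET_PHRASES, title)
    # Rule 2: no title, but a key point contains a preset phrase.
    if any(_contains(PRESET_PHRASES, kp) for kp in key_points):
        return True
    # Rule 3: no title, sparse content, and a second preset phrase appears.
    sparse = (len(key_points) < max_points
              and sum(_n_words(kp) for kp in key_points) < max_point_words
              and _n_words(body) < max_body_words)
    has_second = (_contains(SECOND_PHRASES, body)
                  or any(_contains(SECOND_PHRASES, kp) for kp in key_points))
    return sparse and has_second
```

The default thresholds mirror the example values in rule 3 and are expected to be adjusted to the data at hand.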
Further, a classification model may be used to determine whether each slide is a directory page. Specifically, the classification model can be obtained by converting each slide into an image format, manually labeling part of the data, and training on the labeled slides. The trained classification model can then be used to determine whether a slide is a directory page.
Further, after the directory pages are identified, the knowledge category is judged for the remaining slides. Specifically, the titles and key points of the slides other than directory pages can be fed into a knowledge category classifier to judge the knowledge category. The knowledge category may be no category, industry pain point, product introduction, product advantage, product function, product case, product architecture and the like; the specific categories can be determined according to the data or business requirements. No knowledge category is determined for directory pages.
Here, the knowledge class classifier can be obtained by training through an available labeled data set. The trained knowledge class classifier may be used to make a determination of the knowledge class.
Of course, when a large amount of labeled data is not available, a keyword-based method may be used to determine the knowledge category instead of a knowledge category classifier. After data exploration, a batch of keywords is manually determined for judging the knowledge category. For example, a slide containing words such as product summary/brief/introduction/positioning may be determined to be a product brief.
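A keyword-based judgment of this kind might look like the sketch below. Only the first keyword row reflects the example in the text; the remaining keywords, the category labels and the function name `classify_slide` are hypothetical and would in practice come from data exploration.

```python
from typing import List, Optional

# Category -> keywords. Earlier entries win when several categories match.
CATEGORY_KEYWORDS = {
    "product brief": ("product summary", "brief", "introduction", "positioning"),
    "industry pain point": ("pain point", "challenge"),   # hypothetical
    "product advantage": ("advantage", "strength"),       # hypothetical
    "product case": ("case",),                            # hypothetical
}


def classify_slide(title: Optional[str], key_points: List[str]) -> str:
    """Return the first category whose keyword occurs in the title or key points."""
    text = " ".join([title or "", *key_points]).lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "no category"
```

Because the first matching category is returned, the order of the dictionary acts as a priority list; a trained classifier would replace this function when labeled data is available.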
After determining the knowledge category of each slide page, in one embodiment, performing element extraction on each slide page based on the knowledge category, and acquiring the knowledge chunk of each slide page comprises:
judging whether each page of slide is a non-category slide or not according to the knowledge category;
when the slide is not a non-category slide, the slide is subjected to element extraction to acquire a knowledge block of the slide.
Elements are extracted only from slides that have a knowledge category. When a slide is a non-category slide, element extraction is not performed; when it is not a non-category slide, element extraction (i.e., entity recognition, which identifies the entities in a given input text) is performed to obtain the knowledge chunk of the slide. For example, when the knowledge category is product introduction, product advantage, product architecture or the like, the element to be extracted is a specific product or project; when the knowledge category is industry pain point, the element to be extracted is a specific industry or business.
Further, in one embodiment, the extracting the elements from the slide and the obtaining the knowledge blocks of the slide includes:
acquiring the element in the title of the slide, and taking the element in the title as the knowledge chunk of the slide;
when no element exists in the title of the slide, acquiring the element in the key points of the slide, and taking the element in the key points as the knowledge chunk of the slide;
when no element exists in the key points of the slide, acquiring the time element of a slide adjacent to the slide, and taking that time element as the knowledge chunk of the slide;
when the adjacent slide has no time element, acquiring the element in the slide file name and taking it as the knowledge chunk of the slide.
In actual extraction, when a labeled data set is available, a model trained on that data set may be used to perform element extraction, for example a sequence labeling model such as CRF, LSTM+CRF or BERT+CRF. When no labeled data set is available, elements can be extracted using a dictionary-based approach, for example by collecting and organizing the various types of entities to be extracted into a dictionary and performing element extraction by dictionary matching.
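The dictionary-based variant could be sketched as follows. The dictionary contents, the mapping from knowledge category to entity type and the function name `extract_elements` are all assumptions made for illustration; the surface forms map to a canonical element name so that, for example, "financial" yields the element "finance".

```python
from typing import Dict, List

# Knowledge category -> type of entity to look for (illustrative mapping).
CATEGORY_TO_TYPE: Dict[str, str] = {
    "product introduction": "product",
    "product advantage": "product",
    "product architecture": "product",
    "product case": "product",
    "industry pain point": "industry",
}

# Entity type -> {surface form: canonical element}. Hypothetical entries.
ENTITY_DICT: Dict[str, Dict[str, str]] = {
    "product": {"household appliances": "household appliances"},
    "industry": {"financial": "finance", "finance": "finance",
                 "retail": "retail"},
}


def extract_elements(text: str, category: str) -> List[str]:
    """Return canonical elements of the category's entity type found in text."""
    entity_type = CATEGORY_TO_TYPE.get(category)
    if entity_type is None:
        return []  # non-category slides are not processed
    low = text.lower()
    found: List[str] = []
    for surface, canonical in ENTITY_DICT[entity_type].items():
        if surface in low and canonical not in found:
            found.append(canonical)
    return found
```

A trained sequence labeling model would expose the same interface, taking text and returning the recognized entities.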
Specifically, when determining the adjacent slide, the slide closest to the current slide may be taken as the adjacent slide; when the closest slide has no time element, the second-closest slide is taken as the adjacent slide; when the second-closest slide has no time element, the third-closest slide is taken as the adjacent slide, and so on until the limit of a set distance range is reached. Here, the distance range may be set to 10 pages from the current slide.
Further, in one embodiment, obtaining the time elements of the slides adjacent to the slide comprises:
when the adjacent slide is a directory page, taking the element in the title of the adjacent slide as the time element of the adjacent slide;
when there is no element in the title of the adjacent slide, taking the element in the key points of the adjacent slide as the time element of the adjacent slide;
when there is no element in the key points of the adjacent slide, determining that the adjacent slide has no time element.
Meanwhile, in an embodiment, acquiring the time element of the slide adjacent to the slide further includes:
when the adjacent slide is not a directory page, taking the element in the title of the adjacent slide as the time element of the adjacent slide;
when there is no element in the title of the adjacent slide, determining that the adjacent slide has no time element.
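The fallback order described above (title, then key points, then adjacent slides, then file name) could be sketched as follows. The `Slide` record, the function names and the `extract` callback are hypothetical; checking the preceding slide before the following one at equal distance is an assumption, since the text does not say how ties are broken.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Extractor = Callable[[str], List[str]]  # text -> elements found in it


@dataclass
class Slide:
    title: Optional[str] = None
    key_points: List[str] = field(default_factory=list)
    is_directory: bool = False


def neighbour_elements(slide: Slide, extract: Extractor) -> List[str]:
    """Elements contributed by an adjacent slide (the 'time element')."""
    found = extract(slide.title or "")
    # Key points are consulted only when the neighbour is a directory page.
    if not found and slide.is_directory:
        found = extract(" ".join(slide.key_points))
    return found


def slide_elements(slides: List[Slide], index: int, file_name: str,
                   extract: Extractor, max_distance: int = 10) -> List[str]:
    current = slides[index]
    found = (extract(current.title or "")
             or extract(" ".join(current.key_points)))
    if found:
        return found
    # Walk outwards: closest slides first, up to max_distance pages away.
    for distance in range(1, max_distance + 1):
        for j in (index - distance, index + distance):
            if 0 <= j < len(slides):
                found = neighbour_elements(slides[j], extract)
                if found:
                    return found
    # Last resort: elements in the file name.
    return extract(file_name)
```

Any element extractor, dictionary-based or model-based, can be passed in as `extract`.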
In addition, in practice, identical knowledge chunks on consecutive pages may be merged (i.e., the start and end page numbers of the knowledge chunk are modified). For example, three identical knowledge chunks covering page 3 only, page 4 only and page 5 only may be merged into one knowledge chunk starting at page 3 and ending at page 5.
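The merging of identical knowledge chunks on consecutive pages can be illustrated with a short sketch. The tuple layout `(element, category, start_page, end_page)` and the function name `merge_chunks` are assumptions made for this example.

```python
from typing import List, Tuple

Chunk = Tuple[str, str, int, int]  # (element, category, start_page, end_page)


def merge_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Merge chunks with the same element and category on consecutive pages."""
    merged: List[Chunk] = []
    for element, category, start, end in sorted(chunks, key=lambda c: c[2]):
        if merged:
            p_element, p_category, p_start, p_end = merged[-1]
            same = (p_element, p_category) == (element, category)
            if same and start == p_end + 1:
                # Extend the previous chunk instead of adding a new one.
                merged[-1] = (p_element, p_category, p_start, end)
                continue
        merged.append((element, category, start, end))
    return merged
```

With the example from the text, three single-page chunks on pages 3, 4 and 5 collapse into one chunk spanning pages 3 to 5.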
In addition, after the knowledge blocks are acquired, the knowledge blocks may be expanded using a knowledge graph.
A knowledge graph is a graph-based data structure composed of nodes and edges. Each node represents an entity, such as an employee, a product or a company, and each edge represents a relationship between entities. It is essentially a semantic network that exposes the relationships between entities and can link all information together.
Specifically, the knowledge chunks can be expanded using the nodes and relationships in the knowledge graph. For example, if product A is a child of product B in the knowledge graph, the knowledge chunks of product A can be extended to the product functions of product B. The specific expansion mode depends on the specific knowledge graph.
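One possible reading of this expansion is sketched below, with the knowledge graph reduced to a dictionary of hypothetical "child of" edges. The function name `expand_chunk`, the graph contents and the choice to propagate a chunk to every ancestor are assumptions; as the text notes, the actual expansion mode depends on the knowledge graph in use.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]  # (element, category)

# child -> parent edges of a toy knowledge graph (hypothetical content)
PARENT_OF: Dict[str, str] = {"product A": "product B"}


def expand_chunk(chunk: Chunk, parent_of: Dict[str, str]) -> List[Chunk]:
    """Return the chunk plus copies attached to each ancestor of its element."""
    element, category = chunk
    expanded = [chunk]
    seen = {element}  # guards against cycles in the graph
    while element in parent_of and parent_of[element] not in seen:
        element = parent_of[element]
        seen.add(element)
        expanded.append((element, category))
    return expanded
```

Under this reading, a "product function" chunk for product A is also attached to product B.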
The knowledge chunk extracting method provided by the embodiment of the invention obtains the text information of each page of slide in the PPTX document; the text information comprises the text content in a text box in the slide, the position of the text box and the font size in the text box; determining the knowledge category of each page of slide according to the text information; and performing element extraction on each page of slide based on the knowledge category to obtain a knowledge block of each page of slide. By adopting the scheme provided by the invention, the knowledge chunks in the document can be extracted by utilizing the information such as the font format and the like in the PPTX document, and the extraction is more accurate.
The present invention will be described in further detail with reference to the following application examples.
The application embodiment directly extracts knowledge blocks by using text, format information and the like of the unstructured PPTX document, and expands by using a knowledge graph after extraction is finished; it is more accurate than extracting entities, relationships, etc. from a document for augmenting a knowledge graph, or extracting answers from a document based on a given question.
Specifically, referring to fig. 2, the main steps of the embodiment of the present application for mining knowledge chunks from unstructured PPTX documents are as follows:
Step 1: acquire information such as the text content, text box positions and font sizes of each slide in the PPTX document.
Step 2: judge the knowledge category of each slide according to the extracted text content.
Step 3: determine the elements to be extracted according to the knowledge category, thereby obtaining the knowledge chunk of the slide.
Step 4: merge identical knowledge chunks on consecutive pages, and finally expand the knowledge elements according to the knowledge graph.
The above steps will be described in detail with reference to fig. 3.
Step 1:
The contents of each slide in the PPTX document are parsed using python-pptx, and the position of each text box (distance from the top and left of the slide), the text content in the text box, and the text font size are recorded.
By observing a large amount of data and the habits of most people when editing documents, the strategy for extracting titles, key points and body text is summarized as follows:
1) Title
First, the text boxes located within a preset range at the upper left (within 2.2 cm of the top and within 5 cm of the left edge) are selected as candidates; then the text content with the largest font in the topmost text box is selected as the title, with its word count limited to a certain range (within 50 words).
2) Key points
After the title is extracted, it is checked whether words such as "directory", "content" or "outline" exist in the title. If so, the text content with the largest font among the texts other than the title is extracted as a key point. If not, it is checked whether the font sizes of the texts other than the title are all equal; if they are not, the text content with the largest font is taken as a key point.
3) Text
The text other than the title and key points is obtained by splicing the text contents from top to bottom and from left to right according to the positions of the text boxes.
Using the content information obtained by parsing, the title, key points and text of each slide are extracted according to the above rules. Whether the slide is a directory page is then judged from the extracted title, key points and text. By exploring a large amount of document data, the strategy for roughly identifying directory pages is summarized as follows:
1) if the title has words such as content/directory/outline, the title is a directory page;
2) if no title exists but words such as content/directory/outline exist in the key points, the key points are directory pages;
3) if there is no title, the number of key points is small (less than 3), the number of key points and text words is small (the total number of key points is less than 10, the total number of text words is less than 25), and the words of part/1st/2nd/3rd and the like appear in the key points or text to be directory pages and the like.
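The three directory-page rules above can be expressed as a small predicate. The sketch below is illustrative only: the function name `is_directory_page`, the English word lists, and whitespace-based word counting are assumptions, and the thresholds are the example values given in parentheses above.

```python
import re
from typing import List, Optional

TOC_WORDS = ("content", "directory", "outline")
# Section markers named in rule 3; the exact list is an assumption
SECTION_MARKERS = re.compile(r"\b(part|1st|2nd|3rd)\b", re.IGNORECASE)


def _has_toc_word(text: str) -> bool:
    return any(word in text.lower() for word in TOC_WORDS)


def is_directory_page(
    title: Optional[str],
    key_points: List[str],
    body: str,
    max_key_points: int = 3,
    max_key_words: int = 10,
    max_body_words: int = 25,
) -> bool:
    # Rule 1: a title exists, so only its wording decides
    if title:
        return _has_toc_word(title)
    # Rule 2: no title, but a key point reads like a contents heading
    if any(_has_toc_word(k) for k in key_points):
        return True
    # Rule 3: no title, few and short key points, short body, section markers
    key_words = sum(len(k.split()) for k in key_points)
    combined = " ".join(key_points + [body])
    return (
        len(key_points) < max_key_points
        and key_words < max_key_words
        and len(body.split()) < max_body_words
        and SECTION_MARKERS.search(combined) is not None
    )
```

Substring matching means that "Contents" also triggers the "content" keyword; a stricter tokenizer could be substituted if that proves too permissive.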
Alternatively, each slide can be converted into an image, some data can be manually labeled, and a binary image classification model can be trained to determine whether a slide is a directory page.
In the strategies summarized above, the preset thresholds in parentheses were obtained by summarizing observed data; in practical applications, they can be fine-tuned according to the specific data.
Step 2:
For each slide other than a directory page, the extracted title and key points are fed into a knowledge category classifier to determine its knowledge category, for example no category, industry pain point, product introduction, product advantage, product function, product case, or product architecture; the specific categories may be determined according to the data or business requirements. (No knowledge category is determined for directory pages.)
For the knowledge category classifier, if a large amount of labeled data is not available, a keyword-based method can be used: a batch of keywords is determined manually through data exploration, and the knowledge category is judged from them. For example, a slide containing words such as product summary/brief description/introduction/positioning is classified as a product introduction. If a labeled data set is available, a multi-class knowledge classification network model can be trained to determine the knowledge category.
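A keyword-based classifier of the kind described above might look like the following sketch. Only the product-introduction keywords come from the description; the other keyword lists, the first-match-wins ordering, and the name `classify` are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Only the "product introduction" keywords come from the description;
# the other entries are hypothetical placeholders.
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "product introduction": ["product summary", "brief description",
                             "introduction", "positioning"],
    "industry pain point": ["pain point", "challenge"],
    "product advantage": ["advantage", "strength"],
}
NO_CATEGORY = "no category"


def classify(title: Optional[str], key_points: List[str],
             keywords: Dict[str, List[str]] = CATEGORY_KEYWORDS) -> str:
    """Return the first category whose keywords occur in title or key points."""
    text = " ".join([title or ""] + list(key_points)).lower()
    for category, words in keywords.items():
        if any(word in text for word in words):
            return category
    return NO_CATEGORY
```

When labeled data is available, this function could be replaced by a trained multi-class model with the same inputs and output.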
Step 3:
According to the knowledge category, for every slide except no-category slides, the element corresponding to the slide's knowledge category is extracted. If the knowledge category is product introduction, product advantage, product architecture, or the like, the extracted element is a specific product or project; if the knowledge category is industry pain point, the extracted element is a specific industry or business.
Element extraction is entity recognition: given an input text, the entities in the text are recognized. If a large amount of labeled data is not available, a dictionary-based method can be used: the types of entities to be extracted are collected and organized into a dictionary, and entities are extracted by dictionary matching. If a labeled data set is available, a sequence labeling model such as CRF, LSTM+CRF, or BERT+CRF can be trained to extract entities.
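The dictionary-based variant can be sketched as a case-insensitive substring match. The dictionary contents, the category-to-entity-type mapping, and the function name `extract_entities` are hypothetical; the description does not specify how matches are ordered, so ordering by position in the text is an assumption.

```python
from typing import Dict, List

# Hypothetical dictionary; a real one would be compiled from the data
ENTITY_DICT: Dict[str, List[str]] = {
    "product": ["Product A", "Product B"],
    "industry": ["finance", "retail"],
}

# Which entity type each knowledge category looks for (per the description)
CATEGORY_TO_ENTITY_TYPE: Dict[str, str] = {
    "product introduction": "product",
    "product advantage": "product",
    "product architecture": "product",
    "industry pain point": "industry",
}


def extract_entities(text: str, entity_type: str,
                     dictionary: Dict[str, List[str]] = ENTITY_DICT) -> List[str]:
    """Return dictionary terms found in `text`, ordered by first occurrence."""
    lowered = text.lower()
    hits = [
        (lowered.find(term.lower()), -len(term), term)
        for term in dictionary.get(entity_type, [])
        if term.lower() in lowered
    ]
    # Earlier matches first; on ties, prefer the longer term
    return [term for _, _, term in sorted(hits)]
```

For example, a slide classified as "industry pain point" would be searched with `CATEGORY_TO_ENTITY_TYPE["industry pain point"]`, that is, the `"industry"` entries.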
The priority for selecting a slide's element is: element in the title of the current slide -> element in the key points of the current slide -> time element (that is, the time element of the slide nearest to the current page is used as the current page's element, limited to within 10 pages) -> file name element (that is, an element extracted from the PPTX document name). If no element is extracted from any of these, the slide contains no knowledge. The method for extracting the time element of a slide is described in detail below.
Time element extraction: if the slide is a directory page, the title element is extracted first; if it exists, it is used as the time element of the slide, and if it does not exist, the key point element is extracted as the time element of the slide. If the slide is a non-directory page, only the title element is extracted as the time element of the slide.
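The priority chain and the time-element rule can be sketched together as below. The `Page` record and function names are hypothetical. The description does not say whether "nearest" looks backward, forward, or both; this sketch assumes both directions, checking the preceding page first at each distance.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:  # hypothetical per-slide record of already extracted elements
    title_element: Optional[str]
    key_point_element: Optional[str]
    is_directory: bool = False


def time_element(page: Page) -> Optional[str]:
    """Directory pages fall back to the key point element; others do not."""
    if page.is_directory:
        return page.title_element or page.key_point_element
    return page.title_element


def select_element(pages: List[Page], index: int,
                   file_name_element: Optional[str] = None,
                   max_distance: int = 10) -> Optional[str]:
    page = pages[index]
    # Priority 1 and 2: the current slide's own title, then key points
    if page.title_element:
        return page.title_element
    if page.key_point_element:
        return page.key_point_element
    # Priority 3: time element of the nearest slide within max_distance pages
    # (assumption: search both directions, preceding page first)
    for distance in range(1, max_distance + 1):
        for j in (index - distance, index + distance):
            if 0 <= j < len(pages):
                found = time_element(pages[j])
                if found:
                    return found
    # Priority 4: element from the document file name; None means no knowledge
    return file_name_element
```

A return value of `None` corresponds to the case in which the slide is treated as containing no knowledge.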
As can be seen from the above, if a slide is a non-directory page, its knowledge category is not no category, and an element has been extracted, a knowledge chunk is obtained that starts at that page and ends at that page.
Step 4:
If the knowledge chunks of consecutive pages are the same, they are merged and the page range of the merged knowledge chunk is updated accordingly. For example, the following three knowledge chunks: finance industry pain points starting at page 3 and ending at page 3, finance industry pain points starting at page 4 and ending at page 4, and finance industry pain points starting at page 5 and ending at page 5, can be merged into one knowledge chunk starting at page 3 and ending at page 5.
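The merge step can be sketched as a single pass over page-ordered chunks. The `Chunk` record and the name `merge_chunks` are hypothetical; treating two chunks as "the same" when both element and category match, and as "consecutive" when one starts on the page after the other ends, is an assumption consistent with the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:  # hypothetical record for one knowledge chunk
    element: str
    category: str
    start: int
    end: int


def merge_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Merge identical chunks on consecutive pages into one page range."""
    merged: List[Chunk] = []
    for chunk in sorted(chunks, key=lambda c: c.start):
        last = merged[-1] if merged else None
        if (last is not None
                and last.element == chunk.element
                and last.category == chunk.category
                and chunk.start == last.end + 1):
            last.end = chunk.end  # extend the previous range
        else:
            # Copy so the caller's objects are not mutated
            merged.append(Chunk(chunk.element, chunk.category,
                                chunk.start, chunk.end))
    return merged
```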
Finally, the knowledge chunks are expanded according to an existing knowledge graph. For example, if the extracted knowledge chunk is a product function of product A, and product A is a sub-product of product B in the knowledge graph, the knowledge chunk can be expanded to a product function of product B. How to expand can be determined according to the specific knowledge graph.
It should be noted that, in addition to the python-pptx library, this embodiment may also use other PPTX parsing libraries to extract content.
The knowledge chunks are extracted using the text, format information, and the like of the unstructured PPTX document; compared with extracting knowledge chunks using only the plain text information in the PPTX document, the extracted knowledge chunks are more accurate.
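One possible expansion rule is sketched below, representing the knowledge graph as a plain child-to-parent mapping. The mapping `parent_of` and the name `expand_chunk` are hypothetical, and following the sub-product relation through all ancestors (rather than a single hop) is an assumption; the description leaves the expansion policy to the specific knowledge graph.

```python
from typing import Dict, List, Tuple


def expand_chunk(element: str, category: str,
                 parent_of: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the chunk plus copies attributed to each ancestor product."""
    expanded = [(element, category)]
    seen = {element}
    current = element
    # Walk up the sub-product relation; `seen` guards against cycles
    while current in parent_of and parent_of[current] not in seen:
        current = parent_of[current]
        seen.add(current)
        expanded.append((current, category))
    return expanded
```

With `parent_of = {"Product A": "Product B"}`, a product function chunk for Product A is also attributed to Product B.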
In order to implement the method of the embodiment of the present invention, an embodiment of the present invention further provides a knowledge chunk extraction apparatus. As shown in fig. 4, the knowledge chunk extraction apparatus 400 includes an obtaining module 401, a determining module 402 and an extracting module 403, wherein:
the obtaining module 401 is configured to obtain text information of each slide in a PPTX document; the text information comprises the text content in a text box in the slide, the position of the text box and the font size in the text box;
the determining module 402 is configured to determine a knowledge category of each slide according to the text information;
and the extracting module 403 is configured to perform element extraction on each slide based on the knowledge category, to obtain a knowledge chunk of each slide.
In practical applications, the obtaining module 401, the determining module 402 and the extracting module 403 may be implemented by a processor in the knowledge chunk extraction apparatus.
It should be noted that the apparatus provided in the above embodiment is described only by way of example using the above division of program modules. In practical applications, the processing may be distributed among different program modules as needed; that is, the internal structure of the terminal may be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus embodiment and the method embodiment belong to the same concept; the specific implementation process is described in the method embodiment and is not repeated here.
Based on the hardware implementation of the program modules, and in order to implement the method of the embodiment of the present invention, an embodiment of the present invention further provides an electronic device (computer device). Specifically, in one embodiment, the computer device may be a terminal, and its internal structure may be as shown in fig. 5. The computer device includes a processor A01, a network interface A02, a display screen A04, an input device A05, and a memory (not shown in the figure) connected through a system bus. The processor A01 of the computer device provides computing and control capabilities. The memory of the computer device comprises an internal memory A03 and a non-volatile storage medium A06. The non-volatile storage medium A06 stores an operating system B01 and a computer program B02. The internal memory A03 provides an environment for running the operating system B01 and the computer program B02 stored in the non-volatile storage medium A06. The network interface A02 of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor A01, implements the method of any of the above embodiments. The display screen A04 of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device A05 of the computer device may be a touch layer covering the display screen, a button, trackball or touch pad arranged on the casing of the computer device, or an external keyboard, touch pad or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
The device provided by the embodiment of the present invention includes a processor, a memory, and a program stored in the memory and capable of running on the processor, and when the processor executes the program, the method according to any one of the embodiments described above is implemented.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media) such as modulated data signals and carrier waves.
It will be appreciated that the memory of embodiments of the invention may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory described in the embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (11)

1. A knowledge chunk extraction method, the method comprising:
acquiring text information of each page of slide in the PPTX document; the text information comprises the text content in a text box in the slide, the position of the text box and the font size in the text box;
determining the knowledge category of each page of slide according to the text information;
based on the knowledge category, performing element extraction on each page of slide to obtain a knowledge chunk of each page of slide;
wherein the determining the knowledge category of each page of slide according to the text information comprises:
determining a title and key points in each page of slide according to the text information;
judging whether each page of slide is a directory page or not by using the title and the key point;
in the case that the slide is not a directory page, determining a knowledge category of the slide using the title and the key point;
wherein, the judging whether each page of slide is a directory page by using the title and the key point comprises:
judging whether a preset phrase exists in the title or the key point of each page of the slide, or judging whether the title exists in each page of the slide, whether the number of the key points is less than a preset first threshold value, whether the number of the words of the key points is less than a preset second threshold value, whether the number of the words of the text except the title and the key points in the slide is less than a preset third threshold value, and whether a second preset phrase exists in the key points and the text;
and when a preset phrase exists in the title or the key point, or when the slide has no title, the number of the key points is less than a preset first threshold value, the number of the words of the key points is less than a preset second threshold value, the number of the words of the text except the title and the key points in the slide is less than a preset third threshold value, and the second preset phrase exists in the key points and the text, determining that the slide corresponding to the title or the key points is a directory page.
2. The method of claim 1, wherein determining the title and the key point in each slide according to the text information comprises:
taking the text box with the position of the text box in the text information within a preset range as a first candidate set;
selecting the text content with the maximum font within a preset range of the number of words in the text box at the top position from the first candidate set as a title;
judging whether a preset phrase exists in the title or not;
when the title has a preset phrase, taking the text content with the largest font in other text boxes except the title in the slide as a key point;
and when no preset phrase exists in the title, if the font sizes of the other text contents except the title in the slide are different, taking the text content with the largest font in the other text contents except the title as a key point.
3. The method of claim 1, wherein performing element extraction on each page of slide based on the knowledge category to obtain a knowledge chunk of each page of slide comprises:
judging whether each page of slide is a non-category slide or not according to the knowledge category;
and when the slide is not a non-category slide, performing element extraction on the slide to obtain a knowledge chunk of the slide.
4. The method of claim 3, wherein performing element extraction on the slide to obtain the knowledge chunk of the slide comprises:
acquiring an element in the title of the slide, and taking the element in the title as a knowledge chunk of the slide;
when no element exists in the title of the slide, acquiring an element in the key point of the slide, and taking the element in the key point as a knowledge chunk of the slide;
when no element exists in the key points of the slide, acquiring the time element of a slide adjacent to the slide, and taking the time element of the adjacent slide as a knowledge chunk of the slide;
and when the slide adjacent to the slide has no time element, acquiring an element in the file name of the slide, and taking the element in the file name as a knowledge chunk of the slide.
5. The method of claim 4, wherein acquiring the time element of the slide adjacent to the slide comprises:
when the adjacent slide is a directory page, taking an element in the title of the adjacent slide as the time element of the adjacent slide;
when there is no element in the title of the adjacent slide, taking an element in the key point of the adjacent slide as the time element of the adjacent slide;
determining that the adjacent slide has no time element when there is no element in the key point of the adjacent slide.
6. The method of claim 4, wherein acquiring the time element of the slide adjacent to the slide comprises:
when the adjacent slide is not a directory page, taking an element in the title of the adjacent slide as the time element of the adjacent slide;
determining that the adjacent slide has no time element when there is no element in the title of the adjacent slide.
7. The method of claim 1, further comprising:
merging identical knowledge chunks of consecutive pages.
8. The method of claim 1, further comprising:
expanding the knowledge chunks by using a knowledge graph.
9. A knowledge chunk extraction apparatus, characterized by comprising:
the acquisition module is used for acquiring text information of each page of slide in the PPTX document; the text information comprises the text content in a text box in the slide, the position of the text box and the font size in the text box;
the determining module is used for determining the knowledge category of each page of slide according to the text information;
the extraction module is used for extracting elements of each page of slide based on the knowledge category to obtain a knowledge chunk of each page of slide;
wherein the determining module is further configured to:
determining a title and key points in each page of slide according to the text information;
judging whether each page of slide is a directory page or not by using the title and the key point;
in the case that the slide is not a directory page, determining a knowledge category of the slide using the title and the key point;
wherein the determining module is further configured to:
judging whether a preset phrase exists in the title or the key point of each page of the slide, or judging whether the title exists in each page of the slide, whether the number of the key points is less than a preset first threshold value, whether the number of the words of the key points is less than a preset second threshold value, whether the number of the words of the text except the title and the key points in the slide is less than a preset third threshold value, and whether a second preset phrase exists in the key points and the text;
and when a preset phrase exists in the title or the key point, or when the slide has no title, the number of the key points is less than a preset first threshold value, the number of the words of the key points is less than a preset second threshold value, the number of the words of the text except the title and the key points in the slide is less than a preset third threshold value, and the second preset phrase exists in the key points and the text, determining that the slide corresponding to the title or the key points is a directory page.
10. An electronic device, comprising: a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor is adapted to perform the steps of the method of any one of claims 1 to 8 when running the computer program.
11. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method of any one of claims 1 to 8.
CN202110859647.XA 2021-07-28 2021-07-28 Knowledge chunk extraction method and device, electronic equipment and storage medium Active CN113298914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110859647.XA CN113298914B (en) 2021-07-28 2021-07-28 Knowledge chunk extraction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113298914A CN113298914A (en) 2021-08-24
CN113298914B true CN113298914B (en) 2021-10-15

Family

ID=77331271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110859647.XA Active CN113298914B (en) 2021-07-28 2021-07-28 Knowledge chunk extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113298914B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427623A (en) * 2019-07-24 2019-11-08 深圳追一科技有限公司 Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN111209411A (en) * 2020-01-03 2020-05-29 北京明略软件系统有限公司 Document analysis method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190006027A1 (en) * 2017-06-30 2019-01-03 Accenture Global Solutions Limited Automatic identification and extraction of medical conditions and evidences from electronic health records
CN107358208B (en) * 2017-07-14 2018-07-13 北京神州泰岳软件股份有限公司 A kind of PDF document structured message extracting method and device
CN109101491B (en) * 2018-07-24 2021-12-17 湖南星汉数智科技有限公司 Author information extraction method and device, computer device and computer readable storage medium
CN109800303A (en) * 2018-12-28 2019-05-24 深圳市世强元件网络有限公司 A kind of document information extracting method, storage medium and terminal

Also Published As

Publication number Publication date
CN113298914A (en) 2021-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant