CA3148074A1

CA3148074A1 - Text information extracting method, device, computer equipment and storage medium

Info

Publication number: CA3148074A1
Application number: CA3148074A
Authority: CA
Inventors: Zeyang Meng
Original assignee: 10353744 Canada Ltd
Current assignee: 10353744 Canada Ltd
Priority date: 2021-02-09
Filing date: 2022-02-08
Publication date: 2022-08-09
Also published as: CN112989795A

Abstract

The present invention discloses a text information extracting method, and corresponding device, computer equipment and storage medium. The method comprises: obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field; determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information; partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database.

Description

TEXT INFORMATION EXTRACTING METHOD, DEVICE, COMPUTER EQUIPMENT AND STORAGE MEDIUM
BACKGROUND OF THE INVENTION
Technical Field

[0001] The present invention relates to the field of data processing technology, and more particularly to a text information extracting method, and corresponding device, computer equipment and storage medium.
Description of Related Art

[0002] Announcement texts in the field of finance, such as public prospectuses and contract announcements, are usually extremely complicated and lengthy, and often consolidate information that runs to several hundred pages. For the fund information extracting task, the usual practice in the art is to copy and extract the information through manual operation and maintenance, or through simple regular expression extraction.

[0003] However, these traditional processing modes have evident defects. Purely manual information extraction entails an extremely large workload with much repetitive work, so it is low in efficiency and high in manpower cost. Simple regular expression extraction may miss information; in particular, when the volume of text published by an announcement is extremely large, extraction errors usually occur due to the similarity of information between different chapters and paragraphs, and a great deal of manpower is required for proofreading and checking. In addition, since different fund issuers do not follow a unified structural requirement on the text, the descriptive texts of fund status changes are usually merged and elliptically described to differing extents, and these practices also cause the regular expression extraction mode to fail.

[0004] In short, there is an urgent need to propose a novel long text information extracting method to address the aforementioned problems.
SUMMARY OF THE INVENTION

[0005] In order to solve problems pending in the state of the art, embodiments of the present invention provide a long text information extracting method, and corresponding device, computer equipment and storage medium, so as to overcome the problems existing in the prior-art technology in which workload for information extraction is large, efficiency is low and manpower cost is high, and missing and errors tend to occur.

[0006] To solve one or more of the aforementioned technical problem(s), the present invention employs the following technical solutions.

[0007] According to the first aspect, there is provided a long text information extracting method that comprises the following steps:

[0008] obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;

[0009] determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;

[0010] partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and

[0011] generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field.

[0012] Further, the partition list includes paragraph lists and sentence lists, and the step of partitioning the chapter information according to a preset rule, and generating a corresponding partition list includes:

[0013] partitioning each piece of the chapter information into paragraphs according to preset paragraph features, and generating corresponding paragraph lists respectively; and

[0014] partitioning each paragraph in each paragraph list into sentences according to preset sentence features, and generating corresponding sentence lists respectively.

[0015] Further, when the target information corresponding to the extracting field is long text information, the step of generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database includes:

[0016] determining a first paragraph or a first sentence in the partition list in which the extracting field is located, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;

[0017] searching the first paragraph and the second paragraph, or the first sentence and the second sentence, by use of a preset searching rule, and determining the target information corresponding to the extracting field; and

[0018] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0019] Further, when the target information corresponding to the extracting field is short text information, the step of generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database includes:

[0020] performing a target detection process on the sentences in the partition list, and obtaining the target information corresponding to the extracting field; and

[0021] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0022] Further, when the extracting field is status change, the step of generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database includes:

[0023] obtaining business status change information in the sentence lists according to the extracting rule, and generating key-value pair information corresponding to the text to be extracted according to the business status change information and the extracting field and storing the information into a database.

[0024] Further, prior to storing the key-value pair information into a database, the method further comprises:

[0025] denoising the key-value pair information, and storing the denoised key-value pair information into the database.

[0026] Further, the extracting rule includes a regular expression.

[0027] According to the second aspect, there is provided a text information extracting device that comprises:

[0028] a data obtaining module, for obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;

[0029] a chapter obtaining module, for determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;

[0030] a data partitioning module, for partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and

[0031] an information generating module, for generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and information corresponding to the extracting field.

[0032] According to the third aspect, there is provided a computer equipment that comprises a memory, a processor and a computer program stored on the memory and operable on the processor, and the following steps are realized when the processor executes the computer program:

[0033] obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;

[0034] determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;

[0035] partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and

[0036] generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field.

[0037] According to the fourth aspect, there is provided a computer-readable storage medium storing a computer program thereon, and the following steps are realized when the computer program is executed by a processor:

[0038] obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;

[0039] determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;

[0040] partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and

[0041] generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field.

[0042] The technical solutions provided by the embodiments of the present invention bring about the following advantageous effects.

[0043] In the text information extracting method, and corresponding device, computer equipment and storage medium provided by the embodiments of the present invention, a text to be extracted and an extracting rule corresponding to the text to be extracted are obtained, wherein the extracting rule includes an extracting field; a chapter location of each piece of directory information of the file directory in the text to be extracted is determined based on a file directory of the text to be extracted, and chapter information is generated; the chapter information is partitioned according to a preset rule, and a corresponding partition list is generated; and key-value pair information corresponding to the text to be extracted is generated according to the partition list and the extracting rule and stored into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field. On the one hand, the present invention enhances the efficiency of text extraction, avoids such problems as missing information and errors in information extraction, and enhances the precision of text extraction. On the other hand, by dividing the long text, the present invention avoids the infinite backtracking that might be encountered in regular expression matching, enhances the fault tolerance of the code, and reduces time consumption in the overall operation.

[0044] In the text information extracting method, and corresponding device, computer equipment and storage medium provided by the embodiments of the present invention, by partitioning each piece of the chapter information into paragraphs according to preset paragraph features, and generating corresponding paragraph lists respectively, and partitioning each paragraph in each paragraph list into sentences according to preset sentence features, and generating corresponding sentence lists respectively, the text is precisely positioned to the levels of chapter, paragraph and sentence through the directory hierarchy positioning mode, so that relevant information can be precisely located in and extracted from the text to be extracted.

[0045] In the text information extracting method, and corresponding device, computer equipment and storage medium provided by the embodiments of the present invention, by denoising the key-value pair information and storing the denoised key-value pair information into the database, the key-value pair information extracted from the text is further screened and filtered, whereby the precision in long text information extraction is effectively enhanced.
BRIEF DESCRIPTION OF THE DRAWINGS

[0046] To more clearly describe the technical solutions in the embodiments of the present invention, drawings required to illustrate the embodiments will be briefly introduced below. Apparently, the drawings introduced below are merely directed to some embodiments of the present invention, while persons ordinarily skilled in the art may further acquire other drawings on the basis of these drawings without spending creative effort in the process.

[0047] Fig. 1 is a flowchart illustrating a fund announcement long text information extracting method according to an exemplary embodiment;

[0048] Fig. 2 is a flowchart illustrating a fund status change information extracting method according to an exemplary embodiment;

[0049] Fig. 3 is a flowchart illustrating a text information extracting method according to an exemplary embodiment;

[0050] Fig. 4 is a view schematically illustrating the structure of a text information extracting device according to an exemplary embodiment; and

[0051] Fig. 5 is a view schematically illustrating the internal structure of a computer equipment according to an exemplary embodiment.
DETAILED DESCRIPTION OF THE INVENTION

[0052] To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and comprehensively described below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the embodiments as described are merely some, rather than all, of the embodiments of the present invention. Any other embodiments obtainable by persons ordinarily skilled in the art on the basis of the embodiments in the present invention without creative effort shall all fall within the protection scope of the present invention.

[0053] As noted in the Description of Related Art, common information disclosure texts such as fund public prospectuses and fund contracts usually consolidate information that runs to several hundred pages. Accordingly, the workload in information extraction from these types of texts is extremely large, and such problems as missing information and errors tend to occur.

[0054] To solve the above problems, a text information extracting method is proposed in the embodiments of the present invention. Starting from the document structure of the text to be extracted, the method precisely positions the text to the levels of chapter, paragraph and sentence through the directory hierarchy positioning mode. With respect to the extraction of generalization information in the text to be extracted, sentences and paragraphs are taken as input data, the data information required to be extracted is automatically detected through a scheme of multiple rules, and denoising and calibration are performed thereon, so as to obtain the key-value pair information to which the generalization information corresponds. Likewise, with respect to the extraction of status change information of the business involved in the text to be extracted, a part-of-speech hierarchy is established for the descriptive words of statuses, and a status change list is extracted through the combination form of [action-business]. In this way, the precision of information extraction is ensured, and missing information and errors in information extraction are avoided. Moreover, by dividing the long text, the present invention avoids the infinite backtracking that might be encountered in regular expression matching, enhances the fault tolerance of the code, and reduces time consumption in the overall operation.

[0055] Embodiment 1

[0056] Specifically, as shown in Fig. 1, taking for example a fund-related disclosure text, the process of employing the aforementioned method to extract information from a fund announcement long text includes the following steps.

[0057] Step 1 - obtaining an original long text sequence of information to be extracted, wherein the original long text sequence includes a fund announcement long text.

[0058] Specifically, the text to be extracted here mainly includes disclosure texts such as fund information disclosure prospectuses and fund contracts obtained from official websites that publish disclosure information. As should be noted here, these disclosure texts are merely examples and do not constitute any restriction on the embodiments of the present invention. Besides the aforementioned long texts, the method provided by the embodiments of the present invention is also applicable to information extraction from other long texts having fixed directory structures.

[0059] Step 2 - configuring an extracting rule for information extraction of the announcement long text.

[0060] Specifically, this process is mainly directed to infusing an extracting rule to subsequent steps. The extracting rule includes, but is not limited to, regular statements of configuration files and external artificial rule citations, in which regular expressions of identical extracting fields can be used in superimposition, and the external artificial rule citations are mainly used to configure such information as fields required to be extracted by the user; during specific implementation, an external artificial rule citation can be imported in a form file format, and can also be configured via a backstage operation and maintenance platform. As should be noted here, in the embodiments of the present invention, the extracting rule is embodied in the mode of a combination of multiple rules, thus making it possible to effectively enhance the efficiency and precision of long text information extraction.
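As a minimal sketch of how such a rule configuration might look, the following Python fragment attaches several regular expressions to one extracting field and tries them in order, which is one possible reading of using regular expressions of identical extracting fields "in superimposition". The class name ExtractingRule, the field "fund manager" and both patterns are hypothetical and are not taken from the embodiments.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExtractingRule:
    """Hypothetical container: one extracting field plus its regular expressions."""
    field_name: str
    patterns: list = field(default_factory=list)  # tried in the order given

    def search(self, text):
        """Return the first capture group of the first pattern that matches."""
        for pattern in self.patterns:
            match = re.search(pattern, text)
            if match:
                return match.group(1).strip()
        return None


# Two patterns superimposed on the same extracting field (illustrative only).
RULES = [
    ExtractingRule(
        field_name="fund manager",
        patterns=[
            r"Fund manager:\s*(.+)",
            r"The manager of the fund is\s+(.+?)\.",
        ],
    ),
]


def apply_rules(text, rules=RULES):
    """Build a {field: value} mapping for every rule that finds a value."""
    results = {}
    for rule in rules:
        value = rule.search(text)
        if value is not None:
            results[rule.field_name] = value
    return results
```

In such a design, a user-supplied list of fields could be loaded from a file and turned into ExtractingRule objects, so that new fields are added without changing the extraction code.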

[0061] Step 3 - positioning a chapter in which directory information is located according to a file directory of the announcement long text, and generating chapter information.

[0062] Specifically, the method provided by the embodiments of the present invention mainly processes long texts having fixed directory structures, and the directory information is usually the title information of each chapter. As a preferred example, while a chapter in which the directory information is located is being positioned, the directory information can be used as an extracting field to automatically position and extract the chapter in which the field is located through a preset screening function of chapter positioning, and to generate corresponding chapter information, which includes the title and the entire content of the chapter. During specific implementation, chapter blocks in Chinese documents can usually be positioned through regular expressions.
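The chapter positioning described above can be sketched as follows. This is an illustrative assumption rather than the screening function of the embodiments: each directory entry is treated as a title that stands on a line of its own in the body, and a chapter's content is taken to run from its title to the next located title. The function name locate_chapters and the sample titles are hypothetical.

```python
import re


def locate_chapters(text, directory):
    """Return [{"title": ..., "content": ...}] for each directory entry found."""
    positions = []
    for title in directory:
        # Match the title only when it occupies a whole line, which skips
        # table-of-contents lines that carry dot leaders and page numbers.
        pattern = re.compile(
            r"^[ \t]*" + re.escape(title) + r"[ \t]*$", re.MULTILINE
        )
        match = pattern.search(text)
        if match:
            positions.append((match.start(), match.end(), title))

    positions.sort()  # order chapters by where they start in the text
    chapters = []
    for index, (start, end, title) in enumerate(positions):
        if index + 1 < len(positions):
            next_start = positions[index + 1][0]
        else:
            next_start = len(text)
        chapters.append({"title": title, "content": text[end:next_start].strip()})
    return chapters
```

Because every later step works on one chapter at a time, a regular expression never has to scan the whole document, which is consistent with the stated aim of limiting backtracking on long texts.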

[0063] Step 4 - partitioning the chapter information, and generating corresponding paragraph lists and sentence lists.

[0064] Specifically, the chapter information generated in the aforementioned step is further finely processed into paragraph text blocks and sentence text blocks, and paragraph lists and sentence lists are respectively generated. During specific implementation, while paragraphs are being partitioned, it is possible to segment the text inside the chapter into paragraphs according to paragraph features. Features of Chinese paragraphs include, but are not limited to, a blank at the end of a paragraph line and an indent at the start of a paragraph line, etc. While sentence partition is being performed, a previously generated paragraph is further extracted according to sentence features, and the paragraph is re-partitioned into sentences. Sentence features include, but are not limited to, a full stop, an exclamatory mark, etc.
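A minimal sketch of this two-level partitioning is given below. The concrete features are assumptions chosen for illustration: a paragraph is taken to begin at an indented line or after a blank line, and a sentence is taken to end at a Chinese or Western full stop, exclamation mark or question mark, with a Western full stop counted only when followed by whitespace or the end of the paragraph so that decimals such as 1.5 are not split. The function names are hypothetical.

```python
import re

# Assumed paragraph feature: an indent (tab, two or more spaces, or
# full-width spaces) at the start of a line opens a new paragraph.
PARAGRAPH_START = re.compile(r"^(?:\t| {2,}|\u3000+)")

# Assumed sentence feature: text up to and including a sentence terminator.
SENTENCE = re.compile(r".+?(?:[。！？!?]|\.(?=\s|$)|$)", re.DOTALL)


def split_paragraphs(chapter_text):
    """Partition chapter content into a paragraph list."""
    paragraphs, current = [], []
    for line in chapter_text.splitlines():
        if not line.strip():  # a blank line closes the current paragraph
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if PARAGRAPH_START.match(line) and current:  # indent opens a new one
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.strip())
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def split_sentences(paragraph):
    """Partition one paragraph into a sentence list."""
    pieces = (piece.strip() for piece in SENTENCE.findall(paragraph.strip()))
    return [piece for piece in pieces if piece]
```

Joining wrapped lines with a space suits Western text; for Chinese text the lines would more likely be joined without a separator.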

[0065] Step 5 - performing information extraction on the paragraph lists and the sentence lists according to the extracting rule configured in step 2, and obtaining key-value pair information to which the announcement long text corresponds.

[0066] Specifically, here the key in the key-value pair information is an extracting field defined in the extracting rule, and the value is relevant information extracted from the paragraph lists and the sentence lists in accordance with the extracting field and according to the extracting rule. During specific extraction, different pieces of information can employ different extracting modes. For instance, some values in fund generalization information are descriptive text, which may be one paragraph, several sentences, or one sentence, so the concept of an adjacent sentence (or paragraph) is introduced here for searching. For instance, when a paragraph or sentence to which the extracting field corresponds ends with a colon, the content following the colon is usually the information required to be extracted, and the paragraph or sentence following the colon is then taken as the extracted target information. When the value required to be extracted is short information of a specific type, for instance a date, such information is usually intermingled in a sentence, and the mode of target detection can then be employed to extract the relevant information contained in the sentence.
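The two extracting modes can be illustrated with the sketch below. It assumes that the "adjacent" unit is the next entry in the paragraph or sentence list, and it stands in for target detection with a simple date pattern; the embodiments do not specify either detail, and the function names and the date formats are hypothetical.

```python
import re

# Assumed date formats: 2021-02-09 and 2021年2月9日.
DATE_PATTERN = re.compile(r"\d{4}-\d{1,2}-\d{1,2}|\d{4}年\d{1,2}月\d{1,2}日")


def extract_long_value(field_name, units):
    """Long text mode: units is an ordered paragraph list or sentence list."""
    for index, unit in enumerate(units):
        if field_name not in unit:
            continue
        if unit.rstrip().endswith((":", "：")):
            # The unit ends with a colon, so the value is assumed to be
            # the adjacent (next) paragraph or sentence.
            if index + 1 < len(units):
                return units[index + 1].strip()
            return None
        # Otherwise take whatever follows the field inside the same unit.
        remainder = unit.split(field_name, 1)[1]
        return remainder.lstrip(" :：").strip() or None
    return None


def extract_date(field_name, sentences):
    """Short text mode: detect a date inside the sentence holding the field."""
    for sentence in sentences:
        if field_name in sentence:
            match = DATE_PATTERN.search(sentence)
            if match:
                return match.group(0)
    return None
```

For example, with the list ["Investment objective:", "To seek long-term capital growth."], the long text mode returns the second entry for the field "Investment objective".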

[0067] Step 6 - denoising the key-value pair information, and obtaining the key-value pair information thus processed.

[0068] Specifically, a series of output values (namely the key-value pair information) can be obtained through the foregoing steps. Although the paragraph to which the information corresponds has been precisely positioned, some noise might be present in these output values, and some cases of disorderly extraction might even appear. In order to solve this problem, a numerical value denoising filter is introduced in the embodiments of the present invention to further purify redundant or unreasonable results. The denoising process includes, but is not limited to, numerical value type verification (for in-sentence numerical value cleaning), numerical value cutoff extraction (for use in inter-sentence information), etc., which are not further elaborated here.
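Since the denoising filter is not detailed in the embodiments, the following is only one possible interpretation, offered as a sketch. Type verification is assumed to mean discarding a value for a numeric field unless the whole value is a number or percentage, and cutoff extraction is assumed to mean truncating a value at the first stop marker so that text from a following sentence is dropped. All names, markers and sample fields are hypothetical.

```python
import re

# Assumed numeric form: digits, optional decimal part, optional percent sign.
NUMBER = re.compile(r"^\d+(?:\.\d+)?%?$")


def verify_numeric(value):
    """Type verification: return the cleaned number, or None if not numeric."""
    cleaned = value.replace(",", "").strip()
    return cleaned if NUMBER.match(cleaned) else None


def cut_off(value, stop_markers=("。", ";", "；", "\n")):
    """Cutoff extraction: keep only the text before the first stop marker."""
    end = len(value)
    for marker in stop_markers:
        position = value.find(marker)
        if position != -1:
            end = min(end, position)
    return value[:end].strip()


def denoise(pairs, numeric_fields):
    """Apply both filters and drop pairs whose value does not survive."""
    cleaned = {}
    for key, value in pairs.items():
        value = cut_off(value)
        if key in numeric_fields:
            value = verify_numeric(value)
        if value:
            cleaned[key] = value
    return cleaned
```

Dropping a pair rather than storing a doubtful value leaves the gap visible to the manual check in the next step.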

[0069] Step 7 - subjecting the denoised key-value pair information to manual check and verification, and storing the manually checked and verified key-value pair information into a database.

[0070] Specifically, the manually checked and verified information can serve as fund foundation information supplying a series of fund diagnostic and screening bases and providing data support for internal and external platforms.

[0071] Specifically, during specific implementation of the aforementioned steps, a PySpark big data task can be deployed on a pre-constructed big data cloud platform to process fund information extracting tasks routinely and incrementally, and output results are stored in a Hive table, whereby it is made possible to analyze and probe the long text in real time, on the order of several minutes.

[0072] Specifically, as shown in Fig. 2, taking a fund status change announcement text for example, a fund status change information extracting method is further provided in the embodiments of the present invention, and the process thereof includes the following steps.

[0073] Step R0 - extracting a corresponding sentence list from a fund status change announcement text, analyzing the extracted sentence list and an announcement title of the status change announcement text, and obtaining an analyzing result.

[0074] Specifically, the process of extracting a sentence list from a fund status change announcement text can be inferred from the foregoing steps 1 to 4, and is not repeated here. As should also be noted here, the fund status change announcement text in the embodiments of the present invention is merely an example and does not constitute any restriction on the embodiments of the present invention. Besides the fund status change announcement text, the status change information extracting method provided by the embodiments of the present invention is also applicable to information extraction from other long texts having fixed directory structures.

[0075] Specifically, since the titles of some announcement texts also contain information required to be extracted, the title of the announcement should be taken together into consideration while status change is being analyzed. For instance, the title of a certain announcement text reads "Announcement Relating to Temporary Stop of Large-sum Subscription, Fixed Investment and Conversion & Transfer Businesses of xx Money Market Fund", the "Temporary Stop of Large-sum Subscription, Fixed Investment and Conversion & Transfer Businesses" in the title is also information required to be extracted.

[0076] Step R1 - performing action extraction on the analyzing result, and obtaining action information.

[0077] Specifically, in the embodiments of the present invention, action information includes action type words appearing in the announcement text and the title. The action type words here are mainly differentiated according to the part of speech, and action type words relevant to business status change frequently seen in the field of finance include open, temporarily stop, restore and restrict, etc.

[0078] Step R2 - performing business extraction on the analyzing result, and obtaining business information.

[0079] Specifically, in the embodiments of the present invention, business information includes such nouns of business properties appearing in the announcement text and the title as subscription, redemption, fixed investment, conversion and transfer, etc. Since business change involves status and sum of money, business nouns are further combined with some modifiers in the embodiments of the present invention to form new business terms, such as "large-sum redemption". In addition, business terms also appear as common phrases, bynames and abbreviations, and these are uniformly replaced at this step in the embodiments of the present invention.
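The business extraction of step R2 can be sketched as below, under stated assumptions: the alias table, the noun list and the modifier list are small hypothetical vocabularies, aliases are replaced before matching, and a modifier is attached only to the business noun that directly follows it.

```python
import re

# Hypothetical vocabularies; a real deployment would configure these.
ALIASES = {"switching": "conversion", "regular investment": "fixed investment"}
BUSINESS_NOUNS = ["subscription", "redemption", "fixed investment", "conversion", "transfer"]
MODIFIERS = ["large-sum"]


def normalize(text):
    """Uniformly replace bynames and abbreviations with canonical terms."""
    for alias, canonical in ALIASES.items():
        text = text.replace(alias, canonical)
    return text


def extract_business(text):
    """Return business terms in order of appearance, without duplicates."""
    text = normalize(text.lower())
    modifier_part = "|".join(re.escape(m) for m in MODIFIERS)
    # Longer nouns first so that a multi-word noun wins over a shorter one.
    nouns = sorted(BUSINESS_NOUNS, key=len, reverse=True)
    noun_part = "|".join(re.escape(n) for n in nouns)
    pattern = re.compile(rf"(?:(?:{modifier_part})\s+)?(?:{noun_part})")

    found = []
    for match in pattern.finditer(text):
        term = match.group(0)
        if term not in found:
            found.append(term)
    return found
```

Applied to a title such as "Temporary stop of large-sum subscription, regular investment and switching", this yields the terms "large-sum subscription", "fixed investment" and "conversion".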

[0080] Step R3 - generating status change information according to the action information and the business information.

[0081] Specifically, the action and business phrases extracted and obtained in the foregoing steps are arranged and combined, and matched with enumerated values of the status change to obtain a completed change list (namely status change information).
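A minimal sketch of steps R1 and R3 follows. The action vocabulary repeats the action words named above, while the alias table and the set of enumerated status change values are hypothetical placeholders. Actions and businesses are combined as a Cartesian product, and only combinations present in the enumerated set are kept.

```python
from itertools import product

ACTION_WORDS = ["temporarily stop", "open", "restore", "restrict"]
ACTION_ALIASES = {"temporary stop": "temporarily stop"}  # hypothetical alias

# Hypothetical enumerated values of permitted status changes.
ENUMERATED_CHANGES = {
    ("temporarily stop", "subscription"),
    ("temporarily stop", "redemption"),
    ("temporarily stop", "large-sum subscription"),
    ("restore", "subscription"),
    ("restore", "redemption"),
    ("open", "subscription"),
}


def extract_actions(text):
    """Step R1: return the action words that appear in the text."""
    text = text.lower()
    for alias, canonical in ACTION_ALIASES.items():
        text = text.replace(alias, canonical)
    return [word for word in ACTION_WORDS if word in text]


def build_change_list(actions, businesses):
    """Step R3: combine [action-business] pairs and keep enumerated ones."""
    return [pair for pair in product(actions, businesses) if pair in ENUMERATED_CHANGES]
```

With the elliptical phrase "temporary stop of subscription, redemption", one action and two businesses are extracted, and the combination produces two separate status changes.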

[0082] Step R4 - verifying the status change information, and storing the verified information into a database.

[0083] Specifically, when the status change information is stored in the database, storage can likewise be effected in the form of key-value pairs. During specific implementation, the "status change" field is taken as the key, and the specific status change information as extracted is taken as the value.

[0084] Specifically, in the embodiments of the present invention, a status change is divided into action and business because actions or businesses in the statuses are described considerably elliptically in such long texts as fund disclosure texts (for instance, "temporary stop of subscription, redemption" actually expresses two status changes: temporary stop of subscription and temporary stop of redemption). The division can effectively alleviate this circumstance, and can even avoid incomplete extraction due to information dislocation and information ellipsis.

[0085] Embodiment 2

[0086] Fig. 3 is a flowchart illustrating a text information extracting method according to an exemplary embodiment. With reference to Fig. 3, the method comprises the following steps.

[0087] S1 - obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field.

[0088] Specifically, the text to be extracted includes, but is not limited to, such long texts having fixed directory structures as fund information disclosure prospectuses and fund contracts. As should be noted here, the information extracting method provided by the embodiments of the present invention is further applicable to information extraction from other long texts with relatively standard structures and styles. The extracting rule includes regular statements of configuration files and self-defined rules; the self-defined rules are mainly used to configure such information as fields required to be extracted by the user, and can be adjusted according to practical requirements of the user, so as to adapt to different information extracting requirements. The extracting rule is embodied in the mode of a combination of multiple rules, thus making it possible to effectively enhance the efficiency and precision of long text information extraction.

[0089] S2 - determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information.

[0090] Specifically, some documents usually have relatively fixed template structures, such as the directory structure, etc. In the embodiments of the present invention, the file directory inherent in the text to be extracted is utilized to precisely position the text to the levels of chapter, paragraph and sentence through the directory hierarchy positioning mode, and to prepare for subsequent information extraction. While chapter positioning is being performed, the directory information can be used as an extracting field (the directory information is usually the title of each chapter) to automatically position and extract the chapter in which the field is located through the mode of regular expression, and to generate corresponding chapter information. The chapter information includes the title and the corresponding entire content of the chapter in the embodiments of the present invention.

[0091] S3 - partitioning the chapter information according to a preset rule, and generating a corresponding partition list.

[0092] Specifically, in order to enhance the precision of information extraction, after the chapters of the text to be extracted have been positioned, the chapter information is further finely partitioned: the chapter information corresponding to each chapter is first partitioned into paragraphs, each paragraph is then partitioned into sentences in sequence, and paragraph lists and sentence lists are generated from the partitioning results for use in subsequent steps. For the specific partitioning process, refer to the descriptions of the related steps in Embodiment 1, which are not repeated here.
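The two-level partitioning might look like the following sketch. The preset paragraph and sentence features are not specified in this section, so the sketch assumes that a blank line separates paragraphs and that a sentence ends with `.`, `!` or `?` followed by whitespace; both function names are hypothetical.

```python
import re


def split_paragraphs(chapter_text):
    """Split chapter content into paragraphs (assumed feature: blank line)."""
    return [p.strip() for p in re.split(r"\n\s*\n", chapter_text) if p.strip()]


def split_sentences(paragraph):
    """Split a paragraph into sentences (assumed feature: terminator + whitespace)."""
    # The lookbehind keeps the terminator attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [s.strip() for s in parts if s.strip()]


chapter = "The fund is open-ended. It invests in bonds.\n\nFees are charged monthly."
paragraph_list = split_paragraphs(chapter)
sentence_lists = [split_sentences(p) for p in paragraph_list]
```

Requiring whitespace after the terminator keeps values such as `1.5%` intact, but abbreviations that end in a period would still be split, so a production rule set would need further features.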

[0093] S4 - generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field.

[0094] Specifically, information extraction is performed on the partition list obtained in the foregoing step in accordance with the extracting field and according to the extracting rule to obtain information required to be extracted, key-value pair information is then generated from the extracting field and the extracted information, and the key-value pair information is stored into a database.
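A minimal sketch of the storage step follows, assuming an SQLite database and JSON serialization of the value. The table name, column names and function names are hypothetical; the embodiments do not name a particular database.

```python
import json
import sqlite3


def store_key_value(connection, extracting_field, partition_list, target_information):
    """Store one key-value pair: key = extracting field, value = partition list + target."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS extraction (field TEXT PRIMARY KEY, value TEXT)"
    )
    value = json.dumps(
        {"partition_list": partition_list, "target_information": target_information},
        ensure_ascii=False,  # keep non-ASCII text readable in the stored value
    )
    connection.execute(
        "INSERT OR REPLACE INTO extraction (field, value) VALUES (?, ?)",
        (extracting_field, value),
    )
    connection.commit()


def load_value(connection, extracting_field):
    """Read the stored value back, or None if the field was never stored."""
    row = connection.execute(
        "SELECT value FROM extraction WHERE field = ?", (extracting_field,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

For example, `store_key_value(sqlite3.connect(":memory:"), "Fund name", ["Fund name: Alpha Fund."], "Alpha Fund")` writes one row whose value holds both the partition list and the target information.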

[0095] As a preferred mode of execution in the embodiments of the present invention, the partition list includes paragraph lists and sentence lists, and the step of partitioning the chapter information according to a preset rule, and generating a corresponding partition list includes:

[0096] partitioning each piece of the chapter information into paragraphs according to preset paragraph features, and generating corresponding paragraph lists respectively; and

[0097] partitioning each paragraph in each paragraph list into sentences according to preset sentence features, and generating corresponding sentence lists respectively.

[0098] Specifically, for the processes of partitioning into paragraphs and partitioning into sentences, refer to the descriptions of the related steps in Embodiment 1, which are not repeated here.

[0099] As a preferred mode of execution in the embodiments of the present invention, when the target information corresponding to the extracting field is long text information, the step of generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database includes:

[0100] determining a first paragraph or a first sentence in the partition list in which the extracting field is located, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;

[0101] searching the first paragraph and the second paragraph, or the first sentence and the second sentence, by use of a preset searching rule, and determining the target information corresponding to the extracting field; and

[0102] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0103] Specifically, the target information corresponding to the extracting field may be long text information. In fund overview information, for example, some values are descriptive text, so the information may be one paragraph, several sentences, or one sentence. The concept of an adjacent sentence (or paragraph) can be introduced here for searching, namely the adjacent sentence or paragraph is also taken into consideration. For instance, when a paragraph or sentence to which the extracting field corresponds ends with a colon, the content following the colon is usually the information required to be extracted, and the paragraph or sentence following the colon is then taken as the extracted target information.
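The colon example can be sketched as below. The function name `extract_long_text` is hypothetical, and the fallback that returns the remainder of the same unit when it does not end with a colon is an added assumption, not something stated in the embodiments.

```python
def extract_long_text(units, extracting_field):
    """Search a paragraph list or sentence list for the field's long-text value."""
    for index, unit in enumerate(units):
        if extracting_field not in unit:
            continue
        stripped = unit.rstrip()
        # Unit ends with a colon: take the adjacent (following) unit as the target.
        if stripped.endswith((":", "：")) and index + 1 < len(units):
            return units[index + 1].strip()
        # Assumed fallback: take what follows the field inside the same unit.
        _, _, remainder = stripped.partition(extracting_field)
        return remainder.lstrip(" :：").strip() or None
    return None


units = [
    "Investment objective:",
    "The fund seeks long-term capital growth.",
    "Risk level: medium.",
]
```

Here `extract_long_text(units, "Investment objective")` returns the adjacent unit, while `extract_long_text(units, "Risk level")` falls back to the text after the field in the same unit.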

[0104] As a preferred mode of execution in the embodiments of the present invention, when the target information corresponding to the extracting field is short text information, the step of generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database includes:

[0105] performing a target detection process on the sentences in the partition list, and obtaining the target information corresponding to the extracting field; and

[0106] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0107] Specifically, the target information corresponding to the extracting field may be short text information, for instance a date. Such information is usually intermingled with other content in a sentence, so target detection can be employed to extract the relevant information contained in the sentence.
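The target detection process is not detailed in this section. As a stand-in, the sketch below uses a plain regular expression to pick dates out of the sentences that mention the extracting field; the `YYYY-M-D` date format, the pattern and the function name are all assumptions made for illustration.

```python
import re

# Assumed date format: four-digit year, then month and day separated by hyphens.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{1,2})-(\d{1,2})\b")


def detect_dates(sentences, extracting_field):
    """Return normalized dates found in sentences that contain the field."""
    results = []
    for sentence in sentences:
        if extracting_field not in sentence:
            continue
        for match in DATE_PATTERN.finditer(sentence):
            year, month, day = (int(group) for group in match.groups())
            results.append(f"{year:04d}-{month:02d}-{day:02d}")
    return results


sentences = [
    "The fund was established on 2021-2-9 after approval.",
    "The establishment date of the fund is 2022-02-08.",
    "Fees are paid monthly.",
]
```

Because each sentence is short, the pattern is applied to a small span at a time, which is the benefit the partitioning step is meant to provide.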

[0108] As a preferred mode of execution in the embodiments of the present invention, when the extracting field is status change, the step of generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database includes:

[0109] obtaining business status change information in the sentence lists according to the extracting rule, and generating key-value pair information corresponding to the text to be extracted according to the business status change information and the extracting field and storing the information into a database.

[0110] Specifically, for the status change information extracting process, refer to the fund status change information extracting process in Embodiment 1, which is not repeated here.

[0111] As a preferred mode of execution in the embodiments of the present invention, prior to storing the key-value pair information into a database, the method further comprises:

[0112] denoising the key-value pair information, and storing the denoised key-value pair information in the database.

[0113] Specifically, in order to enhance the precision of information extraction, the key-value pair information generated in the foregoing step is further filtered in the embodiments of the present invention. In a specific implementation, the key-value pair information can be denoised to remove redundant or unreasonable results.
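What counts as redundant or unreasonable is not defined in this section. The sketch below assumes three illustrative criteria: empty values, values longer than a chosen limit, and values that merely repeat the key are discarded, and duplicate values are kept once. The function name and the `max_length` parameter are hypothetical.

```python
def denoise(pairs, max_length=500):
    """Filter (key, value) pairs and group the surviving values by key."""
    cleaned = {}
    for key, value in pairs:
        text = " ".join(value.split())  # collapse runs of whitespace
        if not text or len(text) > max_length:
            continue  # assumed "unreasonable": empty or overly long
        if text == key:
            continue  # assumed "unreasonable": value only repeats the field name
        values = cleaned.setdefault(key, [])
        if text not in values:  # assumed "redundant": exact duplicate
            values.append(text)
    return cleaned


pairs = [
    ("Fund name", "Alpha  Fund"),
    ("Fund name", "Alpha Fund"),
    ("Fund name", ""),
    ("Custodian", "Custodian"),
]
```

With these sample pairs, only one `Fund name` value survives and the `Custodian` entry is dropped entirely.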

[0114] As a preferred mode of execution in the embodiments of the present invention, the extracting rule includes a regular expression.

[0115] Fig. 4 is a view schematically illustrating the structure of a text information extracting device according to an exemplary embodiment. With reference to Fig. 4, the device comprises:

[0116] a data obtaining module, for obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;

[0117] a chapter obtaining module, for determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;

[0118] a data partitioning module, for partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and

[0119] an information generating module, for generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and information corresponding to the extracting field.
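The four modules listed above could be wired together as in the following sketch. It is illustrative only: the class name `ExtractingDevice`, the callable signatures, the in-memory dictionary standing in for the database, and the toy lambdas are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ExtractingDevice:
    data_obtaining: Callable[[], Tuple[str, str]]        # -> (text, extracting field)
    chapter_obtaining: Callable[[str], Dict[str, str]]   # text -> {title: content}
    data_partitioning: Callable[[str], List[str]]        # content -> partition list
    information_generating: Callable[[List[str], str], Optional[str]]

    def run(self) -> Dict[str, dict]:
        """Run the modules in order and return the stored key-value pairs."""
        text, extracting_field = self.data_obtaining()
        database: Dict[str, dict] = {}  # stand-in for a real database
        for content in self.chapter_obtaining(text).values():
            partition_list = self.data_partitioning(content)
            target = self.information_generating(partition_list, extracting_field)
            if target is not None:
                database[extracting_field] = {
                    "partition_list": partition_list,
                    "target_information": target,
                }
        return database


# Toy modules, only to show the data flow between the four stages.
device = ExtractingDevice(
    data_obtaining=lambda: ("Fees\nManagement fee: 1%. Custody fee: 0.2%.", "Custody fee"),
    chapter_obtaining=lambda text: {"Fees": text.split("\n", 1)[1]},
    data_partitioning=lambda content: content.split(". "),
    information_generating=lambda parts, name: next(
        (p.split(": ", 1)[1] for p in parts if p.startswith(name)), None
    ),
)
```

Passing the modules in as callables mirrors the later remark that functions may be assigned to different functional modules as required.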

[0120] As a preferred mode of execution in the embodiments of the present invention, the data partitioning module includes:

[0121] a paragraph partitioning unit, for partitioning each piece of the chapter information into paragraphs according to preset paragraph features, and generating corresponding paragraph lists respectively; and

[0122] a sentence partitioning unit, for partitioning each paragraph in each paragraph list into sentences according to preset sentence features, and generating corresponding sentence lists respectively.

[0123] As a preferred mode of execution in the embodiments of the present invention, the information generating module is specifically employed for:

[0124] determining a first paragraph or a first sentence in the partition list in which the extracting field is located, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;

[0125] searching the first paragraph and the second paragraph, or the first sentence and the second sentence, by use of a preset searching rule, and determining the target information corresponding to the extracting field; and

[0126] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0127] As a preferred mode of execution in the embodiments of the present invention, the information generating module is further employed for:

[0128] performing a target detection process on the sentences in the partition list, and obtaining the target information corresponding to the extracting field; and

[0129] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0130] As a preferred mode of execution in the embodiments of the present invention, the information generating module is further employed for:

[0131] obtaining business status change information in the sentence lists according to the extracting rule, and generating key-value pair information corresponding to the text to be extracted according to the business status change information and the extracting field and storing the information into a database.

[0132] As a preferred mode of execution in the embodiments of the present invention, the device further comprises:

[0133] a denoising module, for denoising the key-value pair information, and storing the denoised key-value pair information in the database.

[0134] As a preferred mode of execution in the embodiments of the present invention, the extracting rule includes a regular expression.

[0135] Fig. 5 is a view schematically illustrating the internal structure of computer equipment according to an exemplary embodiment. With reference to Fig. 5, the computer equipment comprises a processor, a memory and a network interface connected to each other via a system bus. The processor of the computer equipment is employed to provide computing and controlling capabilities. The memory of the computer equipment includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores therein an operating system, a computer program and a database. The internal memory provides an environment for the running of the operating system and the computer program in the nonvolatile storage medium. The network interface of the computer equipment is employed to connect to an external terminal via a network for communication. The computer program realizes the text information extracting method when it is executed by the processor.

[0136] As understandable to persons skilled in the art, the structure illustrated in Fig. 5 is merely a block diagram of the partial structure relevant to the solution of the present invention, and does not constitute any restriction on the computer equipment to which the solution of the present invention is applied. The specific computer equipment may comprise more or fewer component parts than those illustrated in Fig. 5, may combine certain component parts, or may have a different layout of component parts.

[0137] As a preferred mode of execution in the embodiments of the present invention, the computer equipment comprises a memory, a processor and a computer program stored on the memory and operable on the processor, and the following steps are realized when the processor executes the computer program:

[0138] obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;

[0139] determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;

[0140] partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and

[0141] generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field.

[0142] As a preferred mode of execution in the embodiments of the present invention, when the processor executes the computer program, the following steps are further realized:

[0143] partitioning each piece of the chapter information into paragraphs according to preset paragraph features, and generating corresponding paragraph lists respectively; and

[0144] partitioning each paragraph in each paragraph list into sentences according to preset sentence features, and generating corresponding sentence lists respectively.

[0145] As a preferred mode of execution in the embodiments of the present invention, when the processor executes the computer program, the following steps are further realized:

[0146] determining a first paragraph or a first sentence in the partition list in which the extracting field is located, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;

[0147] searching the first paragraph and the second paragraph, or the first sentence and the second sentence, by use of a preset searching rule, and determining the target information corresponding to the extracting field; and

[0148] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0149] As a preferred mode of execution in the embodiments of the present invention, when the processor executes the computer program, the following steps are further realized:

[0150] performing a target detection process on the sentences in the partition list, and obtaining the target information corresponding to the extracting field; and

[0151] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0152] As a preferred mode of execution in the embodiments of the present invention, when the processor executes the computer program, the following steps are further realized:

[0153] obtaining business status change information in the sentence lists according to the extracting rule, and generating key-value pair information corresponding to the text to be extracted according to the business status change information and the extracting field and storing the information into a database.

[0154] As a preferred mode of execution in the embodiments of the present invention, when the processor executes the computer program, the following steps are further realized:

[0155] denoising the key-value pair information, and storing the denoised key-value pair information in the database.

[0156] As a preferred mode of execution in the embodiments of the present invention, the extracting rule includes a regular expression.

[0157] In the embodiments of the present invention, there is further provided a computer-readable storage medium storing a computer program thereon, and the following steps are realized when the computer program is executed by a processor:

[0158] obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;

[0159] determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;

[0160] partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and

[0161] generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field.

[0162] As a preferred mode of execution in the embodiments of the present invention, when the computer program is executed by a processor, the following steps are further realized:

[0163] obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;

[0164] determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;

[0165] partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and

[0166] generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field.

[0167] As a preferred mode of execution in the embodiments of the present invention, when the computer program is executed by a processor, the following steps are further realized:

[0168] determining a first paragraph or a first sentence in the partition list in which the extracting field is located, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;

[0169] searching the first paragraph and the second paragraph, or the first sentence and the second sentence, by use of a preset searching rule, and determining the target information corresponding to the extracting field; and

[0170] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0171] As a preferred mode of execution in the embodiments of the present invention, when the computer program is executed by a processor, the following steps are further realized:

[0172] performing a target detection process on the sentences in the partition list, and obtaining the target information corresponding to the extracting field; and

[0173] generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

[0174] As a preferred mode of execution in the embodiments of the present invention, when the computer program is executed by a processor, the following steps are further realized:

[0175] obtaining business status change information in the sentence lists according to the extracting rule, and generating key-value pair information corresponding to the text to be extracted according to the business status change information and the extracting field and storing the information into a database.

[0176] As a preferred mode of execution in the embodiments of the present invention, when the computer program is executed by a processor, the following steps are further realized:

[0177] denoising the key-value pair information, and storing the denoised key-value pair information in the database.

[0178] As a preferred mode of execution in the embodiments of the present invention, the extracting rule includes a regular expression.

[0179] In short, the technical solutions provided by the embodiments of the present invention bring about the following advantageous effects.

[0180] In the text information extracting method, and corresponding device, computer equipment and storage medium provided by the embodiments of the present invention, a text to be extracted and an extracting rule corresponding to the text to be extracted are obtained, wherein the extracting rule includes an extracting field; a chapter location of each piece of directory information of the file directory in the text to be extracted is determined based on a file directory of the text to be extracted, and chapter information is generated; the chapter information is partitioned according to a preset rule, and a corresponding partition list is generated; and key-value pair information corresponding to the text to be extracted is generated according to the partition list and the extracting rule and stored into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field. On the one hand, the present invention enhances the efficiency of text extraction, avoids problems such as omissions and errors in information extraction, and enhances the precision of text extraction. On the other hand, by dividing the long text, the present invention avoids the infinite backtracking that might be encountered in regular expression matching, enhances the fault tolerance of the code, and reduces the time consumed by the overall operation.

[0181] In the text information extracting method, and corresponding device, computer equipment and storage medium provided by the embodiments of the present invention, by partitioning each piece of the chapter information into paragraphs according to preset paragraph features, and generating corresponding paragraph lists respectively, and partitioning each paragraph in each paragraph list into sentences according to preset sentence features, and generating corresponding sentence lists respectively, the text is precisely positioned at the levels of chapter, paragraph and sentence through the directory hierarchy positioning mode, so that relevant information can be precisely located in and extracted from the text to be extracted.

[0182] In the text information extracting method, and corresponding device, computer equipment and storage medium provided by the embodiments of the present invention, by denoising the key-value pair information and storing the denoised key-value pair information in the database, the key-value pair information extracted from the text is further screened and filtered, whereby the precision in long text information extraction is effectively enhanced.

[0183] It should be noted that, when the text information extracting device provided by the aforementioned embodiment triggers an extracting business, its division into the aforementioned functional modules is described merely by way of example. In actual application, the aforementioned functions can be assigned to different functional modules as required; that is to say, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the text information extracting device provided by the aforementioned embodiment belongs to the same inventive concept as the text information extracting method; in other words, the device is based on the text information extracting method. See the method embodiment for its specific implementation process, which is not repeated here.

[0184] As comprehensible to persons ordinarily skilled in the art, all or some of the steps in the aforementioned embodiments can be completed via hardware, or via a program instructing relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium can be a read-only memory, a magnetic disk, an optical disk, etc.

[0185] The foregoing embodiments are merely preferred embodiments of the present invention, and they are not to be construed as restrictive of the present invention. Any amendment, equivalent substitution, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

What is claimed is:

1. A text information extracting method, characterized in that the method comprises the following steps:
obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;
determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;
partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and
generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and target information corresponding to the extracting field.

2. The text information extracting method according to Claim 1, characterized in that the partition list includes paragraph lists and sentence lists, and that the step of partitioning the chapter information according to a preset rule, and generating a corresponding partition list includes:
partitioning each piece of the chapter information into paragraphs according to preset paragraph features, and generating corresponding paragraph lists respectively; and
partitioning each paragraph in each paragraph list into sentences according to preset sentence features, and generating corresponding sentence lists respectively.

3. The text information extracting method according to Claim 1 or 2, characterized in that, when the target information corresponding to the extracting field is long text information, the step of generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database includes:
determining a first paragraph or a first sentence in the partition list in which the extracting field is located, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;
searching the first paragraph and the second paragraph, or the first sentence and the second sentence, by use of a preset searching rule, and determining the target information corresponding to the extracting field; and
generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

4. The text information extracting method according to Claim 1 or 2, characterized in that, when the target information corresponding to the extracting field is short text information, the step of generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database includes:
performing a target detection process on the sentences in the partition list, and obtaining the target information corresponding to the extracting field; and
generating key-value pair information corresponding to the text to be extracted according to the extracting field and the target information and storing the information into a database.

5. The text information extracting method according to Claim 2, characterized in that, when the extracting field is status change, the step of generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database includes:
obtaining business status change information in the sentence lists according to the extracting rule, and generating key-value pair information corresponding to the text to be extracted according to the business status change information and the extracting field and storing the information into a database.

6. The text information extracting method according to Claim 1 or 2, characterized in further comprising, prior to the step of storing the key-value pair information into a database:
denoising the key-value pair information, and storing the denoised key-value pair information into the database.

7. The text information extracting method according to Claim 1 or 2, characterized in that the extracting rule includes a regular expression.

8. A text information extracting device, characterized in that the device comprises:
a data obtaining module, for obtaining a text to be extracted and an extracting rule corresponding to the text to be extracted, wherein the extracting rule includes an extracting field;
a chapter obtaining module, for determining, based on a file directory of the text to be extracted, a chapter location of each piece of directory information of the file directory in the text to be extracted, and generating chapter information;
a data partitioning module, for partitioning the chapter information according to a preset rule, and generating a corresponding partition list; and
an information generating module, for generating key-value pair information corresponding to the text to be extracted according to the partition list and the extracting rule and storing the information into a database, wherein the key includes an extracting field, and the value includes the partition list and information corresponding to the extracting field.

9. A computer equipment, comprising a memory, a processor and a computer program stored on the memory and operable on the processor, characterized in that the method steps according to any one of Claims 1 to 7 are realized when the processor executes the computer program.

10. A computer-readable storage medium, storing a computer program thereon, characterized in that the method steps according to any one of Claims 1 to 7 are realized when the computer program is executed by a processor.
