CN111460151A - Method, device and equipment for classifying bulletin text formats - Google Patents

Publication number
CN111460151A
Authority
CN
China
Prior art keywords
text
bulletin
processed
paragraph
sequence number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010231944.5A
Other languages
Chinese (zh)
Inventor
王愈
张盛
陈强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Valueonline Technology Co ltd
Original Assignee
Shenzhen Valueonline Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Valueonline Technology Co ltd filed Critical Shenzhen Valueonline Technology Co ltd
Priority to CN202010231944.5A priority Critical patent/CN111460151A/en
Publication of CN111460151A publication Critical patent/CN111460151A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques

Abstract

The application belongs to the technical field of data processing and provides a method, an apparatus, a device and a computer-readable storage medium for classifying bulletin text formats. The method comprises: acquiring a bulletin text to be processed; structuring the bulletin text to be processed to obtain a structured bulletin text; extracting a text content sequence number from the structured bulletin text according to a preset regular condition; and obtaining the format type of the bulletin text to be processed based on the text content sequence number. Because the bulletin text to be processed is structured first, content can be extracted from the structured bulletin text in a targeted manner according to the preset regular condition and used to classify the bulletin text. This addresses a problem of the prior art, in which the formats of three-meeting bulletin texts are classified directly by regular-expression processing of character strings: it is difficult to cover all extraction conditions, and the conditions easily conflict. The bulletin text can thereby be classified accurately.

Description

Method, device and equipment for classifying bulletin text formats
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device and a computer readable storage medium for classifying a bulletin text format.
Background
At present, listed companies in the domestic capital market are required to disclose company information to the public, in particular the bulletin texts of the three meetings (the board of directors, the board of supervisors and the shareholders' meeting). The securities regulator has clear specifications and requirements for the disclosure of three-meeting bulletin texts, and many listed companies have been penalized by the regulator because their three-meeting bulletin disclosures were non-standard or non-compliant. A listed company therefore needs to study the three-meeting bulletin texts disclosed by other listed companies in the domestic capital market, and to extract from them valuable information as well as non-standard bulletin content that is useful for reference.
The formats of three-meeting bulletin texts are varied and include, for example, a standard guide format, a serial number protocol format and a text paragraph format. When studying the three-meeting bulletin texts disclosed by other listed companies, the format of each bulletin text is therefore generally classified first, and subsequent processing is then performed according to the classified format, so that valuable bulletin content and non-standard bulletin content useful for reference can be extracted. In the prior art, however, the content extracted from a bulletin text must still be reviewed manually to ensure its accuracy, and this manual review suffers from problems such as low extraction accuracy and low extraction efficiency.
Disclosure of Invention
The embodiments of the application provide a method and an apparatus for classifying bulletin text formats, which can solve the prior-art problems that, when bulletin texts are classified, it is difficult to cover all extraction conditions and the conditions easily conflict.
In a first aspect, an embodiment of the present application provides a method for classifying a bulletin text format, including:
acquiring a bulletin text to be processed;
structuring the bulletin text to be processed to obtain a structured bulletin text;
extracting a text content sequence number from the structured bulletin text according to a preset regular condition;
and obtaining the format type of the bulletin text to be processed based on the text content sequence number.
In a possible implementation manner of the first aspect, the acquiring a bulletin text to be processed includes:
acquiring a target bulletin file;
analyzing the bulletin text to be processed from the target bulletin file according to a preset text analyzer; or, performing character recognition on the target bulletin file, and forming a bulletin text to be processed based on the character recognition result.
In a possible implementation manner of the first aspect, the structuring the bulletin text to be processed to obtain a structured bulletin text includes:
searching paragraph characteristics of the bulletin text to be processed;
according to the paragraph characteristics, paragraph division is carried out on the bulletin text to be processed, so that paragraph texts of the bulletin text to be processed are obtained;
and structuring the text content of the bulletin text to be processed into a grid table by taking each paragraph text as a unit grid to obtain a structured bulletin text.
In a possible implementation manner of the first aspect, the obtaining the format type of the bulletin text to be processed based on the text content sequence number includes:
obtaining a directory structure of the bulletin text to be processed according to the text content sequence number;
and carrying out keyword identification on the directory structure to obtain the format type of the bulletin text to be processed.
In a possible implementation manner of the first aspect, the obtaining a directory structure of the bulletin text to be processed according to the text content sequence number includes:
acquiring a hierarchy matched with the text content sequence number;
extracting the paragraph text corresponding to the text content sequence number from the structured bulletin text;
and forming a directory structure of the bulletin text to be processed according to the text content sequence number, the hierarchy matched with the text content sequence number and the paragraph text corresponding to the text content sequence number.
In a possible implementation manner of the first aspect, the carrying out keyword identification on the directory structure to obtain the format type of the bulletin text to be processed includes:
screening out a target level of the directory structure;
and obtaining the format type of the bulletin text to be processed according to the result of matching the paragraph text corresponding to the target level with the preset keywords.
In a possible implementation manner of the first aspect, before extracting a text content sequence number from the structured bulletin text according to a preset regular condition, the method further includes:
obtaining a bulletin text sample;
identifying text content sequence number sample characteristics of the bulletin text samples;
and carrying out preset logic algorithm processing on the text content serial number sample characteristics to obtain the preset regular condition.
In a second aspect, an embodiment of the present application provides an apparatus for classifying a bulletin text format, including:
the acquisition module is used for acquiring the bulletin text to be processed;
the structuring module is used for structuring the bulletin text to be processed to obtain a structured bulletin text;
the extraction module is used for extracting a text content sequence number from the structured bulletin text according to a preset regular condition;
and the identification module is used for obtaining the format type of the bulletin text to be processed based on the text content sequence number.
In a possible implementation manner of the second aspect, the obtaining module includes:
the acquisition submodule is used for acquiring a target bulletin file;
the analysis submodule is used for analyzing the bulletin text to be processed from the target bulletin file according to a preset text analyzer;
alternatively, the acquisition module further comprises:
a character recognition submodule, used for carrying out character recognition on the target bulletin file and forming a bulletin text to be processed based on the character recognition result.
In one possible implementation manner of the second aspect, the structuring module includes:
the search submodule is used for searching paragraph characteristics of the bulletin text to be processed;
the division submodule is used for carrying out paragraph division on the bulletin text to be processed according to the paragraph characteristics to obtain the paragraph text of the bulletin text to be processed;
and the structuring submodule is used for structuring the text content of the bulletin text to be processed into a grid table by taking each paragraph text as a unit grid to obtain the structured bulletin text.
In one possible implementation manner of the second aspect, the identification module includes:
the directory structure query submodule is used for obtaining the directory structure of the bulletin text to be processed according to the text content sequence number;
and the keyword identification submodule is used for carrying out keyword identification on the directory structure to obtain the format type of the bulletin text to be processed.
In a possible implementation manner of the second aspect, the directory structure query submodule includes:
the acquisition unit is used for acquiring the hierarchy matched with the text content sequence number;
the extracting unit is used for extracting the paragraph text corresponding to the text content sequence number from the structured bulletin text;
and the forming unit is used for forming a directory structure of the bulletin text to be processed according to the text content serial number, the hierarchy matched with the text content serial number and the paragraph text corresponding to the text content serial number.
In one possible implementation manner of the second aspect, the keyword recognition sub-module includes:
the screening unit is used for screening out a target hierarchy of the directory structure;
and the matching unit is used for obtaining the format type of the bulletin text to be processed according to the result of matching the paragraph text corresponding to the target level with the preset keywords.
In a third aspect, an embodiment of the present application provides a classification device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor, when executing the computer program, implements the method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on a classification device, causes the classification device to perform the method of any one of the first aspect.
Compared with the prior art, the embodiment of the application has the advantages that:
According to the embodiments of the application, the classification device can acquire a bulletin text to be processed; structure the bulletin text to be processed to obtain a structured bulletin text; extract a text content sequence number from the structured bulletin text according to a preset regular condition; and obtain the format type of the bulletin text to be processed based on the text content sequence number. Because the bulletin text to be processed is structured first, content can be extracted from the structured bulletin text in a targeted manner according to the preset regular condition and used to classify the bulletin text. This addresses a problem of the prior art, in which the formats of three-meeting bulletin texts are classified directly by regular-expression processing of character strings: it is difficult to cover all extraction conditions, and the conditions easily conflict. The bulletin text can thereby be classified accurately.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for classifying a bulletin text format according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of the specific process of step S102 in Fig. 1 of the method for classifying a bulletin text format according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of the steps performed before step S103 in Fig. 1 of the method for classifying a bulletin text format according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of the specific process of step S104 in Fig. 1 of the method for classifying a bulletin text format according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a classification device for a bulletin text format according to a second embodiment of the present application;
Fig. 6 is a schematic structural diagram of a terminal device according to a third embodiment of the present application;
Fig. 7 is an exemplary diagram of a bulletin text in the standard guide format provided in an embodiment of the present application;
Fig. 8 is an exemplary diagram of a bulletin text in the serial number protocol format provided in an embodiment of the present application;
Fig. 9 is an exemplary diagram of a bulletin text in the text paragraph format provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The classification method of the bulletin text format provided by the embodiment of the application can be applied to classification equipment, the classification equipment can be computing equipment such as a desktop computer, a notebook computer, a palm computer and a cloud server, and the embodiment of the application does not limit the specific type of the classification equipment.
The technical solutions provided in the embodiments of the present application will be described below with specific embodiments.
Example one
Fig. 1 is a schematic flowchart of a method for classifying a bulletin text format provided in the embodiment of the present application. The method may be applied to a classification device and includes the following steps:
Step S101, obtaining a bulletin text to be processed.
As an example and not by way of limitation, the classification device may obtain the bulletin text to be processed by obtaining a target bulletin file and parsing the bulletin text to be processed from the target bulletin file with a preset text parser. The target bulletin file may be a three-meeting bulletin PDF file, and the preset text parser may be a PDF text parser. It is understood that the classification device can fetch three-meeting bulletin PDF files from the document library of an exchange through a preset automatic acquisition program, such as a web crawler.
In a possible implementation manner, after the classification device obtains the target bulletin file, character recognition may be performed on the target bulletin file, and the bulletin text to be processed is formed based on the result of the character recognition. Specifically, character recognition is performed on the target bulletin file by optical means, such as optical character recognition (OCR), to obtain the bulletin text to be processed.
Step S102, structuring the bulletin text to be processed to obtain a structured bulletin text.
Structuring refers to formatting the content of the bulletin text to be processed, that is, classifying its content by form, so that the relevant content can be extracted from the structured bulletin text in a targeted manner.
As an example and not by way of limitation, referring to Fig. 2, which shows the specific flow of step S102 in Fig. 1, the process of structuring the bulletin text to be processed to obtain the structured bulletin text may specifically be:
Step S201, searching for paragraph features of the bulletin text to be processed.
The paragraph features may refer to features that separate the paragraph texts of the bulletin text to be processed, such as line breaks.
Step S202, paragraph division is carried out on the bulletin text to be processed according to the paragraph characteristics, and paragraph texts of the bulletin text to be processed are obtained.
Step S203, taking each paragraph text as a unit grid, structuring the text content of the bulletin text to be processed into a grid table, and obtaining a structured bulletin text.
In a possible implementation manner, the paragraph texts in the structured bulletin text are ordered, and a paragraph number corresponding to each paragraph text is recorded in the structured bulletin text.
It can be understood that each paragraph text of the structured bulletin text occupies one cell (unit grid) and has a corresponding paragraph number, so the scope of each paragraph text can be distinguished and its information content located more precisely, which markedly reduces positioning errors. The key content that determines the format type can then be extracted and evaluated within a precise range, so that the format type of the bulletin text to be processed can be determined accurately and the bulletin text classified accurately.
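Steps S201 to S203 can be sketched as follows. This is a minimal illustration only, assuming that line breaks are the sole paragraph feature and that the "grid table" is a simple list of numbered cells; the names `Cell` and `structure_bulletin` are hypothetical and do not come from the application.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cell:
    paragraph_no: int  # 1-based paragraph number within the bulletin text
    text: str          # paragraph text stored in this cell


def structure_bulletin(raw_text: str) -> list[Cell]:
    """Split on line breaks (the assumed paragraph feature) and number each paragraph."""
    paragraphs = [line.strip() for line in raw_text.splitlines()]
    paragraphs = [p for p in paragraphs if p]  # drop empty lines
    return [Cell(i, p) for i, p in enumerate(paragraphs, start=1)]


sample = "1. Meeting convening\n\nThe meeting was held.\n2. Resolutions\n"
cells = structure_bulletin(sample)
for cell in cells:
    print(cell.paragraph_no, cell.text)
```

Keeping the paragraph number next to the text is what lets later steps refer back to a precise location in the bulletin.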
Step S103, extracting a text content sequence number from the structured bulletin text according to a preset regular condition.
It should be noted that the preset regular condition refers to a preset conditional expression for matching data features in the structured bulletin text. The text content sequence number refers to the sequence number contained in each paragraph text of the structured bulletin text, such as the Chinese character sequence numbers "one, two, three, ..." and "(one), (two), (three), ...", and the numeric sequence numbers "1, 2, 3, ..." and "(1), (2), (3), ...". By way of example and not limitation, the preset regular condition of the embodiments of the present application may be a regular expression of the form "[ ((]…". Such an expression extracts, among other alternatives, a bracketed number such as (1), where the number has 1 to 2 digits; or a decimal number such as 1.1, where the fractional part has 0 to 2 digits.
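The extraction in step S103 can be illustrated with the sketch below. The patterns are hypothetical stand-ins and are not the expression used in the application. They assume that the Chinese character sequence numbers are the numerals 一 to 十 followed by "、" or enclosed in brackets, and they follow the stated limits of 1 to 2 digits for bracketed numbers and 0 to 2 fractional digits for decimal numbers.

```python
import re

# Hypothetical patterns, tried in order; each is anchored at the start of a paragraph.
SERIAL_PATTERNS = [
    ("chinese", re.compile(r"^[一二三四五六七八九十]+、")),
    ("bracketed_chinese", re.compile(r"^[(（][一二三四五六七八九十]+[)）]")),
    ("bracketed_number", re.compile(r"^[(（]\d{1,2}[)）]")),
    ("decimal", re.compile(r"^\d{1,2}\.\d{0,2}")),
]


def extract_serial(paragraph: str):
    """Return (kind, sequence_number) for the leading sequence number, or None."""
    text = paragraph.lstrip()
    for kind, pattern in SERIAL_PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.group(0)
    return None


print(extract_serial("(1) Voting result"))
print(extract_serial("1.1 Attendance"))
print(extract_serial("The meeting was held."))
```

Applying one small pattern per sequence number style to one cell at a time, instead of a single expression over the whole text, is one way to keep the conditions from conflicting.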
In a possible implementation manner, referring to Fig. 3, which shows the flow performed before step S103 in Fig. 1, before the extracting of the text content sequence number from the structured bulletin text according to the preset regular condition, the method further includes:
Step S301, obtaining a bulletin text sample.
Here, a bulletin text sample may be a sample whose text format has already been classified. By way of example and not limitation, the classification device may obtain the bulletin text sample by obtaining a target bulletin file sample and parsing the bulletin text sample from it with a preset text parser. The target bulletin file sample may be a three-meeting bulletin PDF sample, and the preset text parser may be a PDF text parser. It is understood that the classification device may fetch three-meeting bulletin PDF samples from the document libraries of the Shanghai and Shenzhen stock exchanges through a preset automatic acquisition program, such as a web crawler. Alternatively, the classification device may obtain the bulletin text sample directly from a third-party database server.
Step S302, identifying text content sequence number sample features of the bulletin text sample.
A text content sequence number sample feature may refer to a feature that characterizes the text content sequence numbers of the bulletin text sample, such as the leading characters of a paragraph. The text content sequence numbers here include Chinese character sequence numbers such as "one, two, three, ..." and "(one), (two), ...", and numeric sequence numbers such as "1, 2, ..." and "(1), (2), ...".
Step S303, processing the text content sequence number sample features with a preset logic algorithm to obtain the preset regular condition. The preset logic algorithm may be a mathematical statistics algorithm.
It can be understood that, in the embodiments of the present application, the regular expression can be set by screening the text content sequence numbers that follow the line breaks of the bulletin text samples, that is, at the start of paragraphs, in preparation for extracting text content sequence numbers from the structured bulletin text according to the regular expression.
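One possible reading of steps S301 to S303 is a frequency count over paragraph prefixes. The sketch below is an assumption about what the "mathematical statistics algorithm" might look like, not the algorithm of the application: it reduces each leading sequence number to a shape (`N` for a run of digits, `C` for a run of Chinese numerals) and keeps the shapes that occur at least `min_count` times. All names are hypothetical.

```python
import re
from collections import Counter

CN_NUMERALS = "一二三四五六七八九十"
# Candidate prefix: optional opening bracket, a number or Chinese numerals,
# then an optional closing bracket or delimiter.
PREFIX = re.compile(r"^[(（]?(?:\d+(?:\.\d+)?|[%s]+)[)）、.．]?" % CN_NUMERALS)


def prefix_shape(paragraph: str):
    """Map a leading sequence number to an abstract shape, or None if there is none."""
    match = PREFIX.match(paragraph.strip())
    if not match:
        return None
    shape = re.sub(r"\d+", "N", match.group(0))
    return re.sub("[%s]+" % CN_NUMERALS, "C", shape)


def frequent_shapes(sample_paragraphs, min_count=2):
    """Return the prefix shapes seen at least min_count times, most frequent first."""
    counts = Counter(s for s in map(prefix_shape, sample_paragraphs) if s)
    return [shape for shape, n in counts.most_common() if n >= min_count]


samples = ["一、A", "二、B", "(1) x", "(2) y", "(3) z", "1.1 foo", "plain text"]
print(frequent_shapes(samples))
```

Each surviving shape could then be turned by hand into one alternative of the preset regular condition.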
Step S104, obtaining the format type of the bulletin text to be processed based on the text content sequence number.
The format types of the bulletin text in the embodiments of the present application may include a standard guide format, a serial number protocol format and a text paragraph format. By way of example and not limitation, Fig. 7 shows an example of a bulletin text in the standard guide format, Fig. 8 shows an example of a bulletin text in the serial number protocol format, and Fig. 9 shows an example of a bulletin text in the text paragraph format.
Referring to Fig. 4, which shows the specific flow of step S104 in Fig. 1, the format type of the bulletin text to be processed may be obtained based on the text content sequence number as follows:
Step S401, obtaining a directory structure of the bulletin text to be processed according to the text content sequence number.
The directory structure is a logical structure that represents the content structure of the bulletin text to be processed.
It should be noted that a bulletin text in the text paragraph format has no text content sequence numbers. Therefore, if no text content sequence number is found when extracting from the structured bulletin text according to the preset regular condition, the format type of the bulletin text to be processed can be determined directly to be the text paragraph format.
Specifically, the specific process of obtaining the directory structure of the bulletin text to be processed according to the text content sequence number may be:
First, obtaining the hierarchy matched with the text content sequence number.
The hierarchy is the level of the text content sequence number. For example, sequence numbers of the form "one, two, three, ..." belong to the first level; sequence numbers of the form "(one), (two), (three), ..." belong to the second level; sequence numbers of the form "1, 2, 3, ..." belong to the third level; and sequence numbers of the form "(1), (2), (3), ..." belong to the fourth level.
Second, extracting the paragraph text corresponding to the text content sequence number from the structured bulletin text.
It can be understood that the text content sequence number corresponding to each paragraph text is unique, so the paragraph text corresponding to a text content sequence number can be extracted from the structured bulletin text according to that sequence number.
Third, forming the directory structure of the bulletin text to be processed according to the text content sequence number, the hierarchy matched with the text content sequence number and the paragraph text corresponding to the text content sequence number.
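The three steps above can be sketched as follows. The level patterns are hypothetical and follow the four example levels only; the `Entry` type and `build_directory` function are illustrative names, and the sketch assumes the Chinese character sequence numbers are the numerals 一 to 十.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    serial: str  # text content sequence number, e.g. "(1)"
    level: int   # hierarchy matched with the sequence number
    text: str    # paragraph text that follows the sequence number


# Hypothetical level patterns mirroring the four example levels.
LEVEL_PATTERNS = [
    (1, re.compile(r"^[一二三四五六七八九十]+、")),
    (2, re.compile(r"^[(（][一二三四五六七八九十]+[)）]")),
    (3, re.compile(r"^\d{1,2}[.、．]")),
    (4, re.compile(r"^[(（]\d{1,2}[)）]")),
]


def build_directory(paragraphs):
    """Collect (sequence number, level, paragraph text) for every numbered paragraph."""
    directory = []
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        for level, pattern in LEVEL_PATTERNS:
            match = pattern.match(paragraph)
            if match:
                rest = paragraph[match.end():].strip()
                directory.append(Entry(match.group(0), level, rest))
                break
    return directory


paragraphs = ["一、Meeting convening", "(一) Time", "1. Morning session",
              "(1) Roll call", "Body text."]
directory = build_directory(paragraphs)
for entry in directory:
    print(entry)
```

Paragraphs without a sequence number are simply skipped, so an empty directory corresponds to the case in which no text content sequence number exists.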
Step S402, carrying out keyword identification on the directory structure to obtain the format type of the bulletin text to be processed.
It can be understood that bulletin texts of different format types have different paragraph text at the same level, so the format type of the bulletin text to be processed can be obtained by performing keyword recognition on the paragraph text of a specific level.
Specifically, a target level of the directory structure is first screened out, and the format type of the bulletin text to be processed is then obtained according to the result of matching the paragraph text corresponding to the target level against preset keywords. If the target level is the first level, the preset keywords may include holding, attending, reviewing, resolution, voting, and standby.
For example, as shown in Fig. 7, if the paragraph texts corresponding to the first level of the bulletin text are "board of supervisors meeting holding situation", "board of supervisors meeting examination situation" and "review text", then the number of matches between these paragraph texts and the preset keywords is 3; if the preset matching threshold is 3, the format type of the bulletin text illustrated in Fig. 7 is the standard guide format.
Of course, if the first level is the target level and the result of matching its paragraph text against the preset keywords does not meet the preset matching threshold, a lower level such as the second level or the third level can be used as the target level and its paragraph text matched against the preset keywords, so that the format type of the bulletin text to be processed is determined accurately. Since the matching process for a lower level is the same as that for the first level, it is not described again here.
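The keyword matching and the fallback to lower levels can be sketched as follows. Several points are assumptions rather than statements of the application: a match is counted once per paragraph text that contains any preset keyword, matching is a case-insensitive substring test, and a bulletin text that has sequence numbers but never reaches the threshold is treated as the serial number protocol format. The function name `classify_format` is hypothetical.

```python
KEYWORDS = ("holding", "attending", "reviewing", "resolution", "voting", "standby")


def classify_format(directory, threshold=3, max_level=4):
    """directory: iterable of (level, paragraph_text) pairs from the directory structure."""
    entries = list(directory)
    if not entries:
        # No text content sequence number was extracted.
        return "text paragraph format"
    # Try the first level, then fall back to lower levels.
    for level in range(1, max_level + 1):
        texts = [text.lower() for lv, text in entries if lv == level]
        hits = sum(any(keyword in text for keyword in KEYWORDS) for text in texts)
        if hits >= threshold:
            return "standard guide format"
    # Assumption: numbered text that never meets the threshold is the remaining format.
    return "serial number protocol format"


guide = [(1, "Holding of the meeting"), (1, "Reviewing of proposals"),
         (1, "Standby documents")]
print(classify_format(guide))
print(classify_format([]))
```

The threshold of 3 mirrors the example for Fig. 7; in practice it would be tuned against classified bulletin text samples.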
The advantageous effects of the embodiments of the present application are explained in two aspects below.
On the one hand, the bulletin text to be processed is first structured, so that content can be extracted from the structured bulletin text in a targeted manner according to the preset regular condition and the bulletin text can then be classified. This addresses the problems of the prior art, in which the formats of the three bulletin texts are classified directly by string regularization processing, namely that it is difficult to cover all extraction conditions and that conflicts between conditions easily arise, and thereby achieves accurate classification of bulletin texts.
On the other hand, the prior art also classifies the formats of bulletin texts by machine learning. However, machine learning requires a large number of standard learning samples, all of which must be labeled manually, so that classification accuracy is low and practical application value is limited. The embodiments of the present application classify the format through structuring, sequence number extraction, and keyword matching, and therefore do not depend on such manually labeled samples.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Example two
Fig. 5 shows a block diagram of an apparatus for classifying a bulletin text format provided in the embodiment of the present application. For convenience of description, only the parts relevant to the embodiment of the present application are shown.
Referring to fig. 5, the apparatus includes:
an obtaining module 51, configured to obtain a bulletin text to be processed;
a structure processing module 52, configured to perform structuring processing on the bulletin text to be processed to obtain a structured bulletin text;
an extracting module 53, configured to extract a text content sequence number from the structured bulletin text according to a preset regular condition;
and an identifying module 54, configured to obtain the format type of the bulletin text to be processed based on the text content sequence number.
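The data flow between the four modules can be pictured as a simple pipeline. The sketch below is only an illustration of that wiring under stated assumptions: the class name `BulletinClassifier`, the callable signatures, and the placeholder stages in the demo are hypothetical and stand in for the real parsing, structuring, extraction, and identification logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BulletinClassifier:
    # Each field mirrors one module of Fig. 5; the signatures are assumptions.
    obtain: Callable[[str], str]                # module 51: source -> bulletin text
    structure: Callable[[str], List[str]]       # module 52: text -> structured text
    extract: Callable[[List[str]], List[str]]   # module 53: -> sequence numbers
    identify: Callable[[List[str], List[str]], Optional[str]]  # module 54

    def classify(self, source: str) -> Optional[str]:
        text = self.obtain(source)
        structured = self.structure(text)
        sequence_numbers = self.extract(structured)
        # The identifying stage sees both the structured text and the numbers,
        # since it later looks up the paragraph text for each sequence number.
        return self.identify(structured, sequence_numbers)


# Placeholder stages, for illustration only.
demo = BulletinClassifier(
    obtain=lambda source: source,
    structure=lambda text: [ln.strip() for ln in text.splitlines() if ln.strip()],
    extract=lambda cells: [c.split(".")[0] for c in cells if c[:1].isdigit()],
    identify=lambda cells, numbers: "numbered" if numbers else None,
)
label = demo.classify("1. Heading\nbody text\n2. Heading")
```

Keeping each stage as an injected callable lets a stage be replaced, for example swapping a text analyzer for character recognition, without touching the others.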
In one possible implementation, the obtaining module includes:
an acquisition submodule, configured to acquire a target bulletin file;
an analysis submodule, configured to analyze the bulletin text to be processed from the target bulletin file according to a preset text analyzer;
the obtaining module further includes:
a recognition module, configured to perform character recognition on the target bulletin file and form the bulletin text to be processed based on the character recognition result.
In one possible implementation, the structure processing module includes:
a search submodule, configured to search for paragraph characteristics of the bulletin text to be processed;
a division submodule, configured to divide the bulletin text to be processed into paragraphs according to the paragraph characteristics to obtain the paragraph texts of the bulletin text to be processed;
and a structuring submodule, configured to structure the text content of the bulletin text to be processed into a grid table, with each paragraph text as one cell, to obtain the structured bulletin text.
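As a rough illustration of the three submodules above, the following sketch splits a bulletin text into paragraphs and stores each paragraph as one cell of a single-column table. It assumes that the paragraph characteristic is a line break and that a list of row-indexed cells is an adequate stand-in for the grid table; the names `Cell`, `split_paragraphs`, and `structure_bulletin` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    row: int   # position of the paragraph within the bulletin text
    text: str  # paragraph text held by this cell


def split_paragraphs(text: str) -> List[str]:
    """Divide text into paragraphs, assuming a line break marks a paragraph end."""
    parts = re.split(r"\r?\n+", text)
    # Drop empty fragments and surrounding whitespace (indentation included).
    return [part.strip() for part in parts if part.strip()]


def structure_bulletin(text: str) -> List[Cell]:
    """Structure the text into a one-column grid, one paragraph per cell."""
    return [Cell(row=i, text=p) for i, p in enumerate(split_paragraphs(text))]


cells = structure_bulletin("Title\n\n  First paragraph \nSecond paragraph")
```

A real bulletin may need richer paragraph characteristics (indentation, font, or layout coordinates from the parsed file); the line-break rule is only the simplest case.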
In one possible implementation, the identifying module includes:
the directory structure query submodule is used for obtaining the directory structure of the bulletin text to be processed according to the text content sequence number;
and the keyword identification submodule is used for carrying out keyword identification on the directory structure to obtain the format type of the bulletin text to be processed.
In one possible implementation, the directory structure query submodule includes:
an acquisition unit, configured to acquire the hierarchy matched with the text content sequence number;
an extracting unit, configured to extract the paragraph text corresponding to the text content sequence number from the structured bulletin text;
and a forming unit, configured to form the directory structure of the bulletin text to be processed according to the text content sequence number, the hierarchy matched with the text content sequence number, and the paragraph text corresponding to the text content sequence number.
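One way to picture the three units above is a small routine that recognizes a sequence number at the start of each paragraph, maps its form to a hierarchy, and keeps the remaining paragraph text. The sketch assumes that the preset regular condition can be expressed as regular expressions and that Chinese numerals followed by "、", parenthesized Chinese numerals, and Arabic numerals mark the first, second, and third hierarchies respectively. These patterns and the names `Entry`, `LEVEL_PATTERNS`, and `build_directory` are illustrative assumptions, not taken from the application.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    sequence_number: str  # e.g. "一、"
    level: int            # hierarchy matched with the sequence number
    text: str             # paragraph text following the sequence number


# Assumed mapping from sequence number form to hierarchy.
LEVEL_PATTERNS = [
    (1, re.compile(r"\s*[一二三四五六七八九十]+、")),
    (2, re.compile(r"\s*[（(][一二三四五六七八九十]+[）)]")),
    (3, re.compile(r"\s*\d+[.、．]")),
]


def build_directory(paragraphs: List[str]) -> List[Entry]:
    """Form a directory from the paragraphs that begin with a sequence number."""
    directory: List[Entry] = []
    for paragraph in paragraphs:
        for level, pattern in LEVEL_PATTERNS:
            match = pattern.match(paragraph)  # anchored at the paragraph start
            if match:
                number = match.group(0).strip()
                directory.append(Entry(number, level, paragraph[match.end():].strip()))
                break
    return directory


# Hypothetical paragraphs, used only for illustration.
toc = build_directory([
    "一、Holding of the meeting",
    "（一）Attendance",
    "Body text without a sequence number",
    "1. Voting result",
])
```

Paragraphs without a recognized sequence number are simply left out of the directory, which keeps the directory limited to heading-like paragraphs.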
In one possible implementation, the keyword identification submodule includes:
a screening unit, configured to screen out a target hierarchy of the directory structure;
and a matching unit, configured to obtain the format type of the bulletin text to be processed according to the result of matching the paragraph text corresponding to the target hierarchy with the preset keywords.
It should be noted that the information interaction and execution processes between the above modules/units are based on the same concept as the method embodiment of the present application. For their specific functions and technical effects, reference may be made to the method embodiment, and details are not repeated here.
Example three
Fig. 6 is a schematic structural diagram of a classification device provided in an embodiment of the present application. As shown in Fig. 6, the classification device 6 of this embodiment includes: at least one processor 60, a memory 61, and a computer program 62 stored in the memory 61 and executable on the at least one processor 60. The processor 60 implements the method steps in the first embodiment when executing the computer program 62.
The classification device 6 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The classification device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will appreciate that Fig. 6 is merely an example of the classification device 6 and does not constitute a limitation on it. The device may include more or fewer components than those shown, combine some components, or use different components, and may, for example, further include an input-output device, a network access device, and the like.
The processor 60 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
In some embodiments, the memory 61 may be an internal storage unit of the classification device 6, such as a hard disk or internal memory of the classification device 6. In other embodiments, the memory 61 may be an external storage device of the classification device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the classification device 6. Further, the memory 61 may include both an internal storage unit and an external storage device of the classification device 6. The memory 61 is used for storing an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 61 may also be used for temporarily storing data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program can implement the method steps in the first embodiment.
The embodiment of the present application further provides a computer program product which, when run on a classification device, causes the classification device to implement the method steps in the first embodiment.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for classifying a bulletin text format, comprising:
acquiring a bulletin text to be processed;
performing structuring processing on the bulletin text to be processed to obtain a structured bulletin text;
extracting a text content sequence number from the structured bulletin text according to a preset regular condition;
and obtaining the format type of the bulletin text to be processed based on the text content sequence number.
2. The method for classifying bulletin text formats as claimed in claim 1, wherein the obtaining of bulletin texts to be processed comprises:
acquiring a target bulletin file;
analyzing the bulletin text to be processed from the target bulletin file according to a preset text analyzer; or, performing character recognition on the target bulletin file, and forming a bulletin text to be processed based on the character recognition result.
3. The method for classifying bulletin text formats as claimed in claim 1, wherein the step of performing a structuring process on the bulletin text to be processed to obtain a structured bulletin text comprises:
searching paragraph characteristics of the bulletin text to be processed;
according to the paragraph characteristics, paragraph division is carried out on the bulletin text to be processed, so that paragraph texts of the bulletin text to be processed are obtained;
and structuring the text content of the bulletin text to be processed into a grid table by taking each paragraph text as a unit grid to obtain a structured bulletin text.
4. The method for classifying bulletin text formats as claimed in claim 1, wherein obtaining the format type of the bulletin text to be processed based on the text content sequence number comprises:
obtaining a directory structure of the bulletin text to be processed according to the text content sequence number;
and carrying out keyword identification on the directory structure to obtain the format type of the bulletin text to be processed.
5. The method for classifying bulletin text formats as claimed in claim 4, wherein said obtaining the directory structure of the bulletin text to be processed according to the text content sequence number comprises:
acquiring a hierarchy matched with the text content sequence number;
extracting the paragraph text corresponding to the text content sequence number from the structured bulletin text;
and forming a directory structure of the bulletin text to be processed according to the text content sequence number, the hierarchy matched with the text content sequence number and the paragraph text corresponding to the text content sequence number.
6. The method for classifying bulletin text formats as claimed in claim 5, wherein the step of performing keyword recognition on the directory structure to obtain the format type of the bulletin text to be processed comprises:
screening out a target level of the directory structure;
and obtaining the format type of the bulletin text to be processed according to the result of matching the paragraph text corresponding to the target level with the preset keywords.
7. The method for classifying bulletin text formats as claimed in any one of claims 1 to 6, wherein before extracting the text content sequence number from the structured bulletin text according to the preset regular condition, the method further comprises:
obtaining a bulletin text sample;
identifying text content sequence number sample characteristics of the bulletin text sample;
and performing preset logic algorithm processing on the text content sequence number sample characteristics to obtain the preset regular condition.
8. An apparatus for classifying a bulletin text format, comprising:
an acquisition module, configured to acquire a bulletin text to be processed;
a structure processing module, configured to perform structuring processing on the bulletin text to be processed to obtain a structured bulletin text;
an extraction module, configured to extract a text content sequence number from the structured bulletin text according to a preset regular condition;
and an identification module, configured to obtain the format type of the bulletin text to be processed based on the text content sequence number.
9. A classification device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202010231944.5A 2020-03-27 2020-03-27 Method, device and equipment for classifying bulletin text formats Pending CN111460151A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010231944.5A CN111460151A (en) 2020-03-27 2020-03-27 Method, device and equipment for classifying bulletin text formats

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010231944.5A CN111460151A (en) 2020-03-27 2020-03-27 Method, device and equipment for classifying bulletin text formats

Publications (1)

Publication Number Publication Date
CN111460151A true CN111460151A (en) 2020-07-28

Family

ID=71681556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010231944.5A Pending CN111460151A (en) 2020-03-27 2020-03-27 Method, device and equipment for classifying bulletin text formats

Country Status (1)

Country Link
CN (1) CN111460151A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180189299A1 (en) * 2017-01-04 2018-07-05 Red Hat, Inc. Content aggregation for unstructured data
CN109684457A (en) * 2018-12-27 2019-04-26 清华大学 A kind of method and system that personal share advertisement data is extracted
CN110909123A (en) * 2019-10-23 2020-03-24 深圳价值在线信息科技股份有限公司 Data extraction method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination