CN115965002A - Data processing method, data processing apparatus, electronic device, storage medium, and program product - Google Patents


Publication number
CN115965002A
Authority
CN
China
Prior art keywords
text
text line
line
processed
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310081628.8A
Other languages
Chinese (zh)
Inventor
于娟娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202310081628.8A
Publication of CN115965002A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The present disclosure provides a data processing method, an apparatus, an electronic device, a storage medium, and a program product. The method comprises the following steps: acquiring at least one first text line matched with target information in a text to be processed; acquiring a second text line meeting a first preset condition from at least one first text line, and acquiring format characteristic information of the second text line; acquiring at least one third text line matched with the format characteristic information of the second text line in the text to be processed; and acquiring a fourth text line meeting a second preset condition in at least one third text line, and determining an information extraction condition of the text to be processed based on the fourth text line and the first text line. The method and the device can supplement the missing layout blocks in layout analysis, so that the finally extracted structured text is more accurate.

Description

Data processing method, data processing apparatus, electronic device, storage medium, and program product
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, an apparatus, an electronic device, a storage medium, and a program product.
Background
Layout analysis is applied to text analysis, and text lines representing the same type of content can be divided into one layout block through the layout analysis. At present, layout analysis of a text is generally realized by adopting a layout analysis technology based on rules.
However, since the writing formats of texts vary, the names of the same layout block in different texts are different, and it is difficult for the layout analysis technology based on rules to comprehensively identify all layout block information, and some layout blocks are easily missed.
Disclosure of Invention
In view of the above, the present disclosure is directed to a data processing method, an apparatus, an electronic device, a storage medium, and a program product.
In view of the above object, a first aspect of the present disclosure provides a data processing method, including:
acquiring at least one first text line matched with target information in a text to be processed;
acquiring a second text line meeting a first preset condition from at least one first text line, and acquiring format characteristic information of the second text line;
acquiring at least one third text line matched with the format characteristic information of the second text line in the text to be processed;
and acquiring a fourth text line meeting a second preset condition in at least one third text line, and determining an information extraction condition of the text to be processed based on the fourth text line and the first text line.
In some embodiments, the obtaining at least one first text line matching the target information in the text to be processed includes:
and matching any text line in the text to be processed with the pre-acquired target information, and determining the text line as the first text line when any text line is the same as the target information.
In some embodiments, the first preset condition comprises a confidence level; the obtaining, from at least one of the first text lines, a second text line satisfying a first preset condition includes:
and determining at least one text line with highest confidence in at least one first text line as the second text line.
In some embodiments, the format characteristic information of the second text line includes at least one of a font color, a font name, a font size, whether a font is bolded, whether a font is tilted, and whether a font is underlined of the second text line.
In some embodiments, obtaining at least one third text line in the text to be processed, which matches the format feature information of the second text line, includes:
acquiring format characteristic information of each text line in the text to be processed;
matching the format characteristic information of each text line with the format characteristic information of the second text line, and judging whether the format characteristic information of each text line is the same as the format characteristic information of the second text line;
and when the format characteristic information of the text line is the same as that of the second text line, confirming the text line as the third text line.
In some embodiments, the obtaining a fourth text line satisfying a second preset condition in the third text line includes:
determining a text line of which the text content in the third text line meets a preset condition as the fifth text line;
and determining the text line with the relation between the text lines in the fifth text line meeting the preset condition as the fourth text line.
In some embodiments, the text content in the third text line meets a preset condition, which includes at least one of:
the text content of the third text line does not comprise a preset named entity;
the text length of the third text line is less than or equal to a first preset value;
the third text line is aligned in the same way as the first text line;
the third line of text does not include punctuation marks, or the third line of text and the first line of text include the same punctuation marks;
the text content of the third text line is of a preset language type, and the text content of the third text line satisfies at least one of the following: the number of words is less than or equal to a second preset value; the text content includes characters or letters matching the preset language type; and the text content does not include characters, letters or numbers that do not match the preset language type.
In some embodiments, the relationship between each text line in the fifth text line meets a preset condition, including at least one of:
the first text line does not exist in a range which is separated from any fifth text line by a preset line number;
the text contents of any two text lines in the fifth text line are different;
the total number of the fifth text lines is less than or equal to a third preset value.
A second aspect of the present disclosure provides a data processing apparatus comprising:
a reference information acquisition module configured to: acquiring at least one first text line matched with target information in a text to be processed;
a format feature acquisition module configured to: acquiring a second text line meeting a first preset condition from at least one first text line, and acquiring format characteristic information of the second text line;
a feature matching module configured to: acquiring at least one third text line matched with the format characteristic information of the second text line in the text to be processed;
a determination module configured to: and acquiring a fourth text line meeting a second preset condition in at least one third text line, and determining an information extraction condition of the text to be processed based on the fourth text line and the first text line.
A third aspect of the present disclosure provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect of the present disclosure when executing the program.
A fourth aspect of the disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of the first aspect.
A fifth aspect of the disclosure provides a computer program product comprising computer program instructions that, when run on a computer, cause the computer to perform the method of the first aspect.
As can be seen from the above, in the data processing method, data processing apparatus, electronic device, storage medium, and program product provided by the present disclosure, after a text to be processed is obtained, first text lines, that is, rule-based section titles, are screened out based on preset target information. Second text lines, which are the first text lines most representative of section titles, are then obtained, and each text line of the text to be processed is matched against the format characteristic information of the second text lines to select the matching text lines, that is, third text lines. Third text lines that do not meet the requirements are removed to obtain fourth text lines, and the combination of the fourth text lines and the first text lines is used as the condition for subsequent entity relationship extraction. In this way, section information that the rules cannot identify is effectively supplemented, the accuracy of the subsequent entity relationship extraction is ensured, and the finally extracted text information is more accurate.
Drawings
In order to more clearly illustrate the technical solutions in the present disclosure or related technologies, the drawings needed to be used in the description of the embodiments or related technologies are briefly introduced below, and it is obvious that the drawings in the following description are only embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1A shows a schematic diagram of an exemplary pending text.
FIG. 1B illustrates a schematic diagram of an exemplary pending text.
FIG. 2 shows a flow diagram of an exemplary method.
Fig. 3 illustrates a schematic diagram of an exemplary apparatus provided by an embodiment of the present disclosure.
Fig. 4 shows a hardware structure diagram of an exemplary computer device provided by the embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be described in further detail below with reference to specific embodiments and the accompanying drawings.
It is to be noted that technical terms or scientific terms used in the embodiments of the present disclosure should have a general meaning as understood by those having ordinary skill in the art to which the present disclosure belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the disclosure is not intended to indicate any order, quantity, or importance, but rather to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
Information Extraction (IE) is an important task in Natural Language Processing (NLP). Information extraction can be understood as extracting, by technical means, short content that meets a user's requirements from a long text. For example, if a user needs to extract the "desired position" from a resume, the process of meeting that need by technical means is called information extraction.
Alternatively, a resume in text format can be converted into a structured resume by means of information extraction, which facilitates the management and use of the resume.
Layout analysis is one link in the parsing of texts such as resumes. It divides text lines representing the same type of content into one layout block: for example, text lines describing education experience are divided into an education layout block, text lines describing work experience into a career layout block, and text lines describing personal information into a basic_info layout block. This division greatly helps the final entity relationship extraction.
Currently, rule-based layout blocking techniques are commonly employed to identify the individual layout blocks. In these techniques, a large number of section titles are collected, the titles in a text such as a resume are identified based on the collected section titles, and the text between one title and the next is assigned to the section of the former title, thereby implementing layout analysis.
However, since the writing format and title names of resumes vary from person to person, it is difficult to collect all the title names that may appear. As a result, rule-based layout blocking techniques may fail to identify some sections, so that an independent section is merged into another. For example, in the resume shown in fig. 1A, the titles of the sections are named conventionally, and the resume can be divided correctly in a rule-based manner. In the resume shown in fig. 1B, however, the title name "education stage introduction" of the education experience does not appear among the rule titles collected in advance, so the education experience section is not recognized and its content is merged into the work experience section, which may greatly affect subsequent entity relationship extraction.
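The rule-based division and its failure mode can be illustrated with a minimal Python sketch. The title set `KNOWN_TITLES`, the function `split_into_blocks`, and the sample resume lines are hypothetical and chosen only for illustration; they are not taken from the disclosure.

```python
from typing import Dict, List, Optional, Set

# Hypothetical set of section titles collected in advance.
KNOWN_TITLES: Set[str] = {
    "Basic information", "Work experience", "Educational experience", "Skills",
}

def split_into_blocks(lines: List[str], titles: Set[str]) -> Dict[str, List[str]]:
    """Assign the text between one known title and the next to the former title."""
    blocks: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in lines:
        text = line.strip()
        if text in titles:
            current = text          # a recognized title opens a new block
            blocks[current] = []
        elif current is not None:
            blocks[current].append(text)
    return blocks

# "Education stage introduction" is a title the rules do not know.
resume = [
    "Work experience",
    "Company A, engineer",
    "Education stage introduction",
    "University B",
    "Skills",
    "Python",
]
blocks = split_into_blocks(resume, KNOWN_TITLES)
```

In this sketch the unrecognized title and the line after it both end up inside the "Work experience" block, which is the kind of mis-division described above.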
In view of this, the embodiments of the present disclosure provide a data processing method to solve the above problems. As shown in fig. 2, the method includes:
step S101, at least one first text line matched with the target information in the text to be processed is obtained.
The text to be processed can be any text with an information extraction requirement. The types of text to be processed may include resumes, air tickets, invoices, and the like. The text to be processed may have a template, which may be determined by the title names and specific contents of the sections in the text. As shown in fig. 1A, the title names may be basic information, work experience, educational experience, and the like, and the specific contents may be the contents corresponding to each title name. For example, the specific contents corresponding to basic information may include name, telephone, mailbox, and the like, and the specific contents corresponding to work experience may include the employing company, position, work time, and the like, which is not limited in this embodiment.
Texts to be processed of the same type may have different templates. For example, air tickets in different countries typically employ different templates. As another example, resumes may have different templates. The embodiments of the present application limit neither the type nor the template of the text to be processed.
In one example, for a paper resume, the user may photograph the paper resume with a terminal to obtain a picture. The terminal performs character recognition on the picture, so that the information on the paper resume is converted into computer text. For example, the terminal may employ Optical Character Recognition (OCR) technology for this conversion. The text containing the computer text may be used as the text to be processed in step S101.
The target information may be information acquired by the user in advance, for example, a large number of section titles collected by the user in advance. Taking the resume as an example, the user may collect in advance the title names of the sections in various resume templates as target information, including basic information, work experience, educational experience, skills, etc.
When the text to be processed is acquired, each text line of the text to be processed may be matched with the target information, so as to acquire one or more first text lines matched with the target information. These first lines of text are the rule-based section titles.
Step S103, obtaining a second text line meeting a first preset condition from at least one first text line, and obtaining format feature information of the second text line.
In this embodiment, the section titles extracted based on the rules are not necessarily all correct; that is, the first text lines obtained in step S101 may include text lines that are not title names. A text line that is most likely to be a title name therefore needs to be selected from the rule-based section titles, that is, from the first text lines. Thus, after the first text lines are obtained, a second text line satisfying a first preset condition may be selected from them, where the first preset condition is used to select the text line most likely to be a title name.
In a resume text to be processed, title names and specific contents differ in format, so they can be distinguished based on format, which in turn enables the title name of each section to be extracted. Therefore, in this embodiment, the format feature information of the second text line may be obtained by feature extraction or the like.
The second text lines may be selected from the first text lines first, and the format characteristic information of each second text line obtained afterwards; alternatively, the format characteristic information of each first text line may be obtained first, and the second text lines and their format characteristic information then selected from the first text lines and the obtained format characteristic information.
Step S105, at least one third text line matched with the format feature information of the second text line in the text to be processed is obtained.
Since title names usually differ in format from specific contents, and the second text line is the first text line most likely to be a title name, a text line in the text to be processed that matches the format feature information of the second text line is very likely also a text line containing a title name.
Therefore, in this step, the format feature information of each text line in the text to be processed may be obtained and matched against the format feature information of the second text line, and the successfully matched text lines are obtained as third text lines, whose text content is likely to be a title name.
Step S107, a fourth text line meeting a second preset condition in at least one third text line is obtained, and an information extraction condition of the text to be processed is determined based on the fourth text line and the first text line.
In this embodiment, since the third text lines may still include text lines that are not title names, the third text lines may be further filtered based on the second preset condition. The second preset condition may be used to determine whether the text content of a third text line conforms to the rules of title naming, and/or whether the relationship among a plurality of third text lines, or between a third text line and the first text lines (i.e., the rule-based section titles), conforms to the rules of title naming.
Therefore, third text lines whose text content or relationships do not conform to the rules of title naming can be excluded based on the second preset condition, leaving the text lines that satisfy the second preset condition, that is, the fourth text lines. Finally, both the fourth text lines and the first text lines can be used as information extraction conditions of the text to be processed, and the layout of the text to be processed is divided based on these conditions, thereby enabling the subsequent entity relationship extraction.
In this embodiment, after the text to be processed is obtained, first text lines, that is, rule-based section titles, are screened out based on preset target information. Second text lines, which are the first text lines most representative of section titles, are then obtained, and each text line of the text to be processed is matched against the format characteristic information of the second text lines to select the matching text lines, that is, third text lines. Third text lines that do not meet the requirements are removed to obtain fourth text lines, and the combination of the fourth text lines and the first text lines is used as the condition for subsequent entity relationship extraction. In this way, section information that the rules cannot identify can be effectively extracted, the accuracy of the subsequent entity relationship extraction is ensured, and the finally extracted text information is more accurate.
In some embodiments, the obtaining at least one first text line matching the target information in the text to be processed in step S101 includes: and matching any text line in the text to be processed with the pre-acquired target information, and determining the text line as the first text line when any text line is the same as the target information.
In this embodiment, a large number of section titles collected by the user in advance are stored as the target information. If a text line is the same as any section title in the target information, the text line is determined to be a first text line, thereby obtaining the rule-based section titles.
For example, when the target information includes the title names of four sections, namely basic information, work experience, education experience, and skills, the title names of all four sections in the to-be-processed resume text shown in fig. 1A will be extracted successfully. In the to-be-processed resume text shown in fig. 1B, the title names of three sections (basic information, work experience, and skills) can be extracted successfully, while "education stage introduction" cannot be extracted because it does not match the target information.
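The exact-match step can be sketched as follows. This is only an illustration under the assumption that each text line is available as a plain string; the function name `find_first_text_lines` and the sample data are hypothetical.

```python
from typing import List, Set

def find_first_text_lines(lines: List[str], target_info: Set[str]) -> List[str]:
    """Return the lines whose text is identical to a pre-collected section title."""
    return [line.strip() for line in lines if line.strip() in target_info]

# Hypothetical target information with four title names.
target_info = {"Basic information", "Work experience", "Education experience", "Skills"}

# Hypothetical lines resembling the resume of fig. 1B.
resume_1b = [
    "Basic information",
    "Name: Zhang San",
    "Work experience",
    "Company A, engineer",
    "Education stage introduction",
    "University B",
    "Skills",
]
first_text_lines = find_first_text_lines(resume_1b, target_info)
```

With these inputs, three titles are returned and "Education stage introduction" is missed, matching the situation described for fig. 1B.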
In some embodiments, the first preset condition comprises a confidence level. In step S103, obtaining a second text line satisfying a first preset condition from at least one first text line includes: determining at least one text line with the highest confidence among the at least one first text line as the second text line.
In this embodiment, since the section titles extracted based on the rules are not necessarily all correct, the confidence of each first text line is calculated, and the N first text lines with the highest confidence are taken as the reference, that is, as the second text lines. N can be selected as required.
When the text to be processed is a resume text, the title names of the education experience section and the work experience section can generally be selected as the second text lines. The format characteristic information of these two title names is then obtained and used as the reference for subsequent title name screening.
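Selecting the N most confident first text lines might look like the sketch below. The disclosure does not state how the confidence is computed, so the scores here are made-up inputs, and `select_second_text_lines` is a hypothetical name.

```python
from typing import List, Tuple

def select_second_text_lines(scored: List[Tuple[str, float]], n: int = 1) -> List[str]:
    """Keep the n first text lines with the highest confidence."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:n]]

# Hypothetical (text, confidence) pairs; the confidence values are assumptions.
scored_first_lines = [
    ("Basic information", 0.71),
    ("Work experience", 0.95),
    ("Education experience", 0.93),
    ("Skills", 0.60),
]
second_text_lines = select_second_text_lines(scored_first_lines, n=2)
```

With N = 2 and these assumed scores, the work experience and education experience titles are selected as references.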
In some embodiments, the format characteristic information of the second text line includes at least one of a font color, a font name, a font size, whether the font is bold, whether the font is tilted, whether the font is underlined, and other format characteristics of the second text line.
In this embodiment, a template may be generated based on the format feature information of the second text line. In some specific embodiments, the format of the template may be fontColor_fontName_fontSize_bold_form, where fontColor represents the font color, fontName represents the font name, fontSize represents the font size, bold represents whether the font is bolded, and form represents the category. For example, as shown in fig. 1B, a template may be generated based on the format feature information of "work experience", with the title name of the work experience section as a reference.
After the template of the second text line is generated, each text line in the text to be processed can be judged based on the template, and whether the text line is matched with the template or not is determined.
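One possible way to build such a template string is sketched below. The `LineFormat` class, its field types, and the example values (color code, font name, category label) are assumptions for illustration; only the field order fontColor_fontName_fontSize_bold_form comes from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineFormat:
    """Hypothetical container for the format features of one text line."""
    font_color: str
    font_name: str
    font_size: int
    bold: bool
    form: str  # category field; its possible values are not specified in the text

def make_template(fmt: LineFormat) -> str:
    """Join the features in the order fontColor_fontName_fontSize_bold_form."""
    return f"{fmt.font_color}_{fmt.font_name}_{fmt.font_size}_{fmt.bold}_{fmt.form}"

# Assumed format of the reference title "work experience".
reference_format = LineFormat("#000000", "SimHei", 14, True, "text")
reference_template = make_template(reference_format)
```

Encoding the features as a single string keeps the later comparison a plain equality check; comparing the `LineFormat` objects directly would work equally well.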
In some embodiments, the obtaining at least one third text line in the text to be processed, which matches the format feature information of the second text line, in step S105 includes:
step S201, obtaining format feature information of each text line in the text to be processed.
Step S203, matching the format characteristic information of each text line with the format characteristic information of the second text line, and determining whether the format characteristic information of each text line is the same as the format characteristic information of the second text line.
Step S205, when the format characteristic information of the text line is the same as the format characteristic information of the second text line, determining the text line as the third text line.
In this embodiment, the format feature information of each text line in the text to be processed is extracted and matched against the format feature information of the second text line. When the format characteristic information of a text line is the same as that of the second text line, the format of the text line is the same as that of the second text line, and the text line may be a title name. When the format characteristic information of a text line differs from that of the second text line, the formats differ, and the text line is very probably not a title name.
As shown in fig. 1B, when the format characteristic information of "education stage introduction" is the same as that of the reference "work experience", "education stage introduction" is likely to be a title name.
Optionally, in this embodiment, a template for each text line may be generated based on the format feature information of each text line in the text to be processed, and then the templates of the text lines are compared with the template of the second text line, and the text line with the same template is selected as the third text line.
That is, in this embodiment, a third text line in the text to be processed, which has the same format as the text line determined as the title name, may be obtained, and the third text line has a great possibility of being the title name.
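Steps S201 to S205 can be sketched as an equality comparison between per-line templates and the reference template. The template strings and the function `find_third_text_lines` below are hypothetical; the sketch assumes the template of every line has already been generated.

```python
from typing import List, Tuple

def find_third_text_lines(
    line_templates: List[Tuple[str, str]], reference_template: str
) -> List[str]:
    """Return the text of every line whose template equals the reference template."""
    return [text for text, template in line_templates if template == reference_template]

# Hypothetical (text, template) pairs for a resume resembling fig. 1B.
TITLE_STYLE = "#000000_SimHei_14_True_text"
BODY_STYLE = "#000000_SimSun_10_False_text"
line_templates = [
    ("Work experience", TITLE_STYLE),
    ("Company A, engineer", BODY_STYLE),
    ("Education stage introduction", TITLE_STYLE),
    ("University B", BODY_STYLE),
    ("2019-2021", TITLE_STYLE),  # a bold date line that happens to share the title style
]
third_text_lines = find_third_text_lines(line_templates, TITLE_STYLE)
```

The result contains the missed title "Education stage introduction", but also the reference title itself and a non-title line with the same style, which is why further filtering of the third text lines is needed.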
In some embodiments, since the third text lines may still include text lines that are not title names, the third text lines may be further filtered based on a second preset condition. Specifically, it may be determined whether the text content of a third text line complies with the title naming rules, and/or whether the relationship among a plurality of third text lines or between a third text line and the first text lines (i.e., the rule-based section titles) complies with the title naming rules, so as to determine whether each third text line is a title name.
In this embodiment, acquiring, in step S107, a fourth text line that satisfies the second preset condition among the third text lines includes: determining a third text line whose text content meets a preset condition as a fifth text line; and determining, among the fifth text lines, the text lines for which the relationship between text lines meets a preset condition as the fourth text line.
In this embodiment, the text content of the third text line meeting a preset condition includes at least one of: the text content of the third text line does not include a preset named entity; the text length of the third text line is less than or equal to a first preset value; the third text line is aligned in the same way as the first text line; the third text line does not include punctuation marks, or the third text line and the first text line include the same punctuation marks; the text content of the third text line is of a preset language type and satisfies at least one of the following: the number of words is less than or equal to a second preset value; the text content includes characters or letters matching the preset language type; and the text content does not include characters, letters or numbers that do not match the preset language type.
A named entity is an entity with a specific meaning in the text, or a thing that can be identified by a proper noun (or name). A named entity generally denotes one specific thing; for example, named entities may include names of people, places, organizations or other proper nouns, and may also include times, quantities, currencies, ratio values, and the like. Title names, however, generally do not contain named entities, so when the text content of a third text line includes a named entity, the line is not treated as a title name.
The text length of the third text line is obtained from the difference between the left and right boundary coordinate values of the text line, rather than from the number of words. The text length of a title name is usually short. When a text line contains few words but a run of consecutive spaces in the middle, its text length is long, and such a text line is usually not a title name. Therefore, when the text length of the third text line is greater than the first preset value, the third text line is not a title name and needs to be deleted. The first preset value may be set based on conditions such as the type of the text to be processed, which is not limited in this embodiment.
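The length check can be sketched as follows. The coordinate values, the threshold of 200 units, and the function names are assumptions for illustration; the text only states that the length is the difference between the left and right boundary coordinates.

```python
def text_length(left_x: float, right_x: float) -> float:
    """Text length measured from boundary coordinates, not from the word count."""
    return right_x - left_x

def passes_length_check(left_x: float, right_x: float, first_preset_value: float) -> bool:
    """A candidate is kept only if its length does not exceed the preset value."""
    return text_length(left_x, right_x) <= first_preset_value

# Hypothetical coordinates in page units.
short_title_ok = passes_length_check(72.0, 180.0, first_preset_value=200.0)
# Few words, but a wide run of spaces stretches the line across the page.
spaced_line_ok = passes_length_check(72.0, 520.0, first_preset_value=200.0)
```

Measuring by coordinates rather than by word count is what lets the check reject a short phrase that is spread across the page by spaces.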
Title names in the same text to be processed are typically aligned in the same way. For example, if the determined title name (i.e., the first text line) is left-aligned, the other title names should also be left-aligned, provided that the text to be processed is not laid out in left and right columns. It is therefore necessary to determine whether the third text line is left-aligned: if so, the third text line may be a title name; if not, the third text line is not a title name and needs to be deleted.
A title name usually does not contain punctuation marks, in which case a third text line that includes punctuation marks is not a title name. In some cases, punctuation marks may also exist in a title name, for example the title name may be "one, basic information"; when punctuation marks exist in title names, the title names in the same text to be processed should use the same punctuation marks. In this case, if the third text line and the first text line have the same punctuation marks, the third text line may be a title name; if they have different punctuation marks, the third text line is not a title name and needs to be deleted.
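A minimal sketch of the punctuation check follows. It assumes that "the same punctuation marks" means the same set of punctuation characters, and that punctuation is identified by Unicode general category; both are assumptions made here for illustration. The function names are hypothetical.

```python
import unicodedata
from typing import Set


def punctuation_marks(text: str) -> Set[str]:
    """Return the set of punctuation characters in the text (Unicode categories P*)."""
    return {ch for ch in text if unicodedata.category(ch).startswith("P")}


def passes_punctuation_check(candidate: str, known_title: str) -> bool:
    """Keep a candidate that has no punctuation, or the same marks as the known title."""
    marks = punctuation_marks(candidate)
    return not marks or marks == punctuation_marks(known_title)


known_title = "1. Basic information"  # stands in for a first text line
```

For example, `passes_punctuation_check("2. Education", known_title)` keeps the candidate because both lines contain only a full stop, whereas a candidate such as `"Skills: Python"` is rejected because its colon does not appear in the known title name.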
When the title name is of a certain preset language type, its text content also follows corresponding preset rules. Taking English as an example, each English word (e.g., experiment, education) may correspond to multiple Chinese characters, so a title name written in English does not contain many words; if the number of words is greater than a second preset value (e.g., 5), the third text line is likely not a title name. If the third text line does not contain characters or letters that match the preset language type, for example the title names are in English but the third text line contains no letters, the third text line is not a title name. If the third text line includes characters, letters, or numbers that do not match the preset language type, for example the title names are in English but the third text line includes numbers, the third text line is likely not a title name. In each of these cases, the third text line needs to be deleted.
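The three language-type rules above can be sketched for the English case as follows. The function name is hypothetical, whitespace splitting is assumed as the way to count words, and "matching English" is approximated here as ASCII letters; the example value 5 for the second preset value is taken from the text.

```python
import re

SECOND_PRESET_VALUE = 5  # example value given in the description


def passes_english_check(text: str, max_words: int = SECOND_PRESET_VALUE) -> bool:
    """Apply the three language-type rules, assuming the preset language is English."""
    # Rule 1: the number of words must not exceed the second preset value.
    if len(text.split()) > max_words:
        return False
    # Rule 2: the line must contain at least one letter of the preset language.
    if not re.search(r"[A-Za-z]", text):
        return False
    # Rule 3: the line must not contain digits or letters from another script.
    for ch in text:
        if ch.isdigit() or (ch.isalpha() and not ch.isascii()):
            return False
    return True
```

This sketch requires all three rules at once; as the description notes, an implementation may instead require only some of them.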
In this embodiment, after the text lines that do not meet the preset conditions are deleted, the text lines remaining among the third text lines (i.e., the fifth text lines) may be title names. The preset conditions may be set according to actual needs: for example, a third text line that satisfies one or more of the preset conditions may be taken as a fifth text line, or only a third text line that satisfies all of the preset conditions may be taken as a fifth text line, which is not limited in this embodiment.
In some embodiments, the relationship between each text line in the fifth text line meets a preset condition, including at least one of: the first text line does not exist in a range of a preset line number away from any fifth text line; the text contents of any two text lines in the fifth text line are different; the total number of the fifth text lines is less than or equal to a third preset value.
The preset number of lines may be 2. That is, no title name (i.e., no first text line) exists within the range from the two lines before a fifth text line to the two lines after it, the range including the fifth text line itself. In this embodiment, each section generally includes both a title name and specific content, and does not consist of a title name alone. Therefore, if a title name is adjacent to the fifth text line, either before or after it, the fifth text line is not a title name and needs to be deleted.
Since the same section does not appear twice in the same text to be processed, a given title name usually appears only once. Therefore, when the text contents of any two of the fifth text lines are the same, those fifth text lines are not title names and need to be deleted.
Since the number of layout blocks in the same text to be processed is usually not large, if the number of remaining fifth text lines is still greater than the third preset value (e.g., 4) after the above operations, these remaining fifth text lines are not title names, and all of them are deleted.
In this embodiment, after the text lines that do not meet the preset conditions are deleted, the text lines remaining among the fifth text lines (i.e., the fourth text lines) may be title names. The preset conditions may be set according to actual needs: for example, a fifth text line that satisfies one or more of the preset conditions may be taken as a fourth text line, or only a fifth text line that satisfies all of the preset conditions may be taken as a fourth text line, which is not limited in this embodiment.
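The three relationship conditions can be sketched as a sequence of filters. The representation of a text line as an `(index, text)` pair, the function name, and the choice to apply the conditions one after another are assumptions made for illustration; the default values 2 and 4 are the examples given in the description.

```python
from collections import Counter
from typing import List, Set, Tuple

Line = Tuple[int, str]  # (line index in the text to be processed, text content)


def filter_by_relation(
    fifth_lines: List[Line],
    first_line_indices: Set[int],
    preset_line_distance: int = 2,
    third_preset_value: int = 4,
) -> List[Line]:
    """Return the candidate lines that survive all three relationship checks."""
    # 1. Drop candidates that have a known title line within the preset distance.
    kept = [
        (idx, text)
        for idx, text in fifth_lines
        if not any(
            other != idx and abs(other - idx) <= preset_line_distance
            for other in first_line_indices
        )
    ]
    # 2. Drop every candidate whose text content occurs more than once.
    counts = Counter(text for _, text in kept)
    kept = [(idx, text) for idx, text in kept if counts[text] == 1]
    # 3. If too many candidates remain, none of them is treated as a title name.
    if len(kept) > third_preset_value:
        return []
    return kept


candidates = [(3, "Skills"), (10, "Projects"), (11, "Note"), (20, "Misc"), (30, "Misc")]
known_titles = {0, 12}
fourth_lines = filter_by_relation(candidates, known_titles)
```

In this example the lines at indices 10 and 11 are removed because a known title name sits at index 12, both "Misc" lines are removed as duplicates, and the single remaining candidate is within the count limit.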
After the fourth text line is determined, the fourth text line and the first text line can together be used as title names, that is, as the conditions for subsequent entity relationship extraction. In this way, layout blocks missed by the rules can be effectively supplemented, so that the page to be processed is divided more accurately as a whole, which in turn improves the accuracy of the finally extracted structured text.
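To illustrate why supplementing title names improves the division of the page, the following sketch splits a list of text lines into layout blocks at the title lines. The function name and the dictionary output are hypothetical, and the disclosure does not specify how the subsequent extraction consumes the title names; this is only one plausible reading.

```python
from typing import Dict, List, Optional, Set


def split_into_blocks(lines: List[str], title_indices: Set[int]) -> Dict[str, List[str]]:
    """Group the lines that follow each title line under that title."""
    blocks: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for idx, text in enumerate(lines):
        if idx in title_indices:
            current = text
            blocks[current] = []
        elif current is not None:
            # Lines before the first title are ignored in this sketch.
            blocks[current].append(text)
    return blocks


lines = ["Basic Information", "Name: Li", "Education", "BSc, 2015", "Skills", "Python"]
first_line_indices = {0, 2}   # title names matched against the target information
fourth_line_indices = {4}     # title name recovered through format matching

without_fourth = split_into_blocks(lines, first_line_indices)
with_fourth = split_into_blocks(lines, first_line_indices | fourth_line_indices)
```

Without the recovered title, "Skills" and "Python" are wrongly absorbed into the "Education" block; with it, they form their own block.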
It should be noted that the method of the embodiments of the present disclosure may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and is completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may only perform one or more steps of the method of the embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above describes some embodiments of the disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, the present disclosure also provides a data processing apparatus corresponding to any of the above-described embodiments.
Referring to fig. 3, the apparatus comprises:
a reference information acquisition module 11 configured to: acquire at least one first text line matched with target information in a text to be processed;
a format feature acquisition module 13 configured to: acquire a second text line meeting a first preset condition from the at least one first text line, and acquire format characteristic information of the second text line;
a feature matching module 15 configured to: acquire at least one third text line matched with the format characteristic information of the second text line in the text to be processed;
a determination module 17 configured to: acquire a fourth text line meeting a second preset condition from the at least one third text line, and determine an information extraction condition of the text to be processed based on the fourth text line and the first text line.
In some embodiments, the reference information acquisition module 11 is further configured to: match any text line in the text to be processed against the pre-acquired target information, and determine the text line as the first text line when the text line is the same as the target information.
In some embodiments, the first preset condition comprises a confidence level; the format feature acquisition module 13 is further configured to: determine at least one text line with the highest confidence among the at least one first text line as the second text line.
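Selecting the second text line by confidence can be sketched as below. How the confidence of each first text line is produced is not described in this passage, so the `(text, confidence)` pairs and the function name are hypothetical inputs for illustration.

```python
from typing import List, Tuple


def select_second_lines(first_lines: List[Tuple[str, float]]) -> List[str]:
    """Return every first text line whose confidence equals the maximum."""
    if not first_lines:
        return []
    best = max(conf for _, conf in first_lines)
    return [text for text, conf in first_lines if conf == best]


first_lines = [("Education", 0.92), ("Skills", 0.98), ("Projects", 0.98)]
second_lines = select_second_lines(first_lines)
```

Returning a list rather than a single item reflects the wording "at least one text line with the highest confidence": ties are all kept.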
In some embodiments, the format characteristic information of the second text line includes at least one of the following for the second text line: font color, font name, font size, whether the font is bold, whether the font is italic, and whether the font is underlined.
In some embodiments, the feature matching module 15 is further configured to: acquire format characteristic information of each text line in the text to be processed; match the format characteristic information of each text line against the format characteristic information of the second text line, and judge whether the two are the same; and when the format characteristic information of a text line is the same as that of the second text line, confirm the text line as the third text line.
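A minimal sketch of this format matching follows. The `FormatFeatures` and `TextLine` classes and their field names are hypothetical; the six fields mirror the format characteristics listed in the description, and exact equality on all of them is assumed as the matching criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FormatFeatures:
    font_color: str
    font_name: str
    font_size: float
    bold: bool
    italic: bool
    underline: bool


@dataclass
class TextLine:
    text: str
    features: FormatFeatures


def match_third_lines(all_lines: List[TextLine], second_line: TextLine) -> List[TextLine]:
    """Return every line whose format features equal those of the second text line."""
    # Dataclass equality compares all six fields at once.
    return [line for line in all_lines if line.features == second_line.features]


heading = FormatFeatures("#000000", "SimHei", 14.0, True, False, False)
body = FormatFeatures("#000000", "SimSun", 10.5, False, False, False)
doc = [
    TextLine("Education", heading),
    TextLine("BSc, 2015", body),
    TextLine("Skills", heading),
]
third_lines = match_third_lines(doc, second_line=doc[0])
```

Because every text line in the text to be processed is compared, the result also contains the second text line itself; an implementation may filter it out afterwards.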
In some embodiments, the determination module 17 is further configured to: determine a text line in the third text line whose text content meets a preset condition as a fifth text line; and determine a text line in the fifth text line whose relationship with the other text lines meets a preset condition as the fourth text line.
In some embodiments, the text content in the third text line meets a preset condition, which includes at least one of:
the text content of the third text line does not include a preset named entity;
the text length of the third text line is less than or equal to a first preset value;
the third text line is aligned in the same way as the first text line;
the third text line does not include punctuation marks, or the third text line and the first text line include the same punctuation marks;
the text content of the third text line is of a preset language type, and the text content of the third text line includes at least one of the following: the number of words is less than or equal to a second preset value, including characters or letters matched with the preset language type, and not including characters, letters or numbers not matched with the preset language type.
In some embodiments, the relationship between each text line in the fifth text line meets a preset condition, including at least one of:
the first text line does not exist in a range of a preset line number away from any fifth text line;
the text contents of any two text lines in the fifth text line are different;
the total number of the fifth text lines is less than or equal to a third preset value.
For convenience of description, the above apparatus is described as being divided into various modules by function, and the modules are described separately. Of course, when the present disclosure is implemented, the functions of the various modules may be implemented in one or more pieces of software and/or hardware.
The apparatus of the foregoing embodiment is used to implement the corresponding method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, and corresponding to the method of any of the above embodiments, the present disclosure further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method according to any of the above embodiments is implemented.
Fig. 4 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static Memory device, a dynamic Memory device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called by the processor 1010 for execution.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component within the device (not shown in the drawings) or may be external to the device to provide the corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present device and other devices. The communication module may communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi, or Bluetooth).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used to implement the corresponding method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, and corresponding to the method of any of the above embodiments, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to any of the above embodiments.
Computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the method according to any of the above embodiments, and have the beneficial effects of the corresponding method embodiment, and are not described herein again.
Based on the same inventive concept, and corresponding to the method of any of the above embodiments, the present disclosure also provides a computer program product comprising a computer program. In some embodiments, the computer program is executable by one or more processors to cause the processors to perform the method. For each step in the method embodiments, the processor that executes the step may belong to the execution subject corresponding to that step.
The computer program product of the foregoing embodiment is used to enable a processor to execute the method according to any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of the present disclosure, technical features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present disclosure as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the present disclosure, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present disclosure are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that the embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The disclosed embodiments are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like that may be made within the spirit and principles of the embodiments of the disclosure are intended to be included within the scope of the disclosure.

Claims (12)

1. A data processing method, comprising:
acquiring at least one first text line matched with target information in a text to be processed;
acquiring a second text line meeting a first preset condition from at least one first text line, and acquiring format characteristic information of the second text line;
acquiring at least one third text line matched with the format characteristic information of the second text line in the text to be processed;
and acquiring a fourth text line meeting a second preset condition in at least one third text line, and determining an information extraction condition of the text to be processed based on the fourth text line and the first text line.
2. The method according to claim 1, wherein the obtaining at least one first text line matching the target information in the text to be processed comprises:
and matching any text line in the text to be processed with the pre-acquired target information, and determining the text line as the first text line when any text line is the same as the target information.
3. The method of claim 1, wherein the first preset condition comprises a confidence level; the obtaining, from at least one of the first text lines, a second text line satisfying a first preset condition includes:
and determining at least one text line with highest confidence in at least one first text line as the second text line.
4. The method according to claim 1, wherein the format characteristic information of the second text line includes at least one of the following for the second text line: font color, font name, font size, whether the font is bold, whether the font is italic, and whether the font is underlined.
5. The method according to claim 1, wherein obtaining at least one third text line in the text to be processed, which matches the format feature information of the second text line, comprises:
acquiring format characteristic information of each text line in the text to be processed;
matching the format characteristic information of each text line with the format characteristic information of the second text line, and judging whether the format characteristic information of each text line is the same as the format characteristic information of the second text line;
and when the format characteristic information of the text line is the same as that of the second text line, confirming the text line as the third text line.
6. The method according to claim 1, wherein the obtaining a fourth text line satisfying a second preset condition in the third text line comprises:
determining a text line of which the text content meets a preset condition in the third text line as the fifth text line;
and determining the text line with the relation between the text lines in the fifth text line meeting the preset condition as the fourth text line.
7. The method of claim 6, wherein the text content in the third text line meets a preset condition, and comprises at least one of the following:
the text content of the third text line does not include a preset named entity;
the text length of the third text line is less than or equal to a first preset value;
the third text line is aligned in the same way as the first text line;
the third text line does not include punctuation marks, or the third text line and the first text line include the same punctuation marks;
the text content of the third text line is of a preset language type, and the text content of the third text line includes at least one of the following: the number of words is less than or equal to a second preset value, including characters or letters matched with the preset language type, and not including characters, letters or numbers not matched with the preset language type.
8. The method according to claim 6, wherein the relationship between each text line in the fifth text line meets a preset condition, and comprises at least one of the following:
the first text line does not exist in a range of a preset line number away from any fifth text line;
the text contents of any two text lines in the fifth text line are different;
the total number of the fifth text lines is less than or equal to a third preset value.
9. A data processing apparatus, comprising:
a reference information acquisition module configured to: acquire at least one first text line matched with target information in a text to be processed;
a format feature acquisition module configured to: acquire a second text line meeting a first preset condition from the at least one first text line, and acquire format characteristic information of the second text line;
a feature matching module configured to: acquire at least one third text line matched with the format characteristic information of the second text line in the text to be processed;
a determination module configured to: acquire a fourth text line meeting a second preset condition from the at least one third text line, and determine an information extraction condition of the text to be processed based on the fourth text line and the first text line.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the program.
11. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 8.
12. A computer program product comprising computer program instructions for causing a computer to perform the method of any one of claims 1 to 8 when the computer program instructions are run on a computer.
CN202310081628.8A 2023-01-13 2023-01-13 Data processing method, data processing apparatus, electronic device, storage medium, and program product Pending CN115965002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310081628.8A CN115965002A (en) 2023-01-13 2023-01-13 Data processing method, data processing apparatus, electronic device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310081628.8A CN115965002A (en) 2023-01-13 2023-01-13 Data processing method, data processing apparatus, electronic device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN115965002A true CN115965002A (en) 2023-04-14

Family

ID=87358266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310081628.8A Pending CN115965002A (en) 2023-01-13 2023-01-13 Data processing method, data processing apparatus, electronic device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN115965002A (en)

Similar Documents

Publication Publication Date Title
US8107727B2 (en) Document processing apparatus, document processing method, and computer program product
KR101955732B1 (en) Associating captured image data with a spreadsheet
US9286526B1 (en) Cohort-based learning from user edits
KR20150128921A (en) Detection and reconstruction of east asian layout features in a fixed format document
KR20180048774A (en) System and method of digital note taking
CN110110290B (en) Method and device for setting typesetting style of electronic book
CN112149680B (en) Method and device for detecting and identifying wrong words, electronic equipment and storage medium
CN112801084A (en) Image processing method and device, electronic equipment and storage medium
CN109582934B (en) Format document conversion method and device
JP2012212293A (en) Document recognition device, document recognition method, program and storage medium
CN111046627A (en) Chinese character display method and system
CN111339910B (en) Text processing and text classification model training method and device
CN112417899A (en) Character translation method, device, computer equipment and storage medium
CN115965002A (en) Data processing method, data processing apparatus, electronic device, storage medium, and program product
US20210182477A1 (en) Information processing apparatus and non-transitory computer readable medium storing program
JP2019057137A (en) Information processing apparatus and information processing program
CN114548040A (en) Note processing method, electronic device and storage medium
JP7383882B2 (en) Information processing device and information processing program
CN113378526A (en) PDF paragraph processing method, device, storage medium and equipment
CN113111881A (en) Information processing apparatus and recording medium
CN106776489B (en) Electronic document display method and system of display device
CN117391045B (en) Method for outputting file with portable file format capable of copying Mongolian
JP7430219B2 (en) Document information structuring device, document information structuring method and program
CN110909723B (en) Information processing apparatus and computer-readable storage medium
JP4213558B2 (en) Document layout analysis program, computer-readable storage medium storing document layout analysis program, document layout analysis method, and document layout analysis apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination