CN112925837B - Text structuring method and device - Google Patents

Info

Publication number
CN112925837B
Authority
CN
China
Prior art keywords
probability
text
character
information
target character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911243760.4A
Other languages
Chinese (zh)
Other versions
CN112925837A (en)
Inventor
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Goldway Intelligent Transportation System Co Ltd
Original Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Application filed by Shanghai Goldway Intelligent Transportation System Co Ltd filed Critical Shanghai Goldway Intelligent Transportation System Co Ltd
Priority to CN201911243760.4A priority Critical patent/CN112925837B/en
Publication of CN112925837A publication Critical patent/CN112925837A/en
Application granted granted Critical
Publication of CN112925837B publication Critical patent/CN112925837B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90348 Query processing by searching ordered data, e.g. alpha-numerically ordered data

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention provide a text structuring method and device. The method comprises: obtaining text information of a first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character; sorting the at least one text unit according to the position information of the at least one text unit to obtain a target character string, wherein the target character string comprises the characters in each text unit; obtaining, in the target character string, a field result corresponding to preset attention information according to probability information of each character in the target character string; and determining a structured file corresponding to the first file according to the preset attention information and the field result corresponding to the preset attention information. Because the field result corresponding to the preset attention information is determined from the probability information of each character in the ordered target character string, the structured file corresponding to the first file is obtained with less implementation difficulty.

Description

Text structuring method and device
Technical Field
The embodiment of the invention relates to computer technology, in particular to a text structuring method and device.
Background
With the development of science and technology, submitting resumes online has become the mainstream recruitment channel. Text structuring makes it possible to quickly extract the useful information in a resume text, which improves the efficiency of resume processing.
At present, a conventional text structuring method generally defines a layout block tag matching rule and an information matching rule in advance, with a different regular expression for each piece of information (such as a name or a mobile phone number) in the matching rules. Useful information is extracted by checking whether content in the text satisfies the regular expressions of the layout block tag matching rule and the information matching rule.
However, such regular expressions are complex to define and do not generalize across different texts, which makes text structuring difficult to implement.
Disclosure of Invention
Embodiments of the invention provide a text structuring method and device to overcome the difficulty of implementing text structuring that results from having to define regular expressions.
In a first aspect, an embodiment of the present invention provides a text structuring method, including:
acquiring text information of a first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character;
sorting the at least one text unit according to the position information of the at least one text unit to obtain a target character string, wherein the target character string comprises the characters in each text unit;
obtaining, in the target character string, a field result corresponding to preset attention information according to probability information of each character in the target character string, wherein the preset attention information indicates the information required for structuring the first file, the probability information comprises a start probability and an end probability, the start probability is the probability that the character is the start character of the field result of the preset attention information, and the end probability is the probability that the character is the end character of the field result of the preset attention information;
and determining a structured file corresponding to the first file according to the preset attention information and a field result corresponding to the preset attention information.
In a possible design, the obtaining, according to probability information of each character in the target character string, a field result corresponding to preset attention information in the target character string includes:
obtaining the start probability and the end probability of each character according to each character in the target character string and the preset attention information;
obtaining, according to the start probability and the end probability of each character, a first target character with the highest start probability and a second target character with the highest end probability;
and taking the character string composed of the first target character, the second target character, and the third target characters located between the first target character and the second target character as the field result of the preset attention information.
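The selection described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `extract_field`, the sample string, and the probability values are hypothetical, and returning an empty string when the best end character precedes the best start character is an assumption, since the text does not cover that case.

```python
from typing import List


def extract_field(target: str, start_probs: List[float], end_probs: List[float]) -> str:
    """Return the substring from the most likely start character to the
    most likely end character, inclusive."""
    if not (len(target) == len(start_probs) == len(end_probs)):
        raise ValueError("one start and one end probability is required per character")
    if not target:
        return ""
    # First target character: index with the highest start probability.
    start = max(range(len(target)), key=lambda i: start_probs[i])
    # Second target character: index with the highest end probability.
    end = max(range(len(target)), key=lambda i: end_probs[i])
    if end < start:
        # Assumption: the source does not say what to do in this case.
        return ""
    # The slice also picks up the "third target characters" in between.
    return target[start:end + 1]


if __name__ == "__main__":
    target = "Name: Zhang San Gender: male"
    start_probs = [0.01] * len(target)
    end_probs = [0.01] * len(target)
    start_probs[target.index("Z")] = 0.90        # hypothetical model output
    end_probs[target.index("San") + 2] = 0.85    # hypothetical model output
    print(extract_field(target, start_probs, end_probs))
```

Taking the two maxima independently keeps the step simple, which is why the guard for an end index that falls before the start index is needed.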
In one possible design, the obtaining a start probability and an end probability of each character according to each character in the target character string and preset attention information includes:
for any character, obtaining at least one item of text feature data of the character, wherein each item of text feature data corresponds to a start probability coefficient and an end probability coefficient of the preset attention information;
multiplying each item of text feature data by its start probability coefficient to obtain a first processing result for that item, and multiplying each item of text feature data by its end probability coefficient to obtain a second processing result for that item;
adding the first processing results of the items of text feature data to obtain a third processing result, and adding the second processing results of the items of text feature data to obtain a fourth processing result;
and normalizing the third processing result to obtain the start probability of the character, and normalizing the fourth processing result to obtain the end probability of the character.
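The four processing results can be illustrated with a short Python sketch. The text does not specify the normalization, so the logistic sigmoid used here is an assumption; the function names and the feature and coefficient values are likewise hypothetical.

```python
import math
from typing import Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed normalization)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def char_probabilities(
    features: Sequence[float],
    start_coeffs: Sequence[float],
    end_coeffs: Sequence[float],
) -> Tuple[float, float]:
    """Compute (start probability, end probability) for one character."""
    if not (len(features) == len(start_coeffs) == len(end_coeffs)):
        raise ValueError("each feature needs one start and one end coefficient")
    # First and second processing results: per-feature products.
    first = [x * a for x, a in zip(features, start_coeffs)]
    second = [x * b for x, b in zip(features, end_coeffs)]
    # Third and fourth processing results: sums of the products.
    third = sum(first)
    fourth = sum(second)
    # Normalize each sum into the range (0, 1).
    return sigmoid(third), sigmoid(fourth)


if __name__ == "__main__":
    p_start, p_end = char_probabilities(
        features=[1.0, 0.0, 2.0],       # hypothetical feature data
        start_coeffs=[0.5, 3.0, 0.25],  # hypothetical coefficients
        end_coeffs=[-1.0, 1.0, 0.5],
    )
    print(p_start, p_end)
```

With a sigmoid, each character is normalized on its own. A softmax over all characters of the target character string would be another reasonable reading of "normalizing".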
In one possible design, the sorting the at least one text unit according to the position information of the at least one text unit includes:
obtaining the layout type of the first file according to the position information of each text unit, wherein the layout type comprises a left-right layout and a top-bottom layout;
determining the arrangement sequence of the text units according to the layout type of the first file;
and sequencing the at least one text unit according to the arrangement sequence of the text units.
In one possible design, if the layout type is the top-bottom layout, the text units are arranged in order from top to bottom;
and if the layout type is the left-right layout, the text units of the left part come first, followed by the text units of the right part, wherein the text units of the left part and the text units of the right part are each arranged from top to bottom.
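The two arrangement orders can be sketched in Python as follows. The sketch takes the layout type as given, because the text does not say how it is derived from the position information. It also makes several assumptions: each text unit is reduced to its center point, the y coordinate grows downward, and a hypothetical `split_x` value separates the left part from the right part.

```python
from typing import List, Optional, Tuple

# (text, center_x, center_y); y is assumed to grow downward from the top edge.
Unit = Tuple[str, float, float]


def order_units(units: List[Unit], layout: str, split_x: Optional[float] = None) -> List[Unit]:
    """Order text units for a 'top_bottom' or 'left_right' layout."""
    def top_down(u: Unit) -> Tuple[float, float]:
        # Sort by vertical position; break ties from left to right.
        return (u[2], u[1])

    if layout == "top_bottom":
        return sorted(units, key=top_down)
    if layout == "left_right":
        if split_x is None:
            raise ValueError("split_x is required for a left-right layout")
        left = sorted((u for u in units if u[1] < split_x), key=top_down)
        right = sorted((u for u in units if u[1] >= split_x), key=top_down)
        # Left part first, then right part, each from top to bottom.
        return left + right
    raise ValueError(f"unknown layout type: {layout}")


if __name__ == "__main__":
    units = [("B", 300.0, 10.0), ("A", 50.0, 20.0), ("C", 60.0, 5.0)]
    print("".join(u[0] for u in order_units(units, "top_bottom")))
    print("".join(u[0] for u in order_units(units, "left_right", split_x=200.0)))
```

Concatenating the text of the ordered units then gives the target character string.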
In one possible design, after obtaining, from the start probability and the end probability of each character, a first target character with the highest start probability and a second target character with the highest end probability, the method further includes:
determining whether the start probability of the first target character and the end probability of the second target character are greater than a preset threshold;
if not, determining that no field result of the preset attention information exists in the target character string.
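A minimal Python sketch of this check follows. It reads the condition as requiring both probabilities to exceed the threshold, which is one interpretation of the text. The function name `field_present` and the default threshold of 0.5 are hypothetical.

```python
from typing import Sequence


def field_present(
    start_probs: Sequence[float],
    end_probs: Sequence[float],
    threshold: float = 0.5,  # hypothetical preset threshold
) -> bool:
    """Return True only if the best start and best end probabilities
    both exceed the preset threshold."""
    if not start_probs or not end_probs:
        return False
    return max(start_probs) > threshold and max(end_probs) > threshold


if __name__ == "__main__":
    print(field_present([0.1, 0.9], [0.8, 0.2]))  # both maxima above 0.5
    print(field_present([0.1, 0.4], [0.8, 0.2]))  # best start probability too low
```

A check of this kind lets the method report that a resume simply has no mobile phone number, for example, instead of returning a low-confidence span.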
In a possible design, after obtaining a field result corresponding to preset attention information in the target character string according to probability information of each character in the target character string, the method further includes:
acquiring a preset format corresponding to the preset attention information;
determining whether the field result corresponding to the preset attention information satisfies the preset format;
and if so, determining the field result as the field result corresponding to the preset attention information.
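The format check can be illustrated with a small Python sketch. The text does not define the preset formats, so the table `PRESET_FORMATS` and its patterns (an 11-digit mobile phone number, a loose email shape) are hypothetical. Returning `None` for a mismatch and accepting any result that has no preset format are also assumptions.

```python
import re
from typing import Dict, Optional

# Hypothetical preset formats, keyed by preset attention information.
PRESET_FORMATS: Dict[str, str] = {
    "mobile phone number": r"\d{11}",
    "email": r"[^@\s]+@[^@\s]+",
}


def validate_field(info: str, result: str) -> Optional[str]:
    """Return the field result if it satisfies the preset format for
    `info`, otherwise None. Fields without a preset format pass."""
    pattern = PRESET_FORMATS.get(info)
    if pattern is None or re.fullmatch(pattern, result):
        return result
    return None


if __name__ == "__main__":
    print(validate_field("mobile phone number", "13800138000"))
    print(validate_field("mobile phone number", "Zhang San"))
```

Here the pattern only confirms a result that the probability step has already extracted; it is not used to locate the field in the text.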
In a second aspect, an embodiment of the present invention provides a text structuring apparatus, including:
an obtaining module, configured to obtain text information of a first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character;
a sorting module, configured to sort the at least one text unit according to the position information of the at least one text unit to obtain a target character string, wherein the target character string comprises the characters in each text unit;
the obtaining module is further configured to obtain, in the target character string, a field result corresponding to preset attention information according to probability information of each character in the target character string, where the preset attention information is used to indicate information required by a structured first file, the probability information includes a start probability and an end probability, the start probability refers to a probability that the character is used as a start character of the field result of the preset attention information, and the end probability refers to a probability that the character is used as an end character of the field result of the preset attention information;
and the determining module is used for determining the structured file corresponding to the first file according to the preset attention information and the field result corresponding to the preset attention information.
In one possible design, the obtaining module is specifically configured to:
obtaining the start probability and the end probability of each character according to each character in the target character string and the preset attention information;
obtaining, according to the start probability and the end probability of each character, a first target character with the highest start probability and a second target character with the highest end probability;
and taking the character string composed of the first target character, the second target character, and the third target characters located between the first target character and the second target character as the field result of the preset attention information.
In one possible design, the obtaining module is specifically configured to:
for any character, obtaining at least one item of text feature data of the character, wherein each item of text feature data corresponds to a start probability coefficient and an end probability coefficient of the preset attention information;
multiplying each item of text feature data by its start probability coefficient to obtain a first processing result for that item, and multiplying each item of text feature data by its end probability coefficient to obtain a second processing result for that item;
adding the first processing results of the items of text feature data to obtain a third processing result, and adding the second processing results of the items of text feature data to obtain a fourth processing result;
and normalizing the third processing result to obtain the start probability of the character, and normalizing the fourth processing result to obtain the end probability of the character.
In one possible design, the sorting module is specifically configured to:
obtaining the layout type of the first file according to the position information of each text unit, wherein the layout type comprises a left-right layout and a top-bottom layout;
determining the arrangement sequence of the text units according to the layout type of the first file;
and sequencing the at least one text unit according to the arrangement sequence of the text units.
In one possible design, if the layout type is the top-bottom layout, the text units are arranged in order from top to bottom;
and if the layout type is the left-right layout, the text units of the left part come first, followed by the text units of the right part, wherein the text units of the left part and the text units of the right part are each arranged from top to bottom.
In one possible design, further comprising: a judgment module;
the judging module is configured to determine, after the first target character with the highest start probability and the second target character with the highest end probability are obtained from the start probability and the end probability of each character, whether the start probability of the first target character and the end probability of the second target character are greater than a preset threshold;
and, if not, to determine that no field result of the preset attention information exists in the target character string.
In one possible design, the determining module is further configured to:
after a field result corresponding to preset attention information is obtained in the target character string according to the probability information of each character in the target character string, a preset format corresponding to the preset attention information is obtained;
determining whether the field result corresponding to the preset attention information satisfies the preset format;
and if so, determining the field result as the field result corresponding to the preset attention information.
In a third aspect, an embodiment of the present invention provides a text structuring device, including:
a memory for storing a program;
a processor for executing the program stored by the memory, the processor being adapted to perform the method as described above in the first aspect and any one of the various possible designs of the first aspect when the program is executed.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, including instructions, which, when executed on a computer, cause the computer to perform the method as described above in the first aspect and any one of various possible designs of the first aspect.
Embodiments of the invention provide a text structuring method and device. The method comprises: obtaining text information of a first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character; sorting the at least one text unit according to the position information of the at least one text unit to obtain a target character string, wherein the target character string comprises the characters in each text unit; obtaining, in the target character string, a field result corresponding to preset attention information according to the probability information of each character in the target character string, wherein the preset attention information indicates the information required for structuring the first file, and the probability information comprises a start probability and an end probability; and determining a structured file corresponding to the first file according to the preset attention information and the field result corresponding to the preset attention information. Sorting the text units ensures that the target character string is ordered. The field result corresponding to the preset attention information is then determined from the probability information of each character in the ordered target character string to obtain the structured file corresponding to the first file. As a result, no complex regular expressions need to be defined, the method generalizes across different texts, and text structuring becomes easier to implement.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a first flowchart of a text structuring method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a text unit of a text structuring method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a text structuring method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a layout type of a text structuring method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a field result of a text structuring method according to an embodiment of the present invention;
Fig. 6 is a first schematic structural diagram of a text structuring apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a text structuring apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic hardware structure diagram of a text structuring device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a first flowchart of a text structuring method according to an embodiment of the present invention, and Fig. 2 is a schematic diagram of a text unit of the text structuring method according to the embodiment of the present invention. The method is described below with reference to Fig. 1 and Fig. 2. As shown in Fig. 1, the method includes:
S101, obtaining text information of the first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character.
In this embodiment, the first file is a text file that needs to be structured, and the first file may be, for example, a resume file, a job file submitted by a system, a book file, and the like.
In a possible implementation, the first file may be a file in a picture format, such as BMP, JPG, or PNG, or a file in a text format, such as PDF, Word, or XML. In this embodiment, first files in a picture format and first files in a text format are processed differently, as described below.
First, for a first file in a picture format, the position information of each text unit in the first file may be obtained through a detection network, which may include but is not limited to Fast R-CNN, YOLO, and SSD. The characters included in each text unit are then obtained through a recognition network, which may include but is not limited to attention-model-based (Attention) methods, word recognition methods, and the like.
Second, for a first file in a text format, parsing software may be used directly to obtain the text units and the position information of the text units, or the processing of the first file in the text format may be implemented automatically.
The division into text units and the position information of the text units are described below with reference to Fig. 2, taking a resume file as the first file. As shown in Fig. 2, there is a first file 201, which includes the name of the resume's subject, Zhang San, and some of Zhang San's personal information. At least one text unit is obtained from the first file. The text units may be divided, for example, as shown in 202, where each rectangular box identifies one text unit.
It can be understood that text units can be divided in many ways. In the example of Fig. 2, the text unit 203 contains the characters "gender: male". In an alternative embodiment, the same content may instead be divided into 3 text units, namely "gender", ":", and "male". The specific division depends on how the first file is processed and can be set according to actual requirements.
In this embodiment, the position information of each text unit may be described by coordinates. As shown in Fig. 2, the text unit 204 has 4 vertices, so its position information may be indicated by the coordinates of those 4 vertices. For example, as shown in 205, the position information of the text unit 204 (Zhang San) is indicated by the coordinates of the 4 vertices (that is, 8 numerical values). In an optional embodiment, the coordinates of the center of the text unit may be used instead. This embodiment places no limit on the representation: as long as the position of each text unit in the first file can be indicated, the specific form of the position information may be selected according to actual requirements.
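One way to hold a text unit together with its position information is sketched below in Python. The class name `TextUnit`, the field names, and the coordinate values are hypothetical and are not taken from Fig. 2. The sketch stores the 4 vertices (8 numerical values) and derives the center point mentioned as the alternative representation.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class TextUnit:
    """A text unit and the coordinates of its 4 vertices."""
    text: str
    vertices: Tuple[Point, Point, Point, Point]

    @property
    def center(self) -> Point:
        # Average of the 4 vertices; equals the geometric center for a rectangle.
        xs = [x for x, _ in self.vertices]
        ys = [y for _, y in self.vertices]
        return (sum(xs) / 4, sum(ys) / 4)


if __name__ == "__main__":
    unit = TextUnit(
        text="Zhang San",
        vertices=((10.0, 20.0), (110.0, 20.0), (110.0, 50.0), (10.0, 50.0)),
    )
    print(unit.center)
```

Storing the vertices and computing the center on demand keeps both representations available without duplicating data.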
If the position information is indicated by coordinates, a coordinate system must be established. In a possible implementation, a rectangular coordinate system may be established with the upper left corner of the first file as the origin, or with the center point of the first file as the origin; this embodiment does not limit the choice. For example, a common coordinate system may be shared among first files, or a separate coordinate system may be established for each first file, as selected according to actual requirements.
S102, sequencing at least one text unit according to the position information of the at least one text unit to obtain a target character string, wherein the target character string comprises characters in each text unit.
In this embodiment, the text units need to be sorted so that they remain ordered and related when they are subsequently processed. Specifically, once the text information of the first file has been obtained, the position information of each text unit is known, and the arrangement order of the text units can be determined from that position information.
For example, when the first file is a resume, although resumes come in many formats, they generally present a title above and content below, so the text units may be sorted from top to bottom. Alternatively, for a resume with a distinctive style, for example one made up of multiple polygons that each contain one piece of information, each polygon may be regarded as one text unit, and the text units may be ordered arbitrarily, because no confusion arises.
Alternatively, when the first file is a book, the text units may be sorted in the usual way, from top to bottom and from left to right. Those skilled in the art will understand that the specific way of sorting the text units may be selected according to actual requirements. The purpose of sorting is to avoid disordered results such as "mobile phone number: Zhang San". Other sorting methods can be derived from the position information of the text units; for example, a required sorting rule or algorithm can be selected and the layout type determined, which is not described again here.
Each text unit includes at least one character. After the text units are sorted, the character strings they contain are in order as well, which yields the target character string.
S103, according to probability information of each character in the target character string, a field result corresponding to preset attention information is obtained in the target character string, wherein the preset attention information is used for indicating information required by the structured first file, the probability information comprises a starting probability and an ending probability, the starting probability refers to the probability that the character is used as the starting character of the field result of the preset attention information, and the ending probability refers to the probability that the character is used as the ending character of the field result of the preset attention information.
The preset attention information indicates the information required for structuring the first file. For example, if the first file is a resume, the preset attention information may include a name, a mobile phone number, an email address, a local school, and so on; that is, the preset attention information includes whatever information the structured resume file needs. Alternatively, if the first file is a book, the preset attention information may include an introduction, an abstract, a chapter directory, the body text, and so on. It can be understood that the definition of the preset attention information depends on which information matters when the first file is structured, and its specific form can be selected according to actual requirements.
Specifically, in this embodiment, the field result corresponding to the preset attention information is obtained in the target character string. For example, if the current preset attention information is "name", the field result corresponding to "name" is obtained in the target character string as follows: first, for each character in the target character string, the probability that the character belongs to the field result of "name" is obtained, and the field result is then determined according to these probabilities.
In a possible implementation, the probability that each character is the start character and the probability that each character is the end character may be obtained. The character with the highest start probability and the character with the highest end probability are then selected as the start character and the end character of the field result corresponding to the current preset attention information, which yields the field result.
In another possible implementation, the probability that each character belongs to the field result may be obtained directly, and the characters whose probability values rank within a preset ranking are selected to form a character string, which is used as the field result. The specific way of obtaining the field result corresponding to the preset attention information in the target character string may be selected according to actual requirements and is not particularly limited here.
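The ranking-based alternative can be sketched in Python as follows. The function name `top_ranked_chars`, the parameter `k` (standing in for the preset ranking), and the sample values are hypothetical. Keeping the selected characters in their original order within the target character string is an assumption, since the text only says that they form a character string.

```python
from typing import Sequence


def top_ranked_chars(target: str, probs: Sequence[float], k: int) -> str:
    """Join the k characters with the highest membership probability,
    preserving their order in the target character string."""
    if len(target) != len(probs):
        raise ValueError("one probability is required per character")
    if k <= 0:
        return ""
    # Indices of the k highest-probability characters.
    ranked = sorted(range(len(target)), key=lambda i: probs[i], reverse=True)[:k]
    # Restore document order before joining (assumption).
    return "".join(target[i] for i in sorted(ranked))


if __name__ == "__main__":
    print(top_ranked_chars("ab12cd", [0.1, 0.2, 0.9, 0.8, 0.3, 0.05], k=2))
```

Unlike the start and end approach, this selection does not guarantee that the chosen characters are contiguous in the target character string.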
And S104, determining a structured file corresponding to the first file according to the preset attention information and the field result corresponding to the preset attention information.
For example, the preset attention information and the field result corresponding to the preset attention information can be combined in the form of key-value pairs, so as to determine the structured file corresponding to the first file; or the preset attention information and the field result corresponding to the preset attention information may be stored in an array. This embodiment does not limit the manner, as long as the preset attention information and its corresponding field result are associated with each other.
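As one possible illustration of the key-value form, the sketch below serializes the preset attention information and its field results as JSON. The function name `build_structured_file`, the choice of JSON, and the decision to drop items without a field result are assumptions and are not specified by this embodiment.

```python
import json
from typing import Dict, Optional


def build_structured_file(field_results: Dict[str, Optional[str]]) -> str:
    """Combine attention items and field results as key-value pairs (hypothetical helper)."""
    # Assumption: attention items for which no field result was found are left out.
    found = {key: value for key, value in field_results.items() if value is not None}
    return json.dumps(found, ensure_ascii=False, indent=2)


structured = build_structured_file(
    {"name": "Zhang San", "mobile phone number": "1234567890", "mailbox": None}
)
```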
The text structuring method provided by the embodiment of the invention comprises the following steps: the method comprises the steps of obtaining text information of a first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character. And sequencing the at least one text unit according to the position information of the at least one text unit to obtain a target character string, wherein the target character string comprises characters in each text unit. And acquiring a field result corresponding to preset attention information in the target character string according to the probability information of each character in the target character string, wherein the probability information comprises a starting probability and an ending probability. And determining a structured file corresponding to the first file according to the preset attention information and a field result corresponding to the preset attention information. The text units are sequenced to obtain the target character strings, so that the ordered target character strings can be guaranteed, and then the field results corresponding to the preset attention fields are determined based on the probability information of each character in the ordered target character strings to obtain the structured files corresponding to the first files, so that complex regular expressions do not need to be defined, universality can be realized among various different texts, and the difficulty in realizing text structuring is reduced.
On the basis of the foregoing embodiments, the text structuring method provided in the embodiment of the present invention is further described in detail below with reference to fig. 3 to fig. 5, fig. 3 is a second flowchart of the text structuring method provided in the embodiment of the present invention, fig. 4 is a schematic layout type diagram of the text structuring method provided in the embodiment of the present invention, and fig. 5 is a schematic field result diagram of the text structuring method provided in the embodiment of the present invention.
As shown in fig. 3, the method includes:
S301, acquiring text information of the first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character.
The implementation of S301 is similar to S101, and is not described herein again.
S302, acquiring the layout type of the first file according to the position information of each text unit, wherein the layout type comprises a left-right layout and a top-bottom layout.
In this embodiment, the layout type of the first file may include a left-right layout and a top-bottom layout. In the left-right layout, the left side of the first file is one module containing part of the information, and the right side is another module containing another part of the information. A first file in the left-right layout may be, for example, the first file 401 in fig. 4, whose left side includes a name, a telephone, a mailbox, and the like, and whose right side includes an educational background, project experience, and the like.
The top and bottom layout means that the information of the first file is arranged from top to bottom, wherein the first file in the top and bottom layout may be, for example, as shown in the first file 402 in fig. 4, all the information is arranged from top to bottom, and no side column exists.
Specifically, the position information of each text unit indicates where that text unit is located in the first file. Based on the position information, it can be determined, for example, whether a text unit exists on the left side of the central area of the first file, and how large that text unit is, so as to determine whether the layout is a left-right layout or a top-bottom layout; alternatively, it can be judged whether the horizontal distance between text units is greater than a preset threshold, and if so, the first file can be determined to have a left-right layout.
For example, for the first file illustrated in fig. 4, it may be determined whether a text unit exists on the left side of the text unit "education background", for example, see the first file 401, and when a text unit exists on the left side, it may be determined that the first file 401 is a left and right layout; referring to first file 402, when there is no text element on the left, then first file 402 may be determined to be in a top and bottom layout.
It will be understood by those skilled in the art that, in the case where the position information of each text unit has been determined, the detailed implementation manner for determining the layout type of the first file may be selected according to requirements, and may not be limited to the above-described top and bottom layouts, left and right layouts, and may also include, for example, a custom layout, an irregular layout, etc., which are not particularly limited herein, as long as the layout type of the first file is determined according to the position information of the text unit.
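The distance-based judgment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `TextUnit` fields, the function name `detect_layout`, the "same row" test based on overlapping vertical extents, and the sample coordinates are all hypothetical, and the embodiment leaves the concrete judgment open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextUnit:
    text: str
    left: float
    top: float
    right: float
    bottom: float


def detect_layout(units: List[TextUnit], gap_threshold: float) -> str:
    """Return "left-right" if two units on the same row are separated by a wide gap."""
    for a in units:
        for b in units:
            # The two units share a row when their vertical extents overlap.
            same_row = a.top < b.bottom and b.top < a.bottom
            # b starts to the right of a, beyond the preset threshold.
            if same_row and b.left - a.right > gap_threshold:
                return "left-right"
    return "top-bottom"


two_columns = [
    TextUnit("Name", 0, 0, 100, 20),
    TextUnit("Education background", 300, 0, 500, 20),
]
one_column = [
    TextUnit("Name", 0, 0, 100, 20),
    TextUnit("Education background", 0, 30, 200, 50),
]
```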
S303, determining the arrangement sequence of each text unit according to the layout type of the first file.
Specifically, if the layout type is a top-bottom layout, the arrangement sequence of each text unit is from top to bottom; if the layout type is a left and right layout, the arrangement sequence of the text units is the text unit of the left layout and the text unit of the right layout, wherein the text unit of the left layout and the text unit of the right layout are arranged from top to bottom respectively.
S304, sequencing at least one text unit according to the arrangement sequence of each text unit to obtain a target character string, wherein the target character string comprises characters in each text unit.
And sequencing at least one text unit according to the arrangement sequence of the text units, namely obtaining a target character string, wherein the target character string can effectively ensure the ordering of the characters in the text units.
For example, for a first file with a left-right layout, ordering the text units as the left-side text units followed by the right-side text units keeps the two sides from being mixed together. If a first file with a left-right layout were instead output simply from top to bottom, then, referring to the first file 401 in fig. 4, the text unit "name" might appear inside "project experience" and the information would become disordered. Determining the arrangement sequence of the text units according to the layout type of the first file therefore effectively yields an ordered target character string in which preceding and following content is related.
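Steps S303 and S304 can be sketched as follows. The names `TextUnit`, `build_target_string`, and the column boundary `split_x` are assumptions introduced for illustration; the embodiment does not say how the left and right parts are separated, and joining the units without a separator is likewise an assumption.

```python
from typing import List, NamedTuple, Optional


class TextUnit(NamedTuple):
    text: str
    left: float
    top: float


def build_target_string(
    units: List[TextUnit], layout: str, split_x: Optional[float] = None
) -> str:
    """Order text units by layout type and concatenate their characters."""

    def by_row(unit: TextUnit):
        return (unit.top, unit.left)

    if layout == "left-right":
        if split_x is None:
            raise ValueError("split_x is required for a left-right layout")
        # Left part first, then right part, each from top to bottom.
        left_part = sorted((u for u in units if u.left < split_x), key=by_row)
        right_part = sorted((u for u in units if u.left >= split_x), key=by_row)
        ordered = left_part + right_part
    else:
        # Top-bottom layout: simply read from top to bottom.
        ordered = sorted(units, key=by_row)
    return "".join(u.text for u in ordered)


units = [
    TextUnit("Name", 0, 0),
    TextUnit("Education", 300, 0),
    TextUnit("Phone", 0, 30),
    TextUnit("Projects", 300, 30),
]
```

Reading the same units strictly from top to bottom interleaves the two sides, which is the disorder the paragraph above describes.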
S305, obtaining the starting probability and the ending probability of each character according to each character in the target character string and preset attention information.
In this embodiment, the probability information of each character includes a start probability and an end probability of each character, where the start probability refers to a probability that a character is a start character of a field result of the preset attention information, and the end probability refers to a probability that a character is an end character of a field result of the preset attention information.
For any character, the probability that the character is used as the initial character of the field result of the preset attention information is obtained, that is, the initial probability of the character is obtained, and the probability that the character is used as the end character of the field result of the preset attention information is obtained, that is, the end probability of the character is obtained.
The following describes a specific implementation manner for obtaining the start probability and the end probability of each character:
For any character, at least one text feature data of the character is acquired, and each text feature data corresponds to a start probability coefficient and an end probability coefficient for the preset attention information.
Specifically, the text feature data of the character may be, for example, an M-dimensional text feature, where M is an integer. The content of the text feature in each dimension may be obtained by deep learning, for example through technologies such as LSTM/BERT, so a user or a developer need not specify what the text feature data contains. The text feature data of a character is the same for different preset attention information; to ensure that the character nevertheless has different start probabilities and end probabilities for different preset attention information, each text feature data of the character has a start probability coefficient and an end probability coefficient corresponding to the preset attention information.
For example, if the current character is "Zhang" (张), which is a Chinese surname, the probability that it is the start character of the field result of "name" is relatively high, so the start probability coefficients of the character's text feature data corresponding to "name" are also relatively high, ensuring that the character obtains a relatively high start probability. By contrast, the character "Zhang" cannot be the start character of the field result of "mobile phone number", so the start probability coefficients of the character's text feature data corresponding to "mobile phone number" are necessarily low or 0. Those skilled in the art will understand that the start probability coefficient and the end probability coefficient of each text feature data corresponding to the preset attention information can also be obtained by deep learning, and a user or a developer does not need to set the coefficient values manually.
Each text feature data is multiplied by its start probability coefficient to obtain a first processing result of that text feature data, and each text feature data is multiplied by its end probability coefficient to obtain a second processing result of that text feature data.
The first processing results of the text feature data are added to obtain a third processing result, and the second processing results of the text feature data are added to obtain a fourth processing result.
The third processing result is normalized to obtain the start probability of the character, and the fourth processing result is normalized to obtain the end probability of the character.
In this embodiment, each text feature data of a character is a numerical value. Each text feature data is multiplied by its corresponding start probability coefficient, the products are added, and the sum is normalized to obtain the start probability of the character for the preset attention information; likewise, each text feature data is multiplied by its corresponding end probability coefficient, the products are added, and the sum is normalized to obtain the end probability of the character for the preset attention information. The start probability and the end probability may be obtained, for example, through the mapping of a fully connected layer.
As will be understood by those skilled in the art, the purpose of the normalization process is to limit the start probability and the end probability to 0-1, thereby facilitating subsequent processes according to the start probability and the end probability to improve the operation efficiency.
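The multiply, add, and normalize steps above can be sketched for a single character as follows. This is a sketch under assumptions: the embodiment does not name the normalization function, so a sigmoid is used here purely for illustration (a softmax over all characters would be another option), and the function names and sample values are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Map a real number into (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def start_end_probability(
    features: List[float], start_coeffs: List[float], end_coeffs: List[float]
) -> Tuple[float, float]:
    """Compute the start and end probability of one character."""
    # First/second processing results: feature times coefficient, per dimension.
    first = [x * w for x, w in zip(features, start_coeffs)]
    second = [x * w for x, w in zip(features, end_coeffs)]
    # Third/fourth processing results: sums over all dimensions.
    third = sum(first)
    fourth = sum(second)
    # Normalization (assumed here to be a sigmoid) limits both values to 0-1.
    return sigmoid(third), sigmoid(fourth)


# Made-up 2-dimensional features and coefficients for one attention item.
p_start, p_end = start_end_probability([1.0, 2.0], [0.5, 0.25], [-1.0, -0.5])
```

With a distinct pair of coefficient vectors per preset attention information, the same character features yield different probabilities for "name" and for "mobile phone number".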
S306, according to the starting probability and the ending probability of each character, a first target character with the highest starting probability and a second target character with the highest ending probability are obtained.
After the starting probability and the ending probability of each character are determined, selecting a first target character with the highest starting probability according to the starting probabilities, wherein the first target character is the starting character of the field result of the current preset attention information, and selecting a second target character with the highest ending probability, wherein the second target character is the ending character of the field result of the current preset attention information.
In an optional embodiment, after obtaining a first target character with a highest start probability and a second target character with a highest end probability, it is further required to determine whether the start probability of the first target character and the end probability of the second target character are greater than a preset threshold.
For example, if the start probabilities and end probabilities of all characters are distributed around 0.05, the probability that any character is the start character or the end character is very low, and the probability of the selected first target character may be only 0.052. The start probability of the first target character and the end probability of the second target character are therefore compared with a preset threshold; if either of them is not greater than the preset threshold, it can be determined that no field result of the preset attention information exists in the target character string. For example, a mailbox account is to be acquired, but the resume contains no mailbox at all, so it cannot be acquired. The preset threshold may be selected according to actual needs, and this embodiment does not limit it.
The first target character and the second target character are screened by setting a preset threshold, so that the accuracy of the obtained field result can be effectively ensured.
S307, taking a character string composed of the first target character, the second target character and a third target character included between the first target character and the second target character as a field result of the preset attention information.
The first target character and the second target character respectively correspond to a start character and an end character of a field result of the preset attention information, and a third character included between the first target character and the second target character is also a character in the field result of the preset attention information.
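Steps S306 and S307, together with the optional threshold check, can be sketched as follows. The function name `extract_field` and the sample probabilities are assumptions; returning no result when the end character precedes the start character is also an assumption, since the embodiment does not address that case.

```python
from typing import List, Optional


def extract_field(
    target: str, start_probs: List[float], end_probs: List[float], threshold: float
) -> Optional[str]:
    """Return the field result spanning the best start and end characters, or None."""
    if not target:
        return None
    # First target character: highest start probability.
    first = max(range(len(target)), key=lambda i: start_probs[i])
    # Second target character: highest end probability.
    second = max(range(len(target)), key=lambda i: end_probs[i])
    # Optional check: both probabilities must exceed the preset threshold.
    if start_probs[first] <= threshold or end_probs[second] <= threshold:
        return None
    # Assumption: an end character located before the start character yields no result.
    if second < first:
        return None
    # First, second, and all third target characters in between.
    return target[first:second + 1]


target = "Zhang San Phone 1234567890"
start_probs = [0.01] * len(target)
end_probs = [0.01] * len(target)
start_probs[0] = 0.9   # "Z" is the likely start character of "name"
end_probs[8] = 0.85    # the final "n" of "San" is the likely end character
name = extract_field(target, start_probs, end_probs, threshold=0.5)
```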
A specific example is described below with reference to fig. 5. As shown in fig. 5, assume that the target character string obtained from the first file 401 in fig. 4 is "Zhang San ideal position: product development contact me 1234567890 mailbox: 12345678@qq.com …", and that the preset attention information to be acquired is name, mobile phone number, and mailbox. The start probability and end probability of each character in the target character string are calculated separately for each preset attention information; then, for each preset attention information, the characters with the highest start probability and the highest end probability are selected, where the shading in fig. 5 marks the highest probability values. The following preset attention information and corresponding field results can then be obtained:
Name: Zhang San
Mobile phone number: 1234567890
Mailbox: 12345678@qq.com
In an optional embodiment, after the field result corresponding to the preset attention information is obtained, a preset format corresponding to the preset attention information may also be obtained, where different preset attention information corresponds to different preset formats. For example, for the preset attention information "name", the preset format may be: no special symbols are included; for the preset attention information "mobile phone number", the preset format may be: 11 characters, all of which are digits. The preset format corresponding to the preset attention information is set according to the specific content of the preset attention information, which is not limited herein.
It is then judged whether the field result corresponding to the preset attention information meets the preset format. If so, the field result is determined as the field result corresponding to the preset attention information. If not, it is determined that no field result corresponding to the preset attention information currently exists; or the current field result is deleted and a new field result is retrieved from the remaining part of the target character string, which is not limited in this embodiment.
By judging whether the field result meets the preset format or not, the accuracy of the obtained field result can be effectively ensured.
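The format judgment can be sketched as follows. The table `PRESET_FORMATS`, the function `validate_field`, and the exact checks are assumptions modeled on the two examples above; accepting a field result unchanged when no format is registered is also an assumption.

```python
from typing import Callable, Dict, Optional


def is_mobile_number(value: str) -> bool:
    # Preset format from the example: 11 characters, all of them digits.
    return len(value) == 11 and value.isascii() and value.isdigit()


def is_name(value: str) -> bool:
    # Preset format from the example: no special symbols (letters and spaces only).
    return bool(value.strip()) and all(ch.isalpha() or ch == " " for ch in value)


# Hypothetical mapping from preset attention information to its preset format.
PRESET_FORMATS: Dict[str, Callable[[str], bool]] = {
    "mobile phone number": is_mobile_number,
    "name": is_name,
}


def validate_field(attention: str, field_result: str) -> Optional[str]:
    """Return the field result if it meets the preset format, otherwise None."""
    check = PRESET_FORMATS.get(attention)
    if check is None:
        # Assumption: with no registered format, the field result is accepted.
        return field_result
    return field_result if check(field_result) else None


accepted = validate_field("mobile phone number", "13812345678")
rejected = validate_field("mobile phone number", "1234567890")
```

A caller that receives `None` could either report that no field result exists or retry on the remaining part of the target character string, matching the two options described above.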
S308, determining a structured file corresponding to the first file according to the preset attention information and the field result corresponding to the preset attention information.
The implementation of S308 is similar to S104, and is not described here again.
The text structuring method provided by the embodiment of the invention comprises the following steps: the method comprises the steps of obtaining text information of a first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character. And acquiring the layout type of the first file according to the position information of each text unit, wherein the layout type is a left-right layout or an upper-lower layout. And determining the arrangement sequence of each text unit according to the layout type of the first file. And sequencing at least one text unit according to the arrangement sequence of each text unit to obtain a target character string, wherein the target character string comprises characters in each text unit. And obtaining a starting probability and an ending probability of each character according to each character in the target character string and preset attention information, wherein the starting probability refers to the probability that the character is used as the starting character of a field result of the preset attention information, and the ending probability refers to the probability that the character is used as the ending character of the field result of the preset attention information. And acquiring a first target character with the highest starting probability and a second target character with the highest finishing probability according to the starting probability and the finishing probability of each character. And taking a character string consisting of the first target character, the second target character and a third target character included between the first target character and the second target character as a field result of the preset attention information. 
And determining a structured file corresponding to the first file according to the preset attention information and a field result corresponding to the preset attention information. The arrangement sequence of the text units is determined according to the layout type of the first file, so that the orderliness and the logicality of the output target character strings can be guaranteed, the disorder of information is avoided, the field result of the preset attention information is determined according to the starting probability and the ending probability of each character, the accuracy of field result determination can be guaranteed, and meanwhile, the complicated work of defining a regular expression is avoided.
Fig. 6 is a first schematic structural diagram of a text structuring apparatus according to an embodiment of the present invention. As shown in fig. 6, the apparatus 60 includes: an obtaining module 601, a sorting module 602, and a determining module 603.
An obtaining module 601, configured to obtain text information of a first file, where the text information includes at least one text unit and location information of the at least one text unit, and the text unit includes at least one character;
a sorting module 602, configured to sort the at least one text unit according to the position information of the at least one text unit to obtain a target character string, where the target character string includes characters in each text unit;
the obtaining module 601 is further configured to obtain, according to probability information of each character in the target character string, a field result corresponding to preset attention information in the target character string, where the preset attention information is used to indicate information required by a structured first file, and the probability information includes a start probability and an end probability, where the start probability refers to a probability that a character is a start character of the field result of the preset attention information, and the end probability refers to a probability that a character is an end character of the field result of the preset attention information;
a determining module 603, configured to determine, according to the preset attention information and a field result corresponding to the preset attention information, a structured file corresponding to the first file.
In one possible design, the obtaining module 601 is specifically configured to:
obtaining the initial probability and the end probability of each character according to each character in the target character string and preset attention information;
acquiring a first target character with the highest starting probability and a second target character with the highest finishing probability according to the starting probability and the finishing probability of each character;
and taking a character string composed of the first target character, the second target character and a third target character included between the first target character and the second target character as a field result of the preset attention information.
In one possible design, the obtaining module 601 is specifically configured to:
aiming at any one character, at least one text characteristic data of the character is obtained, and the at least one text characteristic data respectively corresponds to a starting probability coefficient and an ending probability coefficient of the preset attention information;
multiplying each text characteristic data by the initial probability coefficient of each text characteristic data to obtain a first processing result of each text characteristic data, and multiplying each text characteristic data by the ending probability coefficient of each text characteristic data to obtain a second processing result of each text characteristic data;
adding the first processing results of the text characteristic data to obtain a third processing result, and adding the second processing results of the text characteristic data to obtain a fourth processing result;
and normalizing the third processing result to obtain the initial probability of the character, and normalizing the fourth processing result to obtain the end probability of the character.
In one possible design, the sorting module 602 is specifically configured to:
acquiring the layout type of the first file according to the position information of each text unit, wherein the layout type comprises a left-right layout and a top-bottom layout;
determining the arrangement sequence of the text units according to the layout type of the first file;
and sequencing the at least one text unit according to the arrangement sequence of each text unit.
In one possible design, if the layout type is a top-bottom layout, the arrangement sequence of the text units is from top to bottom;
and if the layout type is a left and right layout, the arrangement sequence of the text units is the text unit of the left layout and the text unit of the right layout, wherein the text unit of the left layout and the text unit of the right layout are arranged from top to bottom respectively.
The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 7 is a schematic structural diagram of a text structuring apparatus according to an embodiment of the present invention. As shown in fig. 7, on the basis of the embodiment in fig. 6, this embodiment further includes: a judging module 704.
In a possible design, the judging module 704 is configured to determine, after the first target character with the highest start probability and the second target character with the highest end probability are obtained according to the start probability and the end probability of each character, whether the start probability of the first target character and the end probability of the second target character are greater than a preset threshold;
if not, determine that the field result of the preset attention information does not exist in the target character string.
In one possible design, the judging module 704 is further configured to:
after a field result corresponding to preset attention information is obtained in the target character string according to the probability information of each character in the target character string, a preset format corresponding to the preset attention information is obtained;
judging whether a field result corresponding to the preset attention information meets the preset format or not;
and if so, determining the field result as the field result corresponding to the preset attention information.
The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 8 is a schematic diagram of a hardware structure of a text structuring device according to an embodiment of the present invention, and as shown in fig. 8, a text structuring device 80 according to this embodiment includes: a processor 801 and a memory 802; wherein
A memory 802 for storing computer-executable instructions;
the processor 801 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the text structuring method in the above embodiments. Reference may be made in particular to the description relating to the method embodiments described above.
Alternatively, the memory 802 may be separate or integrated with the processor 801.
When the memory 802 is provided separately, the text structuring device further comprises a bus 803 for connecting the memory 802 and the processor 801.
The embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the text structuring method performed by the text structuring device is implemented.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is only one logical functional division, and other divisions are possible in practice; for instance, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (NVM), such as at least one disk memory; it may also be a USB disk, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for text structuring, comprising:
acquiring text information of a first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character;
sorting the at least one text unit according to the position information of the at least one text unit to obtain a target character string, wherein the target character string comprises the characters of each text unit;
acquiring a field result corresponding to preset attention information in the target character string according to probability information of each character in the target character string, wherein the preset attention information indicates the information required for structuring the first file, the probability information comprises a start probability and an end probability, the start probability is the probability that the character is the start character of the field result of the preset attention information, and the end probability is the probability that the character is the end character of the field result of the preset attention information;
and determining a structured file corresponding to the first file according to the preset attention information and a field result corresponding to the preset attention information.
2. The method according to claim 1, wherein the acquiring a field result corresponding to preset attention information in the target character string according to the probability information of each character in the target character string comprises:
obtaining the start probability and the end probability of each character according to each character in the target character string and the preset attention information;
acquiring a first target character with the highest start probability and a second target character with the highest end probability according to the start probability and the end probability of each character;
and taking a character string composed of the first target character, the second target character and a third target character located between the first target character and the second target character as the field result of the preset attention information.
3. The method according to claim 2, wherein obtaining a start probability and an end probability of each character according to each character in the target character string and preset attention information comprises:
for any character, obtaining at least one item of text feature data of the character, wherein each item of text feature data corresponds to a start probability coefficient and an end probability coefficient of the preset attention information;
multiplying each item of text feature data by its start probability coefficient to obtain a first processing result of that item, and multiplying each item of text feature data by its end probability coefficient to obtain a second processing result of that item;
adding the first processing results of the items of text feature data to obtain a third processing result, and adding the second processing results of the items of text feature data to obtain a fourth processing result;
and normalizing the third processing result to obtain the start probability of the character, and normalizing the fourth processing result to obtain the end probability of the character.
4. The method of claim 1, wherein the sorting the at least one text unit according to the position information of the at least one text unit comprises:
acquiring the layout type of the first file according to the position information of each text unit, wherein the layout type comprises a left layout, a right layout, an upper layout and a lower layout;
determining the arrangement order of the text units according to the layout type of the first file;
and sorting the at least one text unit according to the arrangement order of the text units.
5. The method of claim 2, wherein after acquiring the first target character with the highest start probability and the second target character with the highest end probability according to the start probability and the end probability of each character, the method further comprises:
determining whether the start probability of the first target character and the end probability of the second target character are greater than a preset threshold;
if not, determining that the field result of the preset attention information does not exist in the target character string.
6. The method according to claim 1, wherein after obtaining a field result corresponding to preset attention information in the target character string according to probability information of each character in the target character string, the method further comprises:
acquiring a preset format corresponding to the preset attention information;
determining whether the field result corresponding to the preset attention information conforms to the preset format;
and if so, determining the field result as the field result corresponding to the preset attention information.
7. A text structuring apparatus, comprising:
an obtaining module, configured to acquire text information of a first file, wherein the text information comprises at least one text unit and position information of the at least one text unit, and the text unit comprises at least one character;
a sorting module, configured to sort the at least one text unit according to the position information of the at least one text unit to obtain a target character string, wherein the target character string comprises the characters of each text unit;
wherein the obtaining module is further configured to acquire, in the target character string, a field result corresponding to preset attention information according to probability information of each character in the target character string, where the preset attention information indicates the information required for structuring the first file, the probability information includes a start probability and an end probability, the start probability is the probability that the character is the start character of the field result of the preset attention information, and the end probability is the probability that the character is the end character of the field result of the preset attention information;
and a determining module, configured to determine the structured file corresponding to the first file according to the preset attention information and the field result corresponding to the preset attention information.
8. The apparatus of claim 7, wherein the obtaining module is specifically configured to:
obtain the start probability and the end probability of each character according to each character in the target character string and the preset attention information;
acquire a first target character with the highest start probability and a second target character with the highest end probability according to the start probability and the end probability of each character;
and take a character string composed of the first target character, the second target character and a third target character located between the first target character and the second target character as the field result of the preset attention information.
9. A text structuring device, comprising:
a memory for storing a program;
a processor for executing the program stored by the memory, the processor being configured to perform the method of any of claims 1 to 6 when the program is executed.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 6.
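
For illustration only, the field extraction recited in claims 2, 3, 5 and 6 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names (`softmax`, `weighted_sum`, `extract_field`), the toy feature values and coefficients, and the default threshold are all hypothetical. The claims do not specify how the processing results are normalized, so a softmax over the characters of the target character string is assumed here. Returning no result when the end character precedes the start character, or when the format check fails, is likewise an assumption.

```python
import math
import re
from typing import List, Optional, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores into probabilities that sum to 1 (assumed normalization)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(features: Sequence[float], coeffs: Sequence[float]) -> float:
    """Multiply each feature by its coefficient and add the products (claim 3)."""
    return sum(f * c for f, c in zip(features, coeffs))


def extract_field(
    text: str,
    features: Sequence[Sequence[float]],  # one feature vector per character
    start_coeffs: Sequence[float],
    end_coeffs: Sequence[float],
    threshold: float = 0.5,               # hypothetical preset threshold
    pattern: Optional[str] = None,        # hypothetical preset format (regex)
) -> Optional[str]:
    if not text or len(features) != len(text):
        return None

    # Claim 3: weighted sums per character, then normalization.
    start_probs = softmax([weighted_sum(f, start_coeffs) for f in features])
    end_probs = softmax([weighted_sum(f, end_coeffs) for f in features])

    # Claim 2: characters with the highest start and end probabilities.
    start = max(range(len(text)), key=start_probs.__getitem__)
    end = max(range(len(text)), key=end_probs.__getitem__)

    # Claim 5: both probabilities must exceed the threshold, else no field exists.
    if start_probs[start] <= threshold or end_probs[end] <= threshold:
        return None
    if end < start:  # assumption: an inverted span is treated as "no field"
        return None

    field = text[start:end + 1]  # start char, end char and everything between

    # Claim 6: accept the field only if it conforms to the preset format.
    if pattern is not None and re.fullmatch(pattern, field) is None:
        return None  # assumption: a non-conforming field is discarded
    return field


# Toy data: feature 0 fires on the start character, feature 1 on the end character.
text = "ID:A12;"
features = [[0.0, 0.0] for _ in text]
features[3] = [10.0, 0.0]  # 'A'
features[5] = [0.0, 10.0]  # '2'
print(extract_field(text, features, [1.0, 0.0], [0.0, 1.0], pattern=r"[A-Z]\d{2}"))
```

In this toy input the start score peaks at `A` and the end score peaks at `2`, so the extracted field is `A12`. With uniform features every character receives the same probability, which falls below the threshold, and no field is reported.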
CN201911243760.4A 2019-12-06 2019-12-06 Text structuring method and device Active CN112925837B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911243760.4A CN112925837B (en) 2019-12-06 2019-12-06 Text structuring method and device

Publications (2)

Publication Number Publication Date
CN112925837A (en) 2021-06-08
CN112925837B (en) 2022-08-02

Family

ID=76161781

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant