CN117058699A

CN117058699A - Resume layout dividing method, system and storage medium based on LayoutLMv3 model

Info

Publication number: CN117058699A
Application number: CN202311087110.1A
Authority: CN
Inventors: 李敬泉; 徐雯; 胡伟; 徐伟招; 郑德乐
Original assignee: Shenzhen Kuakua Jingling Technology Co ltd
Current assignee: Shenzhen Kuakua Jingling Technology Co ltd
Priority date: 2023-08-28
Filing date: 2023-08-28
Publication date: 2023-11-14
Anticipated expiration: 2043-08-28
Also published as: CN117058699B

Abstract

The invention discloses a resume layout dividing method based on the LayoutLMv3 model, comprising the following steps: S1: fine-tuning the LayoutLMv3 object detection model on self-annotated resumes; S2: running inference on an unlabeled resume with the fine-tuned LayoutLMv3 object detection model to obtain the title position information of the unlabeled resume; S3: dividing the unlabeled resume into sections based on the title position information obtained in S2 and an OCR recognition algorithm, and extracting the text in each section. The invention can improve the accuracy of resume layout division and reflect the organization of information in the resume more accurately.

Description

Resume layout dividing method, system and storage medium based on LayoutLMv3 model

Technical Field

The invention relates to the technical field of resume analysis, in particular to a resume layout dividing method, a resume layout dividing system and a storage medium based on a LayoutLMv3 model.

Background

In recruitment, a recruiter must read each job seeker's resume to judge whether the candidate has the skills and experience that match the position. Extracting resume content in structured form according to its layout lets the recruiter grasp the candidate's personal information quickly and improves resume screening efficiency.

At present, structured extraction of resume information relies mainly on text keywords. For example, the resume data analysis and processing method proposed in patent CN 108874928A applies keyword matching directly to the entire resume text. That method does not account for keywords in the body text interfering with title recognition, which may cause layout division errors.

Disclosure of Invention

The invention aims to provide a resume layout dividing method, system and storage medium based on the LayoutLMv3 model. The LayoutLMv3 model is applied to resume parsing: the layout is first divided at a fine-grained level using the resume titles, which improves the accuracy of layout division and reflects the organization of information in the resume more accurately. The data is then structured on this basis, so that the resume is convenient to use and store in downstream tasks and the difficulty of locating and parsing sections in resumes of diverse formats is reduced. In addition, by combining the image information obtained through visual detection of the titles with the textual semantic information, the layout regions of different resumes can be identified accurately, improving the accuracy and recall of resume parsing as a whole.

In order to achieve the above purpose, the following technical scheme is adopted:

A resume layout dividing method based on the LayoutLMv3 model comprises the following steps:

S1: fine-tuning the LayoutLMv3 object detection model on self-annotated resumes;

S2: running inference on an unlabeled resume with the fine-tuned LayoutLMv3 object detection model to obtain the title position information of the unlabeled resume;

S3: dividing the unlabeled resume into sections based on the title position information obtained in S2 and an OCR recognition algorithm, and extracting the text in each section.

Further, the step S1 specifically includes the following steps:

S11: converting the resume into an image, enclosing each title in the resume with a rectangular box, and representing the position of each title's box by a four-tuple (x, y, box_width, box_height), where x and y are the abscissa and ordinate of the top-left vertex of the box, box_width is its width, and box_height is its height;

S12: annotating the position of each resume title as the four-tuple defined in S11, writing the annotations into a JSON file, and inputting the annotations together with the resume titles into the LayoutLMv3 model to fine-tune it and obtain the fine-tuned model parameters.
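The annotation format of S11 and S12 can be illustrated with a minimal Python sketch. The patent does not specify the JSON schema, so the field names `file_name`, `width`, `height`, `annotations`, `bbox` and `label`, the helper `make_title_annotation`, and the sample coordinates are all assumptions made for illustration.

```python
import json

def make_title_annotation(image_name, image_width, image_height, title_boxes):
    """Build one annotation record for a resume page image.

    title_boxes is a list of (x, y, box_width, box_height) four-tuples,
    where (x, y) is the top-left corner of a title's rectangle in pixels.
    The record layout is hypothetical, not the schema used in the patent.
    """
    annotations = []
    for x, y, box_width, box_height in title_boxes:
        # Reject boxes that fall outside the page image.
        if x < 0 or y < 0 or x + box_width > image_width or y + box_height > image_height:
            raise ValueError(f"box {(x, y, box_width, box_height)} exceeds image bounds")
        annotations.append({"bbox": [x, y, box_width, box_height], "label": "title"})
    return {
        "file_name": image_name,
        "width": image_width,
        "height": image_height,
        "annotations": annotations,
    }

# Illustrative values only: two title boxes on one page image.
record = make_title_annotation(
    "resume_001.jpg", 1240, 1754, [(80, 120, 300, 40), (80, 520, 260, 40)]
)
annotation_json = json.dumps(record, ensure_ascii=False, indent=2)
```

Keeping the four-tuple as a plain list under one key means the same structure can be read back unchanged when the predicted boxes are later stored in JSON.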

Further, the step S2 specifically includes the following steps:

S21: converting the unlabeled resume into an image, obtaining the name, length and width of the resume image, storing this information in JSON format, and inputting it together with the resume image into the fine-tuned LayoutLMv3 object detection model;

S22: loading the model parameters obtained in S12, obtaining the title position information of the unlabeled resume through model inference, and storing it in JSON format.
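The JSON exchanged around the model in S21 and S22 might look like the following Python sketch. The model call itself is omitted. The field names, the two helper functions, and the `fake_output` string are hypothetical stand-ins written for illustration; they are not the patent's schema and not real model output.

```python
import json

def build_inference_input(image_name, image_width, image_height):
    """Describe one unlabeled resume page for the detector (hypothetical schema)."""
    return {"file_name": image_name, "width": image_width, "height": image_height}

def parse_title_predictions(predictions_json):
    """Read predicted title boxes back as (x, y, box_width, box_height) tuples."""
    data = json.loads(predictions_json)
    return [tuple(item["bbox"]) for item in data["titles"]]

# Metadata that would accompany the page image as model input (S21).
meta_json = json.dumps(build_inference_input("resume_002.jpg", 1240, 1754))

# Made-up stand-in for the JSON that inference would store in S22.
fake_output = json.dumps({
    "file_name": "resume_002.jpg",
    "titles": [{"bbox": [80, 110, 280, 38]}, {"bbox": [80, 600, 240, 38]}],
})
boxes = parse_title_predictions(fake_output)
```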

Further, the step S3 specifically includes the following steps:

S31: obtaining the title position information of each resume in top-to-bottom order, preliminarily dividing the resume into several sections according to that information, and taking the title text of each section together with the text between that title and the next adjacent title as the text content of the section;

S32: extracting the text content of each section with an OCR (optical character recognition) algorithm, taking the first line of the extracted text of each section as the section title, and performing the final section division using that title as the keyword.
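For a single page, the grouping described in S31 and S32 can be sketched in Python as follows. The sketch assumes that OCR has already produced text lines with a top ordinate and that a title's own text line starts at or below the top edge of its box. The function name, the `(text, y)` input shape and the sample data are assumptions.

```python
def split_page_into_sections(title_boxes, ocr_lines):
    """Group the OCR lines of one page under the title box that precedes them.

    title_boxes: list of (x, y, box_width, box_height) four-tuples.
    ocr_lines:   list of (text, y) pairs, y being the top of the text line.
    Returns (preamble, sections): preamble holds the texts above the first
    title, and each section is a dict with a "title" and a "body".
    """
    title_tops = sorted(box[1] for box in title_boxes)  # top to bottom
    buckets = [[] for _ in title_tops]
    preamble = []
    for text, y in sorted(ocr_lines, key=lambda line: line[1]):
        # Index of the last title whose top edge is at or above this line.
        owner = sum(1 for top in title_tops if top <= y) - 1
        if owner < 0:
            preamble.append(text)
        else:
            buckets[owner].append(text)
    sections = []
    for lines in buckets:
        # The first line of each section is taken as its title (S32).
        sections.append({"title": lines[0] if lines else "", "body": lines[1:]})
    return preamble, sections

# Illustrative data: title boxes given out of order, OCR lines with ordinates.
titles = [(80, 500, 200, 40), (80, 120, 200, 40)]
lines = [
    ("Zhang San", 40),
    ("Work Experience", 125),
    ("Engineer at X, 2019-2023", 180),
    ("Education", 505),
    ("B.Sc., Y University", 560),
]
preamble, sections = split_page_into_sections(titles, lines)
```

Lines above the first title are returned separately so that a later step can decide what to do with them.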

Further, the dividing of the sections in S32 using the titles as keywords specifically includes the following steps:

S321: based on the layout and content of resumes, predefining the following 7 sections: basic information, work experience, educational background, project experience, self-assessment, awards and certificates, and skills, with the corresponding section labels BASIC_INFORMATION, WORK_EXPERIENCE, EDUCATION_BACKGROUND, PROJECT_EXPERIENCE, SELF_ASSESSMENT, REWARD_CERTIFICATES and SKILL respectively;

S322: listing a keyword list for each section, matching the keywords in each list against the text of every detected resume title, and, when any keyword matches, assigning the title and its content to the section corresponding to that keyword.

Further, in S322, for the basic information section, if the actual resume contains no title text corresponding to that section, the content before the first title on the first page of the resume is taken as the content of the section.
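The keyword matching of S321 and S322 and the basic information fallback can be sketched in Python as below. The seven labels come from the text; the keyword lists themselves are not given in the patent, so the English keywords here, the function names, and the input shape (a preamble list plus sections with a `title` and a `body`) are assumptions.

```python
# Hypothetical keyword lists; the patent names the labels but not the keywords.
SECTION_KEYWORDS = {
    "BASIC_INFORMATION": ["basic information", "personal information"],
    "WORK_EXPERIENCE": ["work experience", "employment"],
    "EDUCATION_BACKGROUND": ["education"],
    "PROJECT_EXPERIENCE": ["project"],
    "SELF_ASSESSMENT": ["self-assessment", "self-evaluation", "about me"],
    "REWARD_CERTIFICATES": ["award", "certificate", "honor"],
    "SKILL": ["skill"],
}

def classify_title(title_text):
    """Return the label whose keyword list matches the title, or None."""
    lowered = title_text.lower()
    for label, keywords in SECTION_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return None

def label_sections(preamble, sections):
    """Map detected sections to labels, with the preamble as basic-info fallback."""
    labeled = {}
    for section in sections:
        label = classify_title(section["title"])
        if label is not None:
            labeled.setdefault(label, []).extend(section["body"])
    # No basic information title found: use the text before the first title.
    if "BASIC_INFORMATION" not in labeled and preamble:
        labeled["BASIC_INFORMATION"] = list(preamble)
    return labeled

result = label_sections(
    ["Zhang San", "Phone: 000-0000"],
    [
        {"title": "Work Experience", "body": ["Engineer at X"]},
        {"title": "Professional Skills", "body": ["Python"]},
    ],
)
```

Substring matching is the simplest reading of "any one keyword can be matched"; a production system would likely need language-specific keyword lists and some normalization of the OCR text.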

Further, the resume layout dividing method based on the LayoutLMv3 model further comprises the following steps:

S4: visually displaying the title detection result on the corresponding resume.

Further, the step S4 specifically includes: for the title position information obtained in S2, drawing the rectangular box of each title at the corresponding position on the resume using the Python programming language.
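One dependency-free way to visualize the detected boxes from Python is to emit an SVG overlay that references the page image and outlines each title. This is a swapped-in illustration, not the patent's stated implementation: the patent only says the boxes are drawn with Python, and an imaging library could equally draw straight onto the JPG. The function name, file name and coordinates are assumptions.

```python
from xml.sax.saxutils import quoteattr

def title_overlay_svg(image_href, image_width, image_height, title_boxes):
    """Return an SVG document showing the page image with title boxes outlined.

    title_boxes is a list of (x, y, box_width, box_height) four-tuples in
    pixel coordinates of the page image.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{image_width}" height="{image_height}">',
        f'<image href={quoteattr(image_href)} '
        f'width="{image_width}" height="{image_height}"/>',
    ]
    for x, y, box_width, box_height in title_boxes:
        # One unfilled red rectangle per detected title.
        parts.append(
            f'<rect x="{x}" y="{y}" width="{box_width}" height="{box_height}" '
            'fill="none" stroke="red" stroke-width="2"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)

svg = title_overlay_svg("resume_002.jpg", 1240, 1754, [(80, 110, 280, 38)])
```

Because the SVG uses the same pixel coordinate system as the four-tuples, no coordinate conversion is needed before drawing.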

A resume layout dividing system based on the LayoutLMv3 model comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the above resume layout dividing method when executing the computer program.

There is also provided a computer readable storage medium storing a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the above-described method.

By adopting the scheme, the invention has the beneficial effects that:

1) The annotated resumes are used to fine-tune the LayoutLMv3 object detection model, and the fine-tuned model runs inference on new resumes to find the region of each section, instead of converting the resume into plain text for parsing as conventional resume parsing does. Section division is therefore more accurate for resumes of diverse formats. In the subsequent parsing stage, named entity recognition is used to parse the detailed information of each section; for example, the "work experience" section can be presented automatically in segments, so that essentially no information is lost;

2) Resumes in various formats are converted into JPG images, turning the text task into a vision task, so the method is applicable to resume data of various formats and sizes;

3) The method has broad application prospects, particularly in the human resources industry, where it can eliminate manual entry of information into systems and reduce the error rate of manual entry.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of resume annotation according to an embodiment of the present invention;

FIG. 3 is a diagram showing the result of resume section division according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail below with reference to the drawings and the specific embodiments.

Referring to FIGS. 1 to 3, the invention provides a resume layout dividing method based on the LayoutLMv3 model, which comprises the following steps:

S1: fine-tuning the LayoutLMv3 object detection model on self-annotated resumes.

The LayoutLMv3 object detection model is based on the Transformer architecture and was pre-trained on 11 million scanned document images. To better recognize and segment resume data, LayoutLMv3 is fine-tuned on self-annotated resume data to obtain a LayoutLMv3 object detection model adapted to resumes. In one embodiment, the fine-tuning proceeds as follows:

This step mainly fine-tunes the LayoutLMv3 model. In this embodiment, the resume is first converted into JPG image format; this turns the text task into a vision task, accommodates resume data of various formats and sizes, and stores the images compactly to save computer memory. The position of each section title in the image is then represented by a four-tuple (x, y, box_width, box_height), where x and y are the abscissa and ordinate of the top-left vertex of the rectangular box, box_width is its width, and box_height is its height. Next, the title position information in the training data is matched one-to-one with the resumes in this four-tuple form, the annotations are written into a JSON file, and the JSON file and the resume titles are input together into the LayoutLMv3 model, which is fine-tuned on a GPU. FIG. 2 shows the annotation result for a single resume.

S2: running inference on the unlabeled resume with the fine-tuned LayoutLMv3 object detection model to obtain the title position information of the unlabeled resume.

In one embodiment, the method specifically includes:

In this embodiment, the title position information of the unlabeled resume is obtained with the fine-tuned LayoutLMv3 object detection model. After S1 is completed, the LayoutLMv3 model fine-tuned on resume images and its model parameters are available. For a new unlabeled resume, information such as the name, length and width of the resume image is obtained and organized into JSON format, and is used together with the image data as the model input. The model parameters are loaded into the neural network, and inference yields the coordinate positions of the resume titles in JSON format, giving the coordinate four-tuple of each title in the new resume.

In one embodiment, the method specifically includes:

This step, S3, mainly aims at extracting the text information: in this embodiment, the sectioned content of the resume is obtained from the title positions and an OCR recognition algorithm. A resume typically spans multiple page images, and its content continues across pages. The title coordinates are therefore sorted by resume page and by the ordinate of the title box. After sorting, the text in the resume is obtained with OCR, which also returns the coordinates of the recognized text; by comparing these with the title ordinates, the content between two titles is assigned to the section of the preceding title. The first line of text in each section is then taken as its title, and the resume is divided into 7 sections: basic information, work experience, educational background, project experience, self-assessment, awards and certificates, and skills, with the corresponding section labels BASIC_INFORMATION, WORK_EXPERIENCE, EDUCATION_BACKGROUND, PROJECT_EXPERIENCE, SELF_ASSESSMENT, REWARD_CERTIFICATES and SKILL respectively. A keyword list is listed for each section; the keywords in each list are matched against the text of every detected resume title, and when any keyword matches, the title and its content are assigned to the section corresponding to that keyword. FIG. 3 is a schematic of the extracted resume section text, in which the content within one box forms one section.
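The cross-page ordering can be sketched in Python by sorting titles on a `(page, y)` key and locating the owning title of each OCR line with a binary search. The function names, the dictionary input keyed by page number, and the sample coordinates are assumptions made for illustration.

```python
from bisect import bisect_right

def order_titles(title_boxes_by_page):
    """Flatten {page: [(x, y, w, h), ...]} into (page, y) keys in reading order."""
    keys = [
        (page, box[1])
        for page, boxes in title_boxes_by_page.items()
        for box in boxes
    ]
    # Tuples sort by page first, then by the ordinate of the title box.
    return sorted(keys)

def owning_title(title_keys, page, y):
    """Index of the title section containing an OCR line at (page, y), or -1."""
    # The owner is the last title at or before this position in reading order.
    return bisect_right(title_keys, (page, y)) - 1

# Illustrative data: pages and boxes deliberately given out of order.
title_keys = order_titles({
    2: [(80, 300, 200, 40)],
    1: [(80, 900, 200, 40), (80, 150, 200, 40)],
})
top_of_page_two = owning_title(title_keys, 2, 50)
above_first_title = owning_title(title_keys, 1, 100)
```

In this example a line at the top of page 2 maps to index 1, the last title on page 1, which is how a section continues across a page break. An index of -1 marks text that precedes the first title of the resume.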

In addition, for the BASIC_INFORMATION section, an actual resume usually contains no title keyword. The invention therefore proposes that, for a resume containing no BASIC_INFORMATION keyword, the content before the first title on the first page of the resume is taken as the content of that section. The invention further includes step S4: visually displaying the title detection result on the corresponding resume. In one embodiment, this is done as follows: for the title position information obtained in S2, the rectangular box of each title is drawn at the corresponding position on the resume using the Python programming language, for ease of reading and inspection.

In addition, a resume layout dividing system based on the LayoutLMv3 model is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the above resume layout dividing method when executing the computer program. A computer readable storage medium is also provided, which stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the above method.

The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The memory may be a hard disk, the computer's built-in memory, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, or the like.

The computer readable storage medium may be an internal storage unit of a computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (flash card) or the like, which are provided on the computer device. Further, the computer-readable storage medium may also include both internal storage units and external storage devices of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.

The foregoing description of the preferred embodiment of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims

1. A resume layout dividing method based on a LayoutLMv3 model is characterized by comprising the following steps:

2. The resume layout dividing method based on the LayoutLMv3 model according to claim 1, wherein the S1 specifically comprises the following steps:

3. The resume layout dividing method based on the LayoutLMv3 model according to claim 2, wherein the step S2 specifically comprises the following steps:

4. The resume layout dividing method based on the LayoutLMv3 model according to claim 1, wherein the step S3 specifically comprises the following steps:

5. The resume layout dividing method based on the LayoutLMv3 model according to claim 4, wherein the step of dividing the layout by using the title as the keyword in S32 specifically comprises the following steps:

S321: based on the layout and content of resumes, predefining the following 7 sections: basic information, work experience, educational background, project experience, self-assessment, awards and certificates, and skills, with the corresponding section labels BASIC_INFORMATION, WORK_EXPERIENCE, EDUCATION_BACKGROUND, PROJECT_EXPERIENCE, SELF_ASSESSMENT, REWARD_CERTIFICATES and SKILL respectively;

6. The resume layout dividing method based on the LayoutLMv3 model according to claim 5, wherein in S322, for the basic information section, if the actual resume contains no title text corresponding to that section, the content before the first title on the first page of the resume is taken as the content of the section.

7. The resume layout dividing method based on the LayoutLMv3 model according to claim 3, wherein the resume layout dividing method based on the LayoutLMv3 model further comprises the following steps:

8. The resume layout dividing method based on the LayoutLMv3 model according to claim 7, wherein S4 specifically comprises: for the title position information obtained in S2, drawing the rectangular box of each title at the corresponding position on the resume using the Python programming language.

9. A resume layout dividing system based on a LayoutLMv3 model, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the resume layout dividing method according to any one of claims 1 to 8 when executing the computer program.

10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any one of claims 1 to 8.