CN117058699B

CN117058699B - Resume layout division method, system, and storage medium based on the LayoutLMv3 model

Info

Publication number: CN117058699B
Application number: CN202311087110.1A
Authority: CN
Inventors: 李敬泉; 徐雯; 胡伟; 徐伟招; 郑德乐
Original assignee: Shenzhen Kuakua Jingling Technology Co ltd
Current assignee: Shenzhen Kuakua Jingling Technology Co ltd
Priority date: 2023-08-28
Filing date: 2023-08-28
Publication date: 2024-04-19
Anticipated expiration: 2043-08-28
Also published as: CN117058699A

Abstract

The invention discloses a resume layout division method based on the LayoutLMv3 model, comprising the following steps: S1: fine-tuning the LayoutLMv3 object detection model on self-annotated resumes; S2: running inference on an unannotated resume with the fine-tuned LayoutLMv3 object detection model to obtain the title position information of the unannotated resume; S3: dividing the unannotated resume into sections based on the title position information obtained in step S2 and an OCR algorithm, and extracting the text in each section. The invention improves the accuracy of resume layout division and more accurately reflects how information is organized in the resume.

Description

Resume layout division method, system, and storage medium based on the LayoutLMv3 model

Technical Field

The invention relates to the technical field of resume parsing, and in particular to a resume layout division method, system, and storage medium based on the LayoutLMv3 model.

Background

In recruitment, the recruiter reads each job seeker's resume to judge whether the job seeker has the ability and experience that match the position. Extracting resume content in a structured way according to its layout lets the recruiter grasp the job seeker's personal information quickly and makes resume screening more efficient.

At present, structured extraction of resume information relies mainly on text keywords. For example, the resume data analysis and processing method proposed in patent CN 108874928A applies keyword matching directly to the full text of the resume. That approach does not account for keywords that appear in the body text being mistaken for titles, so it can produce layout division errors.

Disclosure of Invention

The invention aims to provide a resume layout division method, system, and storage medium based on the LayoutLMv3 model. The LayoutLMv3 model is applied to resume parsing: the layout is first divided at a fine-grained level using the resume titles, which improves the accuracy of layout division and more accurately reflects how information is organized in the resume. The data is then structured on that basis, which makes the resume easier to use and store in downstream tasks and eases the difficulty of locating and parsing layouts across diverse resumes. In addition, using image-based visual cues to help locate titles, combined with text semantic information, allows the layout regions of different types of resumes to be identified accurately, improving the precision and recall of overall resume parsing.

In order to achieve the above purpose, the following technical scheme is adopted:

A resume layout division method based on the LayoutLMv3 model comprises the following steps:

S1: fine-tuning the LayoutLMv3 object detection model on self-annotated resumes;

S2: running inference on an unannotated resume with the fine-tuned LayoutLMv3 object detection model to obtain the title position information of the unannotated resume;

S3: dividing the unannotated resume into sections based on the title position information obtained in step S2 and an OCR algorithm, and extracting the text in each section.

Further, the step S1 specifically includes the following steps:

S11: converting the resume into an image, enclosing each title in the resume in a rectangular frame, and representing the position of each title's frame in the resume by a quadruple (x, y, box_width, box_height), wherein x is the abscissa of the top-left vertex of the frame, y is the ordinate of the top-left vertex of the frame, box_width is the width of the frame, and box_height is the height of the frame;

S12: annotating the position information of the resume titles in the quadruple form of S11, writing the annotation information into a JSON file, and inputting the JSON file together with the resume titles into the LayoutLMv3 model to fine-tune it, obtaining the fine-tuned model parameters.
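
The annotation described in S11 and S12 can be sketched as follows. This is a minimal illustration, not the patent's actual file format: the function name `make_title_annotation`, the file name `resume_001.jpg`, and the JSON keys are all assumptions; only the quadruple (x, y, box_width, box_height) comes from the text. The quadruple resembles the common [x, y, width, height] bounding-box convention used by object detection datasets.

```python
import json


def make_title_annotation(image_name, boxes):
    """Build one annotation record for a resume page image.

    `boxes` is a list of (x, y, box_width, box_height) quadruples, where
    (x, y) is the top-left vertex of a title's rectangular frame.
    The key names are hypothetical, not taken from the patent.
    """
    titles = []
    for x, y, box_width, box_height in boxes:
        if box_width <= 0 or box_height <= 0:
            raise ValueError("box_width and box_height must be positive")
        titles.append(
            {"x": x, "y": y, "box_width": box_width, "box_height": box_height}
        )
    return {"image": image_name, "titles": titles}


# Two title frames on one page, serialized as the JSON annotation text.
record = make_title_annotation(
    "resume_001.jpg", [(40, 120, 200, 32), (40, 480, 180, 30)]
)
annotation_json = json.dumps(record, ensure_ascii=False, indent=2)
```

Keeping one record per page image makes the one-to-one correspondence between a resume and its title positions explicit.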

Further, the step S2 specifically includes the following steps:

S21: converting the unannotated resume into an image, obtaining the name, length, and width of the resume image, storing this information in JSON format, and inputting it together with the resume image into the fine-tuned LayoutLMv3 object detection model;

S22: loading the model parameters obtained in S12, obtaining the title position information of the unannotated resume through model inference, and storing it in JSON format.
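
The JSON handling around inference in S21 and S22 might look like the sketch below. The model call itself is omitted. The key names (`file_name`, `width`, `height`, `titles`, `bbox`) and both function names are assumptions made for illustration; the patent states only that the image name, length, and width go in as JSON and that title positions come back as JSON.

```python
import json


def make_inference_record(image_name, width, height):
    """Describe one unannotated resume image for the fine-tuned detector.

    Key names are hypothetical; the source lists only name, length, width.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return {"file_name": image_name, "width": width, "height": height}


def parse_title_predictions(prediction_json):
    """Read predicted titles back as (x, y, box_width, box_height) tuples.

    Assumes a hypothetical output layout: {"titles": [{"bbox": [...]}, ...]}.
    """
    data = json.loads(prediction_json)
    return [tuple(item["bbox"]) for item in data["titles"]]


model_input = json.dumps(make_inference_record("resume_new.jpg", 1240, 1754))
predicted = parse_title_predictions('{"titles": [{"bbox": [40, 120, 200, 32]}]}')
```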

Further, the step S3 specifically includes the following steps:

S31: obtaining the title position information in each resume in top-to-bottom order, preliminarily dividing the resume into several sections according to that information, and taking the title text of each section, together with the text between that title and the next adjacent title, as the text content of the section;

S32: extracting the text content of each title section with an OCR (optical character recognition) algorithm, taking the first line of the extracted text in each section as the title of the section, and performing the final section division using the title as the keyword.
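
The preliminary division in S31 can be sketched for a single page as follows. This is an illustrative reading of the step, assuming the title frames are given as the quadruples defined in S11; the function name and the returned dictionary keys are hypothetical.

```python
def preliminary_sections(title_boxes, page_height):
    """Split one page into vertical bands, one per detected title.

    Each band runs from the top of its title frame to the top of the next
    title frame, or to the bottom of the page for the last title.
    """
    # Sort the frames top to bottom by the y coordinate of the top-left vertex.
    ordered = sorted(title_boxes, key=lambda box: box[1])
    sections = []
    for i, box in enumerate(ordered):
        top = box[1]
        bottom = ordered[i + 1][1] if i + 1 < len(ordered) else page_height
        sections.append({"title_box": box, "top": top, "bottom": bottom})
    return sections


# Frames may arrive in any order; the result is ordered top to bottom.
sections = preliminary_sections([(40, 480, 180, 30), (40, 120, 200, 32)], 1754)
```

OCR text whose vertical position falls inside a band would then belong to that band's title, and its first line would serve as the section title for S32.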

Further, the dividing of the sections in S32 using the titles as keywords specifically includes the following steps:

S321: based on the layout and content of resumes, defining in advance the following 7 sections: basic information, work experience, educational background, project experience, self-assessment, awards and certificates, and skills, with the corresponding section labels BASIC_INFORMATION, WORK_EXPERIENCE, EDUCATION_BACKGROUND, PROJECT_EXPERIENCE, SELF_ASSESSMENT, REWARD_CERTIFICATES, and SKILL;

S322: listing a keyword list for each section, and, for the text of each detected resume title, matching the keywords in the lists against it; when any keyword matches, the title and its content are assigned to the section corresponding to that keyword.

Further, in S322, for the basic information section, if the actual resume contains no title text corresponding to that section, the content before the first title on the first page of the resume is taken as the content of the section.
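
The keyword matching of S322 and the basic information fallback can be sketched as below. The seven labels come from S321; the keyword lists themselves are invented English examples (the patent does not publish its lists), and the function names are hypothetical.

```python
# Illustrative keyword lists only; the patent does not disclose the real ones.
SECTION_KEYWORDS = {
    "BASIC_INFORMATION": ["basic information", "personal information"],
    "WORK_EXPERIENCE": ["work experience", "employment"],
    "EDUCATION_BACKGROUND": ["education"],
    "PROJECT_EXPERIENCE": ["project"],
    "SELF_ASSESSMENT": ["self-assessment", "self-evaluation", "about me"],
    "REWARD_CERTIFICATES": ["award", "certificate"],
    "SKILL": ["skill"],
}


def classify_title(title_text):
    """Return the first label whose keyword list matches the title, else None."""
    lowered = title_text.lower()
    for label, keywords in SECTION_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return None


def divide_sections(sections, text_before_first_title=""):
    """Assign (title_text, body_text) pairs to section labels.

    If no title matched BASIC_INFORMATION, fall back to the text that
    precedes the first title on the first page.
    """
    result = {}
    for title_text, body_text in sections:
        label = classify_title(title_text)
        if label is not None:
            result.setdefault(label, []).append(body_text)
    if "BASIC_INFORMATION" not in result and text_before_first_title:
        result["BASIC_INFORMATION"] = [text_before_first_title]
    return result


result = divide_sections(
    [("Work Experience", "Acme 2019-2023"), ("Professional Skills", "Python")],
    text_before_first_title="Jane Doe, jane@example.com",
)
```

Matching runs only on detected title text, not on the whole resume, which is what keeps body-text keywords from triggering a false section boundary.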

Further, the resume layout division method based on the LayoutLMv3 model also comprises the following step:

S4: visually displaying the title detection result on the corresponding resume.

Further, step S4 specifically comprises: for the resume title position information obtained in step S2, drawing the rectangular frame enclosing each title at the corresponding position in the resume using the Python programming language.
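
A dependency-free sketch of the drawing in S4 is given below. It converts a quadruple into corner coordinates and outlines the frame on a small character grid that stands in for the resume image. The grid and the function names are assumptions for illustration only. In practice an imaging library would do the drawing, for example Pillow's `ImageDraw.Draw(image).rectangle(...)`; that choice is also an assumption, since the patent says only that Python is used.

```python
def to_corners(box):
    """Convert (x, y, box_width, box_height) to inclusive pixel corners."""
    x, y, box_width, box_height = box
    return x, y, x + box_width - 1, y + box_height - 1


def draw_box_outline(canvas, box, mark="#"):
    """Draw a one-pixel outline of `box` onto a 2-D list `canvas` in place."""
    left, top, right, bottom = to_corners(box)
    for col in range(left, right + 1):  # top and bottom edges
        canvas[top][col] = mark
        canvas[bottom][col] = mark
    for row in range(top, bottom + 1):  # left and right edges
        canvas[row][left] = mark
        canvas[row][right] = mark
    return canvas


# An 8x5 blank "page" with one title frame at (1, 1), 5 wide and 3 high.
canvas = [["." for _ in range(8)] for _ in range(5)]
draw_box_outline(canvas, (1, 1, 5, 3))
rendered = ["".join(row) for row in canvas]
```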

A resume layout division system based on the LayoutLMv3 model comprises a memory, a processor, and a computer program that is stored in the memory and executable on the processor, wherein the processor implements the resume layout division method described above when executing the computer program.

There is also provided a computer-readable storage medium storing a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the above-described method.

With the above scheme, the invention has the following beneficial effects:

1) Annotated resumes are used to fine-tune the LayoutLMv3 object detection model, and the fine-tuned model runs inference on new resumes to find the region of each section, rather than only converting the resume to plain text for parsing as conventional approaches do. This gives higher accuracy when dividing diverse resumes into sections. In subsequent parsing, named entity recognition is used to parse the details of each section; for example, "work experience" can be presented automatically in segments, so that essentially no information is lost;

2) Resumes in various formats are converted into JPG images, turning a text task into a vision task, so the method is applicable to resume data of various formats and sizes;

3) The method has broad application prospects, particularly in the human resources industry, where it avoids manually entering information into systems and reduces the error rate of manual entry.

Drawings

FIG. 1 is a flowchart of the present invention;

FIG. 2 is a schematic diagram of resume annotation according to an embodiment of the present invention;

FIG. 3 is a diagram showing the result of dividing resume content into sections according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail below with reference to the drawings and the specific embodiments.

Referring to FIG. 1 to FIG. 3, the invention provides a resume layout division method based on the LayoutLMv3 model, which comprises the following steps:

S1: Fine-tune the LayoutLMv3 object detection model on self-annotated resumes.

The LayoutLMv3 object detection model is based on the Transformer architecture and was itself trained on 11 million scanned document images. To better recognize and segment resume data, LayoutLMv3 is first fine-tuned on self-annotated resume data to obtain an object detection model adapted to resumes. In one embodiment, the LayoutLMv3 object detection model is fine-tuned as follows:

This step mainly fine-tunes the LayoutLMv3 model. In this embodiment, the resume is first converted into a JPG image. Doing so turns a text task into a vision task, adapts to resume data of various formats and sizes, and saves the computer's memory resources when storing the images. The position in the image of each section title in the resume is then represented by a quadruple (x, y, box_width, box_height), wherein x is the abscissa of the top-left vertex of the rectangular frame, y is the ordinate of the top-left vertex of the rectangular frame, box_width is the width of the frame, and box_height is the height of the frame. Next, the title position information in the training data, organized as such quadruples, is matched one-to-one with its resume, and the title annotation information is written into a JSON file. The JSON file and the resume titles are input together into the LayoutLMv3 model, and the fine-tuning is computed on a GPU. FIG. 2 shows the annotation result for a single resume.

S2: Run inference on the unannotated resume with the fine-tuned LayoutLMv3 object detection model to obtain the title position information of the unannotated resume.

In one embodiment, this step specifically includes:

In this embodiment, the fine-tuned LayoutLMv3 object detection model is used to obtain the title position information of the unannotated resume. Once S1 has been executed, the LayoutLMv3 model fine-tuned on resume images and its model parameters are available. For a new unannotated resume, the name, length, width, and other information of the resume image are obtained and organized into JSON format, then supplied as model input together with the image data. The model parameters are loaded into the neural network model, and the title coordinate positions inferred by the model are returned in JSON format, which yields the coordinate quadruples of the titles in the new resume.

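S3: Divide the unannotated resume into sections based on the title position information obtained in step S2 and an OCR algorithm, and extract the text in each section.

In one embodiment, this step specifically includes: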

This step mainly achieves text extraction; in this embodiment, the modular content of the resume is obtained from the title positions and an OCR algorithm. A resume typically consists of several page images, and its content continues across pages. The title coordinates are therefore sorted by resume page and by the ordinate of the title frame. After sorting, OCR is used to obtain the text in the resume. OCR also returns the coordinates of the recognized text, and these are compared with the title ordinates: the content lying between two titles is assigned to the section of the earlier title.

The first line of text in each section is then taken as its title, and the sections are classified into 7 categories: basic information, work experience, educational background, project experience, self-assessment, awards and certificates, and skills. The 7 categories are given the section labels BASIC_INFORMATION, WORK_EXPERIENCE, EDUCATION_BACKGROUND, PROJECT_EXPERIENCE, SELF_ASSESSMENT, REWARD_CERTIFICATES, and SKILL. A keyword list is drawn up for each section, and the text of each detected resume title is matched against the keywords in the lists; when any keyword matches, the title and its content are assigned to the section corresponding to that keyword. FIG. 3 illustrates the extraction of resume section text, where the content within one frame is one section.
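
The cross-page ordering described above can be sketched as follows. This is one possible reading, assuming titles and OCR lines are both reduced to a (page, y) position; the function name and data layout are hypothetical. Real OCR coordinates may be offset by a few pixels from the detected title frame, which this sketch does not handle.

```python
from bisect import bisect_right


def assign_lines_to_titles(titles, ocr_lines):
    """Group OCR lines under the nearest preceding title in reading order.

    titles: list of (page, y) positions of detected title frames.
    ocr_lines: list of (page, y, text) for recognized text lines.
    Returns one list of texts per title (sorted by page, then y), preceded
    by a list holding the lines that come before the first title.
    """
    ordered_titles = sorted(titles)
    groups = [[] for _ in range(len(ordered_titles) + 1)]
    for page, y, text in sorted(ocr_lines):
        # Count the titles located at or before this line in reading order.
        index = bisect_right(ordered_titles, (page, y))
        groups[index].append(text)
    return groups


groups = assign_lines_to_titles(
    titles=[(1, 480), (1, 120), (2, 300)],
    ocr_lines=[
        (1, 60, "Jane Doe"),
        (1, 120, "Work Experience"),
        (1, 200, "Acme Corp"),
        (1, 480, "Education"),
        (2, 50, "State University"),
        (2, 300, "Skills"),
        (2, 340, "Python"),
    ],
)
```

Sorting on the (page, y) pair is what lets a section that starts on one page absorb the text at the top of the next page, and the leading group is the natural source for the basic information fallback.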

In addition, for the BASIC_INFORMATION section, real resumes usually contain no keyword for it. The invention therefore proposes that, for a resume that contains no BASIC_INFORMATION keyword, the content before the first title on the first page of the resume is taken as the content of that section. The invention further includes step S4: visually displaying the title detection result on the corresponding resume. In one embodiment, this is specifically: for the resume title position information obtained in step S2, drawing the rectangular frame enclosing each title at the corresponding position in the resume using the Python programming language, for ease of reading and inspection.

In addition, a resume layout division system based on the LayoutLMv3 model is provided, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the resume layout division method when executing the computer program. A computer-readable storage medium is also provided, which stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method.

The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The memory may be a hard disk, the computer's internal memory, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, and so on.

The computer-readable storage medium may be an internal storage unit of a computer device, such as a hard disk or memory of the computer device. It may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device. Further, the computer-readable storage medium may include both internal storage units and external storage devices of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. It may also be used to temporarily store data that has been output or is to be output.

The foregoing describes only preferred embodiments of the invention and is not intended to limit it; the invention is intended to cover all modifications, equivalents, and alternatives falling within its spirit and principles.

Claims

1. A resume layout division method based on the LayoutLMv3 model, characterized by comprising the following steps:

S3: dividing the unannotated resume into sections based on the title position information of the unannotated resume obtained in step S2 and an OCR algorithm, and extracting the text in each section;

the step S1 specifically comprises the following steps:

S12: annotating the position information of the resume titles in the quadruple form of S11, writing the annotation information into a JSON file, and inputting the JSON file together with the resume titles into the LayoutLMv3 model to fine-tune it, obtaining the fine-tuned model parameters;

the step S2 specifically comprises the following steps:

2. The resume layout division method based on the LayoutLMv3 model according to claim 1, wherein the step S3 specifically comprises the following steps:

3. The resume layout division method based on the LayoutLMv3 model according to claim 2, wherein performing the section division in S32 using the title as the keyword specifically comprises the following steps:

S321: based on the layout and content of resumes, defining in advance the following 7 sections: basic information, work experience, educational background, project experience, self-assessment, awards and certificates, and skills, with the corresponding section labels BASIC_INFORMATION, WORK_EXPERIENCE, EDUCATION_BACKGROUND, PROJECT_EXPERIENCE, SELF_ASSESSMENT, REWARD_CERTIFICATES, and SKILL;

4. The resume layout division method based on the LayoutLMv3 model according to claim 3, wherein in S322, for the basic information section, if the actual resume contains no title text corresponding to that section, the content before the first title on the first page of the resume is taken as the content of the section.

5. The resume layout division method based on the LayoutLMv3 model according to claim 1, wherein the method further comprises the following step:

6. The resume layout division method based on the LayoutLMv3 model according to claim 5, wherein S4 is specifically: for the resume title position information obtained in step S2, drawing the rectangular frame enclosing each title at the corresponding position in the resume using the Python programming language.

7. A resume layout division system based on the LayoutLMv3 model, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the resume layout division method according to any one of claims 1 to 6 when executing the computer program.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any one of claims 1 to 6.