CN114997167A

CN114997167A - Resume content extraction method and device

Info

Publication number: CN114997167A
Application number: CN202210688679.2A
Authority: CN
Inventors: 弓源; 李长亮
Original assignee: Beijing Kingsoft Digital Entertainment Co Ltd
Current assignee: Beijing Kingsoft Digital Entertainment Co Ltd
Priority date: 2022-06-17
Filing date: 2022-06-17
Publication date: 2022-09-02

Abstract

The application provides a resume content extraction method and device, wherein the resume content extraction method comprises the following steps: acquiring a resume document to be identified; performing semantic recognition on the resume document, and splicing a plurality of lines of texts with associated semantics in the resume document into a line to obtain a spliced document; and identifying key fields from the spliced document, and extracting target resume content from the resume document according to the key fields. The scheme can improve the accuracy of resume content extraction.

Description

Resume content extraction method and device

Technical Field

The application relates to the technical field of computers, in particular to a resume content extraction method. The application also relates to a resume content extraction device, a computing device and a computer readable storage medium.

Background

With the development of internet technology, resumes in the form of electronic documents have seen explosive growth. In order to improve the processing efficiency of a large number of resumes, content extraction may be performed on the resumes.

In the related art, the content of a resume is generally segmented into words, keyword matching is performed on the segmentation result, and the resume content is extracted based on the matching result. In practice, however, resumes come in many formats and personalized layouts are common, so the distribution of content within a resume varies widely. As a result, word segmentation of the resume content is prone to errors when the content is extracted. Simple keyword matching therefore tends to produce inaccurate extraction results and generalizes poorly, which makes it difficult to apply to resumes in different formats.

Disclosure of Invention

In view of this, the embodiment of the present application provides a resume content extraction method to solve the technical defects in the prior art. The embodiment of the application also provides a resume content extraction device, a computing device and a computer readable storage medium.

According to a first aspect of the embodiments of the present application, there is provided a resume content extraction method, including:

acquiring a resume document to be identified;

performing semantic recognition on the resume document, and splicing a plurality of lines of texts with associated semantics in the resume document into a line to obtain a spliced document;

and identifying key fields from the spliced document, and extracting target resume content from the resume document according to the key fields.

According to a second aspect of embodiments of the present application, there is provided a resume content extraction apparatus including:

the document acquisition module is configured to acquire a resume document to be identified;

the text splicing module is configured to perform semantic identification on the resume document, splice a plurality of lines of texts with associated semantics in the resume document into a line, and obtain a spliced document;

and the content extraction module is configured to identify key fields from the spliced documents and extract target resume content from the resume documents according to the key fields.

According to a third aspect of embodiments herein, there is provided a computing device comprising:

a memory and a processor;

the memory is used for storing computer-executable instructions, and the processor realizes the steps of the resume content extraction method when executing the computer-executable instructions.

According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the resume content extraction method.

According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program which, when executed by the chip, implements the steps of the resume content extraction method.

According to the scheme provided by the embodiment of the application, a resume document to be identified is acquired; semantic recognition is performed on the resume document, and multiple lines of text with associated semantics in the resume document are spliced into one line to obtain a spliced document; key fields are identified from the spliced document, and target resume content is extracted from the resume document according to the key fields. Because the semantics of the multiple lines of text are associated, splicing ensures that target text which expresses associated semantics but was broken across lines in the resume document is placed on one line, which reduces the splitting of words by line breaks. Identifying the key fields from the spliced document therefore reduces key field recognition errors caused by such splitting and improves the accuracy of resume content extraction.

Drawings

Fig. 1 is a schematic structural diagram of a resume content extraction system according to an embodiment of the present application;

fig. 2 is a flowchart of a first resume content extraction method according to an embodiment of the present application;

fig. 3 is a flowchart of a second resume content extraction method according to an embodiment of the present application;

fig. 4 is a flowchart of a third resume content extraction method according to an embodiment of the present application;

fig. 5 is a flowchart of a fourth resume content extraction method according to an embodiment of the present application;

fig. 6 is a flowchart of a fifth resume content extraction method according to an embodiment of the present application;

fig. 7 is a flowchart of a sixth resume content extraction method according to an embodiment of the present application;

fig. 8 is a flowchart of a seventh resume content extraction method according to an embodiment of the present application;

fig. 9 is a flowchart of an eighth resume content extraction method according to an embodiment of the present application;

fig. 10 is a flowchart of a ninth resume content extraction method according to an embodiment of the present application;

fig. 11 is a schematic structural diagram of a resume content extraction apparatus according to an embodiment of the present application;

fig. 12 is a block diagram of a computing device according to an embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.

The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It will be understood that, although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.

First, the terms involved in one or more embodiments of the present application are explained.

Information extraction: refers to techniques for extracting structured information from structured text, semi-structured text, or unstructured text. Structured text refers to text having a specific format, such as a title. Semi-structured text refers to text that contains both structured text and unstructured text. Unstructured text refers to text that has no specific format such as a title, e.g., body content, abstract, etc.

Named Entity Recognition (NER): refers to identifying entities with specific meanings in text, mainly including names of people, places, and organizations, proper nouns, and the like.

Entity: refers to a word or phrase in the text that denotes something with a particular meaning.

Text classification: refers to assigning a text to one or several categories under a given classification system.

Optical Character Recognition (OCR): refers to the process of analyzing, recognizing, and processing a file in a certain form to obtain the characters in the file.

Automaton (AC-TREE): a multi-pattern matching algorithm commonly used to find, in an input string, the substrings that belong to a finite "dictionary" of patterns, where the dictionary is organized, for example, as a dictionary tree (trie).
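As an illustration of the multi-pattern matching described in the last definition, the following is a minimal sketch that assumes "AC" denotes the Aho-Corasick automaton. The function names `build_automaton` and `find_all` are hypothetical and do not come from the application.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output sets for the patterns."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link of each state
    out = [set()]  # patterns recognized when a state is reached
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pattern)
    # Breadth-first traversal fills in the failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return sorted (start_index, pattern) pairs for every match in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return sorted(matches)


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The input string is scanned once regardless of how many dictionary entries there are, which is why this kind of automaton suits matching many preset key fields at the same time.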

Resume content extraction is an application of information extraction technology to resume processing. Information extraction is a basic and important technology in the field of natural language processing; it analyzes and processes structured, semi-structured, and unstructured data and extracts structured text information. Resume content extraction parses a resume and extracts the content of the document based on the parsing result, which is of practical significance for company recruitment, talent assessment, talent management, and the like.

Fig. 1 is a schematic structural diagram of a resume content extraction system according to an embodiment of the present application.

The execution subject of the resume content extraction method provided by the embodiment of the present application may be a server or a terminal, which is not limited in the embodiment of the present application. The terminal may be any electronic product capable of human-computer interaction with a user, such as a Personal Computer (PC), a mobile terminal, a pocket PC, a tablet PC, and so on. The server may be one server, a server cluster composed of multiple servers, or a cloud computing service center, which is not limited in this embodiment of the present application.

Taking the terminal as the execution subject as an example, the terminal acquires the resume document to be identified; performs semantic recognition on the resume document and splices multiple lines of text with associated semantics in the resume document into one line to obtain a spliced document; identifies key fields from the spliced document; and extracts target resume content from the resume document according to the key fields. When the multiple lines of text with associated semantics are spliced into one line to obtain the spliced document, a text splicing model may be used; the text splicing model may be trained by the server and sent to the terminal. When the target resume content is extracted from the resume document according to the key fields, a named entity recognition (NER) model may be used; the NER model may likewise be trained by the server and sent to the terminal.

Taking the server as the execution subject as an example, the server acquires the resume document to be identified; performs semantic recognition on the resume document and splices multiple lines of text with associated semantics in the resume document into one line to obtain a spliced document; identifies key fields from the spliced document; and extracts target resume content from the resume document according to the key fields. The server can train a text splicing model on a first training sample and use the text splicing model when splicing the multiple lines of text with associated semantics into one line to obtain the spliced document. The server can train a named entity recognition (NER) model on a second training sample and use the NER model when extracting the target resume content from the resume document according to the key fields.

In the embodiment of the application, the semantics of the multiple lines of text are associated. Splicing therefore ensures that target text which expresses associated semantics but was broken across lines in the resume document is placed on one line, which reduces the splitting of words by line breaks. Identifying the key fields from the spliced document thus reduces key field recognition errors caused by such splitting and improves the accuracy of resume content extraction.

Those skilled in the art should understand that the above terminal and server are only examples; other existing or future terminals or servers, if applicable to the embodiments of the present application, should also fall within the scope of protection of the embodiments of the present application and are incorporated herein by reference.

In the application, a resume content extraction method is provided. The present application also relates to a resume content extraction apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.

Fig. 2 is a flowchart illustrating a first resume content extraction method according to an embodiment of the present application, which specifically includes the following steps:

S202, obtaining the resume document to be identified.

In a particular application, a resume document refers to a document that records a person's basic information, educational experience, work experience, project experience, practical experience, activity experience, honors and certificates, skills, and so forth. The document format of the resume document may be various. Illustratively, the document format may include: text documents (Word documents), presentations (PPT format documents), portable documents (PDF documents), and so forth. A resume document in any document format can be used in the present application, and the present embodiment does not limit this.

Also, the manner of obtaining the resume document to be recognized may be various. Illustratively, the resume document to be identified may be looked up from a database. Or, for example, the resume document uploaded by the user can be received as the resume document to be identified. This is all reasonable. Any method for obtaining the resume document to be identified can be used in the present application, and the present embodiment does not limit this.

S204, performing semantic recognition on the resume document, and splicing a plurality of lines of texts with associated semantics in the resume document into a line to obtain a spliced document.

In a particular application, under the influence of writing habits and resume formats, content describing one complete experience is likely to contain line breaks and thus span multiple lines of text. For example, the content "at laboratory L1 of school S1, following professor P1, completed project PJ1 and obtained achievement A1" describes one complete experience, namely obtaining achievement A1. After line breaks, it becomes line 1 "at laboratory L1 of school S1", line 2 "following professor P1, completed project", and line 3 "PJ1 and obtained achievement A1". Therefore, the multiple lines of content caused by line breaks in the resume document can be spliced into one line to obtain the spliced document. Because the multiple lines of content caused by line breaks describe one complete experience, they are semantically associated. Semantic recognition can therefore be performed on the resume document, and the multiple lines of text with associated semantics in the resume document are spliced into one line.

The manner of performing semantic recognition on the resume document may be various. Illustratively, by the conventions of written expression, the text preceding a period or a semicolon is semantically associated. Thus, preset punctuation marks, such as periods or semicolons, may be searched for in the resume document, and the multiple lines of text between two found preset punctuation marks are determined to be multiple lines of text with associated semantics. For the first preset punctuation mark found, the multiple lines of text before it are determined to be multiple lines of text with associated semantics. Or, for example, multiple lines of text with associated semantics in the resume document may be spliced into one line by using a text splicing model obtained by pre-training. For ease of understanding and reasonable layout, the second example is described in detail below in the form of an alternative embodiment.
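The punctuation-based example can be sketched as follows. This is a simplified illustration rather than the method of the application: it assumes the preset punctuation set shown in `PRESET_PUNCTUATION`, checks the marks only at line ends, and joins lines with a configurable `joiner`; all names are hypothetical.

```python
# Assumed preset punctuation: full-width and half-width periods and semicolons.
PRESET_PUNCTUATION = ("。", "；", ".", ";")


def splice_lines(lines, marks=PRESET_PUNCTUATION, joiner=" "):
    """Join consecutive lines until one ends with a preset punctuation mark."""
    spliced, buffer = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        buffer.append(text)
        if text.endswith(marks):
            spliced.append(joiner.join(buffer))
            buffer = []
    if buffer:  # trailing text that has no closing mark
        spliced.append(joiner.join(buffer))
    return spliced


resume_lines = [
    "At laboratory L1 of school S1,",
    "following professor P1, completed project",
    "PJ1 and obtained achievement A1.",
    "Won award W1;",
]
for line in splice_lines(resume_lines):
    print(line)
```

For Chinese text, an empty `joiner` would be the more natural choice, since words are not separated by spaces.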

In addition, in order to better cope with the resume documents with diversified formats and contents, text extraction and blocking can be performed on the resume documents before semantic recognition is performed. For ease of understanding and reasonable layout, the manner in which the text is extracted and partitioned is described in detail below in the form of alternative embodiments.

S206, identifying key fields from the spliced document, and extracting target resume content from the resume document according to the key fields.

The key fields are fields in the target resume content, and specific key fields can be set according to the extraction requirement for the target resume content. For example, if the target resume content is content describing an experience, the key fields may include "time", the name of the experience, the level and result of the experience, and so on. If the target resume content is a certificate, the key field may be "certificate". If the target resume content is a skill, the key fields may be "skill", "specialty", etc.

In a specific application, the key field is a field set based on an extraction requirement, and the specific data of the key field needs to be extracted subsequently, so the key field first needs to be identified from the spliced document. The key field may be identified from the spliced document in various ways. One possible implementation is line-by-line identification: each line of the spliced document is matched against the preset key fields, and the characters that match successfully are determined to be key fields. The matching of the lines may be performed serially or in parallel; that is, either the next line is matched after one line has been matched, or all lines are matched at the same time. In another possible implementation, the document is identified as a whole: the spliced document is matched directly against the preset key fields, and the characters that match successfully are determined to be key fields.
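The two identification modes can be sketched as follows, assuming plain substring matching and a hypothetical list of preset key fields. The application does not specify the matching algorithm, so `PRESET_KEY_FIELDS`, `match_by_line`, and `match_whole` are illustrative names only.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical preset key fields, chosen for illustration.
PRESET_KEY_FIELDS = ["time", "certificate", "skill"]


def match_line(line, key_fields=PRESET_KEY_FIELDS):
    """Return the preset key fields that occur in one line."""
    return [field for field in key_fields if field in line]


def match_by_line(lines, parallel=False):
    """Match every line, either one after another or concurrently."""
    if parallel:
        with ThreadPoolExecutor() as pool:
            # pool.map keeps the results in the order of the input lines.
            return list(pool.map(match_line, lines))
    return [match_line(line) for line in lines]


def match_whole(document, key_fields=PRESET_KEY_FIELDS):
    """Match the spliced document as a single string."""
    return [field for field in key_fields if field in document]


spliced = ["time: 2020", "skill: Python; certificate: C1", "other text"]
print(match_by_line(spliced))
print(match_by_line(spliced, parallel=True))
print(match_whole("\n".join(spliced)))
```

Line-by-line matching keeps track of which line each key field came from, while whole-document matching only reports which key fields are present.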

In addition, the way of extracting the target resume content from the resume document may also be various according to the key fields. The following detailed description is given by way of example.

For example, the layout information of the resume document can be determined according to the position information of the key field in the spliced document, and the content in the area represented by the specified layout information is extracted as the target resume content. Or, for example, the field type of the key field may be identified, and the key field data may be extracted from the spliced document as the target resume content by using a target extraction manner corresponding to the field type. Or, for example, the spliced document may be partitioned into a plurality of text sub-blocks; the field type of the key field is identified, and the key field data is extracted from each text sub-block as the target resume content by using a target extraction manner corresponding to the field type. The second example is similar to the third example, except that the second example does not partition the spliced document into blocks. For ease of understanding and reasonable layout, the third example is described in detail below in the form of an alternative embodiment. For the implementation of the second example, see the parts of the subsequent alternative embodiments for the third example that the two examples share.

In the scheme provided by one embodiment of the application, the semantics of the multiple lines of text are associated. Splicing therefore ensures that target text which expresses associated semantics but was broken across lines in the resume document is placed on one line, which reduces the splitting of words by line breaks. Identifying the key fields from the spliced document thus reduces key field recognition errors caused by such splitting and improves the accuracy of resume content extraction.

In an alternative implementation, fig. 3 shows a flowchart of a second resume content extraction method provided in an embodiment of the present application. In this method, the target resume content is content describing an experience, and the method includes the following steps:

S302, obtaining the resume document to be identified.

The step S302 is the same as the step S202 in the embodiment of fig. 2, and is not repeated herein, for details, see the description of the embodiment of fig. 2.

S304, extracting the resume text from the resume document by using a preset document character extraction tool, and partitioning the resume text into blocks to obtain a target text block that describes experience.

In a specific application, the preset document character extraction tool may be, for example, PDFMiner. PDFMiner is a tool that can extract information from PDF documents. Unlike other PDF-related tools, it focuses entirely on obtaining and analyzing text data. PDFMiner can obtain the exact location of text on a page together with associated information such as font and line number. If PDFMiner is used, the document format of the resume document can be identified from the file extension of the resume document, a PDF conversion interface is called to convert a resume document that is not in PDF format into PDF format, and PDFMiner is then used to extract the resume text from the resume document. The PDF conversion interface may be provided by a PDF document application. In addition, to further improve the extraction accuracy of the resume text, OCR may be combined with PDFMiner. Specifically, OCR can be used to recognize characters, such as Chinese characters, in the resume document; on this basis, the preset document character extraction tool PDFMiner can be used to extract characters from the resume document with higher accuracy to serve as correction characters; and the recognition result of the OCR is corrected by using the correction characters. In this way, OCR, which is applicable to documents in various formats, and PDFMiner, which has higher recognition accuracy, complement each other, and the extraction accuracy of the resume text can be further improved.

Moreover, partitioning the resume text may specifically include: identifying title fields in the resume text, and dividing the content between two different title fields into one text block. On this basis, the text blocks obtained by the division can be classified according to the content they include. For example, the title field adjacent to the first line of each text block may be determined as the title of the text block, and the text block type corresponding to the title of the text block is looked up in a pre-established correspondence between titles and text block types. Or, each text block may be input into a first classification model obtained by pre-training to obtain the text block type of the text block. The first classification model is trained by using sample text blocks and the type labels of the sample text blocks. Thus, a complete resume document may be divided into different types of text blocks, such as a text block including basic information, a text block including education information, a text block including work information, a text block including activity information, a text block including honor certificates, a text block including skill information, and so forth. Experience is typically accumulated through education, work, and activities. Therefore, obtaining the target text block that describes experience may specifically include, for example: determining at least one of the text block including education information, the text block including work information, and the text block including activity information as the target text block.
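A minimal sketch of the extension check and text extraction is given below, assuming the pdfminer.six distribution of PDFMiner and its `pdfminer.high_level.extract_text` function. The `convert_to_pdf` argument is a hypothetical stand-in for the PDF conversion interface, and the OCR correction step is not shown.

```python
from pathlib import Path


def is_pdf(path):
    """Identify the document format from the file extension."""
    return Path(path).suffix.lower() == ".pdf"


def extract_resume_text(path, convert_to_pdf=None):
    """Extract the resume text, converting non-PDF input to PDF first.

    `convert_to_pdf` is a hypothetical callable standing in for the PDF
    conversion interface: it takes a path and returns the path of a PDF.
    """
    if not is_pdf(path):
        if convert_to_pdf is None:
            raise ValueError(f"no PDF converter supplied for {path!r}")
        path = convert_to_pdf(path)
    # Imported lazily so that is_pdf() works without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))
```

A call such as `extract_resume_text("resume.pdf")` returns the text of the document as one string, which can then be split into lines for blocking and splicing.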
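The title-based partitioning and the lookup of text block types can be sketched as follows. The title strings, the type names, and the function names are assumptions made for illustration; the classification-model alternative is not shown.

```python
# Hypothetical correspondence between titles and text block types.
TITLE_TO_TYPE = {
    "Basic Information": "basic",
    "Education": "education",
    "Work Experience": "work",
    "Activities": "activity",
    "Honors and Certificates": "honor",
    "Skills": "skill",
}
# Block types assumed to describe experience.
EXPERIENCE_TYPES = {"education", "work", "activity"}


def split_into_blocks(lines, title_to_type=TITLE_TO_TYPE):
    """Start a new text block at each title field and collect the lines under it."""
    blocks = []
    current = None
    for line in lines:
        text = line.strip()
        if text in title_to_type:
            current = {"title": text, "type": title_to_type[text], "lines": []}
            blocks.append(current)
        elif text and current is not None:
            # Lines before the first title are ignored in this sketch.
            current["lines"].append(text)
    return blocks


def target_blocks(blocks, experience_types=EXPERIENCE_TYPES):
    """Keep only the blocks whose type describes experience."""
    return [block for block in blocks if block["type"] in experience_types]


resume = ["Basic Information", "Name: N1", "Education", "School S1", "Skills", "Python"]
for block in target_blocks(split_into_blocks(resume)):
    print(block["title"], block["lines"])
```

If `target_blocks` returns an empty list, no target text block was obtained, and the whole resume text would need to be searched instead.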

S306, semantic recognition is carried out on the target text blocks in the resume document, and multiple lines of target texts with associated semantics in the target text blocks are spliced into one line to obtain a spliced document.

The target text block refers to a text block corresponding to the data describing the experience.

In practical application, a preset document character extraction tool is used to extract the resume text from the resume document, and the resume text is partitioned to obtain the text blocks it contains. It can then be determined whether these text blocks include a target text block describing experience. If so, the target text block is determined to have been obtained; that is, the data describing the experience is structured data and is concentrated in the target text block. In this case, semantic recognition is performed on the target text block, and multiple lines of target text with associated semantics in the target text block are spliced into one line to obtain the spliced document, so that the specific data describing the experience can subsequently be extracted directly from the spliced document.

In a specific implementation, when determining whether the text blocks include a target text block describing experience, a keyword (such as a title) of each text block may be extracted, and whether each text block is a target text block describing experience is determined based on the preset keywords corresponding to the target text block describing experience.

For example, assume that the preset keywords corresponding to the target text block describing experience are education, work, and activity, and that partitioning the resume text yields a basic information text block, an education information text block, a work information text block, an activity information text block, an honor certificate text block, and a skill information text block. The education information text block, the work information text block, and the activity information text block hit the preset keywords, so the target text blocks obtained are the education information text block, the work information text block, and the activity information text block.

It should be noted that, if the text blocks obtained by partitioning the resume text include a target text block describing experience, the data describing the experience is structured data and is concentrated in the target text block. It is therefore only necessary to perform identification on the target text block and extract the specific data describing the experience; the other text blocks do not need to be identified or matched, which improves the content extraction efficiency.

In the embodiment of the application, the target text block is the text block, among the text blocks obtained by partitioning the resume text, that corresponds to the description of experience. So that the specific data describing the experience can subsequently be obtained from the target text block, semantic recognition can be performed on the target text block in the resume document, and multiple lines of target text with associated semantics in the target text block are spliced into one line to obtain the spliced document. The specific data describing the experience can then be identified and obtained directly from the spliced document. Because data with associated semantics is spliced into one line in the spliced document, loss of extracted data caused by line breaks within semantically associated data is avoided, and the integrity of the extracted data is improved.

S308, identifying key fields from the spliced document, and extracting target resume content from the resume document according to the key fields.

The step S308 is the same as the step S206 in the embodiment of fig. 2, and is not repeated herein, and the details are described in the embodiment of fig. 2.

In an optional implementation manner, as shown in fig. 4, a flowchart of a third resume content extraction method provided in an embodiment of the present application, after the resume text is partitioned, the resume content extraction method provided in the embodiment of the present application may further include the following steps:

s402, determining whether a target text block is obtained;

s404, if the target text block is not obtained, identifying key fields from each block of the resume text, and extracting the target resume content from the resume document according to the key fields.

It should be noted that, partitioning the resume text may obtain each text block included in the resume text, and then may determine whether each text block includes a target text block describing the experience, and if not, may determine that the target text block including the describing experience is not obtained, that is, the data describing the experience is unstructured data, and the data describing the experience may be dispersedly distributed in each position of the resume text, so that each block of the resume text may have a key field, at this time, the key field may be identified from each block of the resume text, and the target resume content may be extracted from the resume document according to the key field.

In a specific application, for data of an unstructured text, that is, data of description experiences scattered at various positions of a resume text, there is very probably no line feed. Therefore, it is possible to directly identify key fields from the blocks of the resume text and perform the extraction of the target resume content from the resume document according to the key fields. The identification of the key fields is equivalent to performing key field matching on the full resume text, namely each block of the resume text.

The key fields may be, for example, "graduation school: school name", "specialty: specialty name", "education level: level name", "graduation time: date", and the like. The school name, specialty name, education level name, and date may be matched by fuzzy search. Fuzzy search is the opposite of exact search: the resume text is searched using synonyms of the key field, which yields a broader set of results. Synonyms can be configured according to specific needs. For example, if "Peking University" and "Tsinghua University" are configured as synonyms of "school name", then searching for "school name" returns content containing "Peking University" or "Tsinghua University" as the search result. Synonyms of the specialty name, the education level name, and the date can be set similarly. For example, the following target resume content is extracted from the resume document: "graduation school: Peking University; specialty: computer; education level: master; graduation time: October 2010".
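The synonym-based fuzzy search described above can be illustrated with a minimal sketch. The synonym table, the function name `fuzzy_search` and the sample lines are assumptions made for illustration only; the application does not specify how synonyms are stored or matched.

```python
# Hypothetical synonym table: each key-field placeholder maps to the
# concrete strings that should count as a hit for it.
SYNONYMS = {
    "school name": ["Peking University", "Tsinghua University"],
    "education level": ["bachelor", "master", "doctor"],
}


def fuzzy_search(lines, field):
    """Return every line that contains the field name or one of its synonyms."""
    terms = [field] + SYNONYMS.get(field, [])
    return [
        line for line in lines
        if any(term.lower() in line.lower() for term in terms)
    ]


resume_lines = [
    "Graduated from Peking University in 2010",
    "Hobbies: piano, hiking",
    "Exchange semester at Tsinghua University",
]
hits = fuzzy_search(resume_lines, "school name")
```

Searching for the placeholder "school name" therefore returns the lines mentioning either configured school, even though neither line contains the words "school name".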

This embodiment can extract target resume content described in the form of unstructured text (that is, scattered at various positions of the resume text), thereby broadening the application range of the embodiment of the application.

In an optional implementation manner, as shown in fig. 5, which is a flowchart of a fourth resume content extraction method provided in an embodiment of the present application, the extracting target resume content from the resume document according to the key field may specifically include the following steps:

S502, determining the similarity among the key fields.

In a specific application, the similarity between the key fields may be determined in various ways. For example, the Euclidean distance between the key fields may be calculated as the similarity; or the number of identical characters at the same positions in the key fields may be counted as the similarity; or a synonym library may be established in advance and each key field matched against it, with the result of whether the key fields are synonyms used as the similarity. Any of these is reasonable. Synonyms are words with the same semantics; for example, a full term and its abbreviation are synonyms.
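Two of the similarity measures mentioned above can be sketched as follows. The function names and the contents of the synonym library are assumptions for illustration; the Euclidean-distance variant is omitted because it would require a vector representation of the key fields that the text does not specify.

```python
def positional_similarity(a, b):
    """Count the characters that are identical at the same position in both fields."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Hypothetical synonym library: each set groups terms with the same meaning,
# for example a full name and its abbreviation.
SYNONYM_GROUPS = [
    {"Peking University", "PKU"},
    {"Bachelor of Science", "BSc"},
]


def are_synonyms(a, b):
    """Return True when both fields are equal or belong to the same synonym group."""
    return a == b or any(a in group and b in group for group in SYNONYM_GROUPS)
```

With these definitions, "2012" and "2011" share three positions but are not identical, which is why a strict condition is useful for time fields.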

S504, determining, from all the key fields, target key fields whose similarity satisfies a similarity condition.

Exemplarily, the similarity condition may specifically include: the similarity being greater than a similarity threshold, the key fields being synonyms, and so forth. The similarity threshold may be set according to the specific scenario. For example, for key fields representing time, the similarity threshold may be 1, that is, time key fields are considered duplicates only when they are identical. Setting the threshold this way for the time scenario can further improve accuracy; for example, it avoids determining the key fields "2012" and "2011" as target key fields.

S506, deduplicating the target key fields, and fusing the key fields other than the target key fields to obtain the target resume content.

Deduplicating the target key fields means selecting one of the target key fields as the retained key field and deleting the other target key fields. In this way, the embodiment deduplicates the target resume content and further improves the accuracy of resume content extraction. For example, owing to differences in writing habits, a resume may include two parts that respectively summarize and detail the target resume content. The key fields of the two parts are likely to repeat, so the scheme provided by this embodiment can deduplicate them to remove redundant content. Moreover, fusing the key fields other than the target key fields means complementing the content of the non-repeated key fields according to their differences. For example, if the key field f1 is "specialty:" and the key field f2 is "piano specialty", the key field f1 and the key field f2 may be fused to obtain "specialty: piano".
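Steps S502 to S506 can be sketched as below, under the assumption that each key field has already been parsed into a (label, value) pair, so that "specialty:" becomes ("specialty", "") and "piano specialty" becomes ("specialty", "piano"). The function name `dedupe_and_fuse` and the data layout are hypothetical.

```python
def dedupe_and_fuse(fields, is_similar):
    """Deduplicate similar values and fuse the rest per label.

    fields: list of (label, value) pairs in document order.
    is_similar: callable deciding whether two values are duplicates.
    """
    merged = {}
    for label, value in fields:
        values = merged.setdefault(label, [])
        if not value:
            continue  # an empty field contributes only its label
        if any(is_similar(value, kept) for kept in values):
            continue  # duplicate: the first occurrence is the retained field
        values.append(value)
    return merged


fields = [
    ("specialty", ""),        # "specialty:" with no content
    ("specialty", "piano"),   # "piano specialty"
    ("time", "2012"),
    ("time", "2012"),         # repeated, will be removed
    ("time", "2011"),         # different year, must be kept
]
# For time-like fields the condition used here is exact equality.
result = dedupe_and_fuse(fields, is_similar=lambda a, b: a == b)
```

The empty "specialty" field is complemented by the value found elsewhere, while the repeated "2012" is kept only once and "2011" survives as a distinct value.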

In an optional implementation manner, as shown in fig. 6, which is a flowchart of a fifth resume content extraction method provided in an embodiment of the present application, the performing semantic recognition on the resume document, and splicing multiple lines of texts with associated semantics in the resume document into a line to obtain a spliced document specifically may include the following steps:

S602, splicing multiple lines of texts with associated semantics in the resume document into one line by using a pre-trained text splicing model to obtain a spliced document, wherein the text splicing model is trained with resume sample documents, a resume sample document includes multi-line cut texts obtained by randomly cutting an original text and inserting line breaks, and cut texts belonging to the same line of the original text carry a tag indicating associated semantics.

In a specific application, the original text can be constructed manually, or extracted from historical resume documents by using a preset document character extraction tool. On this basis, the original text is randomly cut and line breaks are inserted to obtain the multi-line cut texts. Random cutting enriches the line-break situations reflected by the cut texts, which improves applicability to diverse writing habits and layouts and in turn improves splicing accuracy. For example, the tag indicating that a semantic association exists may be the tag "1", and the tag indicating that no semantic association exists may be the tag "0". For example, if the semantics of the cut text T1 and the cut text T2 are associated and the two need to be spliced, the tag is "1"; if the cut text T2 and the cut text T3 have no semantic association and do not need to be spliced, the tag is "0".
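The construction of labelled training samples by random cutting can be sketched as follows. This is only one possible reading: the function names, the maximum number of cuts per line and the choice to label adjacent pairs of cut texts are assumptions, not details given in the application.

```python
import random


def cut_line(line, rng, max_cuts=2):
    """Cut one original line into consecutive pieces at random positions."""
    if len(line) < 2:
        return [line]
    n_cuts = rng.randint(1, min(max_cuts, len(line) - 1))
    points = sorted(rng.sample(range(1, len(line)), n_cuts))
    return [line[i:j] for i, j in zip([0] + points, points + [len(line)])]


def build_pairs(original_lines, seed=0):
    """Return (text_a, text_b, tag) for every pair of adjacent cut texts.

    The tag is 1 when both pieces come from the same original line
    (they should be spliced) and 0 otherwise.
    """
    rng = random.Random(seed)
    pieces = []  # (piece text, index of the original line it came from)
    for idx, line in enumerate(original_lines):
        pieces.extend((piece, idx) for piece in cut_line(line, rng))
    return [
        (a, b, 1 if ia == ib else 0)
        for (a, ia), (b, ib) in zip(pieces, pieces[1:])
    ]


original = ["2008-2012 Peking University", "Computer science, bachelor"]
pairs = build_pairs(original)
```

Splicing every pair tagged "1" and starting a new line at every pair tagged "0" reconstructs the original lines, which is the behaviour the text splicing model is meant to learn.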

In an optional implementation manner, as shown in fig. 7, which is a flowchart of a sixth resume content extraction method provided in an embodiment of the present application, the identifying a key field from the spliced document, and extracting the target resume content from the resume document according to the key field may specifically include the following steps:

S702, identifying key fields from the spliced document, and determining the layout information of the resume document according to the key fields.

In a specific application, determining the layout information of the resume document according to the key fields may include: obtaining the layout information of the resume document according to the order in which the key fields appear in the resume document. For example, if the key fields appear in the resume document in the order "time", "school" and "specialty", the layout information of the resume document is "time school specialty".

S704, dividing the resume document into a plurality of text sub-blocks according to the layout information.

Illustratively, dividing the resume document into a plurality of text sub-blocks according to the layout information may specifically include: taking each piece of layout information, together with the text between it and the next adjacent piece of layout information, as one text sub-block. For example, if the first layout information "time D1 school S1 specialty M1" and the second layout information "time D2 school S2 specialty M2" are adjacent, then the first layout information and the text T1 between the first and second layout information form one text sub-block, namely the corresponding education experience [time D1 school S1 specialty M1, text T1], where the text T1 may be content describing the experience in the form of unstructured text. In addition, two pieces of layout information may be treated as different when they differ in at least one of time, school and specialty.
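The sub-block division of step S704 can be sketched as below. The two regular expressions that decide whether a line carries layout information are hypothetical stand-ins for the key field identification of step S702, and the date and school formats are assumed for the example.

```python
import re

# Hypothetical detectors: a layout line carries a date range and a school name.
TIME = re.compile(r"\d{4}\.\d{2}\s*-\s*\d{4}\.\d{2}")
SCHOOL = re.compile(r"\b\w+ University\b")


def is_layout_line(line):
    return bool(TIME.search(line) and SCHOOL.search(line))


def split_sub_blocks(lines):
    """Group each layout line with the text that follows it, up to the next layout line."""
    blocks = []
    for line in lines:
        if is_layout_line(line) or not blocks:
            # A layout line opens a new sub-block; text that appears before
            # the first layout line is kept in a sub-block of its own.
            blocks.append([line])
        else:
            blocks[-1].append(line)
    return blocks


spliced_lines = [
    "2006.09 - 2010.06 Peking University Computer Science",
    "Worked in laboratory L1 on project PJ1",
    "2010.09 - 2013.06 Tsinghua University Software Engineering",
    "Thesis on text extraction",
]
blocks = split_sub_blocks(spliced_lines)
```

Each resulting sub-block corresponds to one education experience of the form [layout information, descriptive text].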

S706, respectively extracting key field data from each text sub-block, and obtaining the target resume content according to the key field data.

In a specific application, the key field data may be extracted from each text sub-block in various ways. As a first example, the key field data can be extracted from each text sub-block directly by using a named entity recognition (NER) model or a preset expression rule. As a second example, different target extraction modes may be adopted for key field data of different field types. The second example is described below in the form of an alternative embodiment.

In an optional implementation manner, as shown in fig. 8, which is a flowchart of a seventh resume content extraction method provided in an embodiment of the present application, the extracting key field data from each text sub-block may specifically include the following steps:

S802, extracting the key field data from each text sub-block by using a target extraction mode corresponding to the field type of the key field.

In a specific application, determining the target extraction mode corresponding to a field type may specifically include: looking up the target extraction mode corresponding to the determined field type in a pre-established correspondence between field types and extraction modes; or calling a target extraction mode that carries the field type; and the like. In this embodiment, the key field data of a key field is extracted with the target extraction mode corresponding to the field type of the key field, so a more suitable target extraction mode can be set according to the characteristics of the key field data of each field type, which improves extraction efficiency and accuracy. The manner in which the key field data is extracted is described below in an alternative embodiment.
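The lookup in a pre-established correspondence between field types and extraction modes can be sketched as a simple dispatch table. The field type names, the two extractor functions and their patterns are assumptions chosen for illustration.

```python
import re


def extract_time(text):
    """Hypothetical extraction mode for time-type fields (format YYYY.MM)."""
    return re.findall(r"\d{4}\.\d{2}", text)


def extract_school(text):
    """Hypothetical extraction mode for school-name fields."""
    return re.findall(r"\b\w+ University\b", text)


# Pre-established correspondence between field types and extraction modes.
EXTRACTORS = {
    "time": extract_time,
    "school": extract_school,
}


def extract(field_type, text):
    """Look up the extraction mode for the field type and apply it to the text."""
    try:
        extractor = EXTRACTORS[field_type]
    except KeyError:
        raise ValueError(f"no extraction mode registered for {field_type!r}") from None
    return extractor(text)
```

A new field type can then be supported by registering one more entry in the table, without touching the calling code.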

In an alternative implementation manner, as shown in fig. 9, which is a flowchart of an eighth resume content extraction method provided in an embodiment of the present application, the field types include: a first field type, and/or a second field type; correspondingly, the extracting key field data from each text sub-block by using the target extracting manner corresponding to the field type of the key field may specifically include the following steps:

S902, respectively extracting key field data from each text sub-block by using a named entity recognition (NER) model or an expression rule corresponding to the first field type;

S904, respectively extracting candidate key field data from each text sub-block by using the named entity recognition (NER) model, extracting correction data from the text sub-block corresponding to the candidate key field data by using the expression rule corresponding to the second field type, and correcting the corresponding candidate key field data by using the correction data to obtain the key field data.

The first field type in the field types refers to a type in which the extracted key field data (i.e., entities) are accurate and do not need to be corrected. For example, since the extraction of the digital information is often accurate and does not need to be corrected, the first field type may be a digital information type, such as a time type, and the key field data of the first field type may include: time data.

The second field type refers to a type in which the extracted key field data (i.e., entities) may contain errors and needs to be corrected. For example, the extraction of text information is affected by semantics, word segmentation, sentence breaks and the like, so the extracted text information may contain errors and needs to be corrected by rules. The second field type may therefore be a text information type, for example a specialty name type, a company name type, a school name type, an education level type, a job level type, and the like, and the key field data of the second field type may include: specific specialty names, company names, school names, education levels, job levels, and the like. The key field data assigned to each field type may be divided according to the specific application scenario; the above is only an example, and the field type of the field data in the example may be changed according to the specific situation.

Therefore, the key field data of both the first field type and the second field type can be extracted by using the named entity recognition model; the difference is that, owing to the complexity of the key field data of the second field type, the extracted result may need to be corrected by using the expression rule corresponding to the second field type. In one case, the key field data of the first field type may be extracted from each text sub-block directly by using the expression rule corresponding to the first field type. For ease of understanding, each extraction mode is described below.

In a specific application, the named entity recognition (NER) model is trained with sample experiences and the key field data labels of each sample experience. Specifically, training the NER model is a supervised process that requires a certain amount of labeled data, for example a sample education experience together with labels for its key field data such as start time, end time, school, specialty and education level. The model is then trained on the sample experiences and their key field data labels according to preset training rules, including a fixed number of training rounds, early-stopping parameters and the like. For example, the fixed number of training rounds may be 10, while in a particular application training may end after fewer than 10 rounds, that is, stop early. Early stopping is used because the model may already fit after 3 to 5 rounds, and continued training may cause overfitting and reduce the accuracy of the model, so training can be stopped in advance. The trained NER model can then extract, from the experience text (that is, each text sub-block), the key field data or the candidate key field data to be corrected.
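The interplay between a fixed number of training rounds and early stopping can be sketched as below. The stopping criterion used here (validation loss failing to improve for `patience` consecutive rounds) is a common convention assumed for illustration; the application does not state which criterion is used. The list of losses stands in for "train one round, then evaluate".

```python
def train_with_early_stopping(val_losses, max_rounds=10, patience=2):
    """Return (rounds actually run, round with the best validation loss).

    val_losses: validation loss observed after each training round.
    max_rounds: the fixed upper bound on training rounds.
    patience: number of consecutive non-improving rounds tolerated.
    """
    best_loss = float("inf")
    best_round = 0
    bad_rounds = 0
    rounds_run = 0
    for round_no, loss in enumerate(val_losses[:max_rounds], start=1):
        rounds_run = round_no
        if loss < best_loss:
            best_loss, best_round, bad_rounds = loss, round_no, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                break  # stop early: further rounds risk overfitting
    return rounds_run, best_round
```

With losses that bottom out in the third round, training stops after the fifth round instead of running all ten, and the model from the third round would be the one to keep.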

For example, extracting the key field data from each text sub-block by using the expression rule corresponding to the first field type may specifically include: extracting the start time and the end time of the experience from each text sub-block by using a first expression rule to obtain time data. The first expression rule may be a rule set according to the expression format of the start and end times of the experience, for example a regular expression set according to the expression format of the time data. A regular expression is a logical formula that operates on character strings: a "rule string" is composed of predefined specific characters and combinations of them, and it expresses a filtering logic for character strings. In other words, a regular expression is a text pattern that describes one or more character strings to be matched when searching text.
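A first expression rule of this kind might look like the following sketch. The accepted date formats (YYYY.MM, YYYY-MM, YYYY/MM), the separators and the keyword "present" are assumptions; an actual rule would be set according to the formats found in the resumes being processed.

```python
import re

# Hypothetical first expression rule: a start date and an end date written as
# YYYY.MM, YYYY-MM or YYYY/MM, joined by "-", "~" or "to"; the end may be "present".
DATE = r"\d{4}[./-]\d{1,2}"
PERIOD = re.compile(
    rf"(?P<start>{DATE})\s*(?:-|~|to)\s*(?P<end>{DATE}|present)",
    re.IGNORECASE,
)


def extract_period(text):
    """Return (start, end) of the first period found in the text, or None."""
    match = PERIOD.search(text)
    if match is None:
        return None
    return match.group("start"), match.group("end")
```

For a sub-block beginning with "2006.09 - 2010.06 Peking University", the function returns the start time "2006.09" and the end time "2010.06" as the time data.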

For example, the expression rule corresponding to the second field type may specifically include at least one of: a preset dictionary tree, a second expression rule, and a third expression rule. Accordingly, the extracting the correction data from each text sub-block by using the expression rule corresponding to the second field type may specifically include the following step:

and respectively extracting correction data from each text sub-block by using a preset dictionary tree, a second expression rule and/or a third expression rule.

The preset dictionary tree may be, for example, an AC-TREE. The AC-TREE guarantees that, for a given text of length n and a set of patterns P = {p1, p2, ..., pm}, all target patterns in the text are found within O(n) time complexity, regardless of the size m of the pattern set. Obtaining the AC-TREE may include: constructing a tree structure of the key field data and marking end nodes, that is, whether a node is the end of a piece of key field data; and constructing, for each node of the tree structure, a failure node, that is, the node to jump to in order to continue matching when matching fails. Applying the AC-TREE includes: traversing the candidate name once and matching each character against the tree structure of the key field data, starting from the current node position. If the match succeeds, jump from the current node to the matching child node; if the node reached is an end node, a piece of key field data has been matched. If the match fails, jump from the current node to its failure node and continue matching, until the match succeeds or the current node is the root node. In this way, key field data with missing characters can be supplemented through the jumps to failure nodes.
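The construction and matching procedure described above corresponds closely to what is commonly known as the Aho-Corasick automaton: a dictionary tree with end-node marks and failure links. The following is a minimal sketch of that general technique, not the applicant's implementation; the function names and the list-based node layout are assumptions. Note that matching is linear in the text length plus the number of matches reported.

```python
from collections import deque


def build_automaton(patterns):
    """Build the tree (goto), the failure links (fail) and the end marks (out)."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)  # mark the end node of this pattern
    # Breadth-first pass: children of the root fail back to the root.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Patterns ending at the failure node also end here.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, automaton):
    """Return (start index, pattern) for every pattern occurrence in the text."""
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]  # jump to the failure node and retry
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


schools = build_automaton(["Peking University", "Tsinghua University"])
matches = find_all("I studied at Peking University", schools)
```

The text is scanned once from left to right; the failure links ensure that no character is re-read when a partial match breaks off.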

The second expression rule may be a rule set according to the expression format of names that are extracted by mistake, for example a regular expression set according to the expression format of course names that contain a specialty name. This avoids extracting a course name that contains a specialty name as the specialty name. The third expression rule may be a rule set according to the expression format of levels that are extracted by mistake, for example a regular expression set according to the expression format of laboratory names that contain an education level. This avoids extracting a laboratory name that contains an education level as the education level. Moreover, correcting the corresponding candidate key field data by using the correction data to obtain the key field data may specifically include: replacing the corresponding candidate key field data with the correction data to obtain the key field data. Through replacement, if the candidate key field data of a key field is empty, the correction data of the key field fills it in; if the candidate key field data is erroneous, for example has missing characters, garbled characters or disordered characters, the correction data replaces the erroneous data. In addition, after the time data, the name data and the level data have been extracted, the data describing the experience can be extracted. Specifically, the experience data is obtained by extracting, from each text sub-block, the text other than the time data, the name data and the level data.
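The correction by replacement can be sketched as below for the specialty name. The course-title pattern, the list of known specialties and both function names are hypothetical; they only illustrate how a second expression rule might keep a specialty name inside a course title from being taken as the specialty, and how the correction data then replaces an erroneous candidate.

```python
import re

# Hypothetical second expression rule: text following these course-title
# prefixes is treated as part of a course name, not as a specialty.
COURSE = re.compile(r"(?:Introduction to|Principles of)\s+[A-Z][A-Za-z ]+")


def correction_for_specialty(text, known_specialties):
    """Return a known specialty that appears outside any course title, or None."""
    without_courses = COURSE.sub(" ", text)
    for name in known_specialties:
        if name in without_courses:
            return name
    return None


def apply_correction(candidate, correction):
    """Replace the candidate with the correction data when correction data exists."""
    return correction if correction is not None else candidate


sub_block = "Specialty: Software Engineering. Courses: Introduction to Computer Science"
known = ["Computer Science", "Software Engineering"]
candidate = "Computer Science"  # assumed erroneous output of the NER model
corrected = apply_correction(candidate, correction_for_specialty(sub_block, known))
```

In this example the candidate taken from the course title is replaced by the specialty found outside it; when no correction data is found, the candidate is left unchanged.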

In this embodiment, different extraction modes are adopted for key fields of different types, so the targeted, differentiated processing further improves extraction accuracy. Moreover, the experience data is unstructured text, and recognizing and extracting it directly is difficult and yields inaccurate results. Therefore, in this embodiment, the experience data is obtained by extracting, from each text sub-block, the text other than the time data, the name data and the level data, which reduces the difficulty of extracting key field data from unstructured text and further improves the convenience and accuracy of resume content extraction.
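Obtaining the experience data as "the text other than the time data, name data and level data" can be sketched with plain string removal. The function name and the sample sub-block are assumptions, and a real implementation would probably remove text by character position rather than by value.

```python
def experience_text(sub_block, extracted_values):
    """Remove the already extracted values; the remainder describes the experience."""
    remainder = sub_block
    # Remove longer values first so that a short value cannot split a longer one.
    for value in sorted(extracted_values, key=len, reverse=True):
        remainder = remainder.replace(value, " ")
    # Collapse the whitespace left behind by the removals.
    return " ".join(remainder.split())


sub_block = (
    "2006.09 - 2010.06 Peking University Computer Science bachelor "
    "Completed project PJ1 in laboratory L1"
)
extracted = ["2006.09 - 2010.06", "Peking University", "Computer Science", "bachelor"]
description = experience_text(sub_block, extracted)
```

Only the free-text description of the experience remains once the structured values have been taken out.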

The resume content extraction method provided by the present application is further described below with reference to fig. 10 as an example of application of the resume content extraction method in the extraction of educational experiences. Fig. 10 shows a processing flow chart of a ninth resume content extraction method provided in an embodiment of the present application, which specifically includes the following steps:

Input: a resume document; OCR and character correction; resume partitioning;

if no target text block is contained, performing a full-text search; output: the education experience extraction result;

if a target text block is contained, performing text splicing, experience sub-block division, and key field extraction; output: the education experience extraction result.

In a specific application, the content of resume documents is often diverse owing to layout, writing habits and the like. For example, some resume documents contain education experience in the form of unstructured text, such as "at school S1 laboratory L1, completed project PJ1 under professor P1, resulting in achievement A1 …". Other resume documents contain education experience only in the form of structured text, such as "date D1, school S1, specialty M1, education level: undergraduate" and "date D2, school S2, specialty M2, education level: master's". Therefore, in order to extract education experience more accurately from content in unstructured text form while also handling resume documents that only contain education experience in structured text form, the resume can be partitioned, so that education experience is extracted in different ways depending on whether the resume document contains a target text block. To partition the resume, the resume text can be extracted through OCR recognition and character correction and then divided into blocks. The OCR and character correction may specifically include: extracting candidate characters from the resume document by using an OCR recognition tool, and correcting the candidate characters to obtain the resume text. The correction may include: deduplicating repeated characters, supplementing missing characters, replacing erroneous characters, and so on.

For the case that the target text block is not contained, the full-text search refers to the extraction of key fields of the whole resume document. And, the key field is a field describing an educational experience. Therefore, the educational experience extraction result can be obtained and output after the full-text search is completed.

For the case where a target text block is contained, text splicing, experience sub-block division and key field extraction can be regarded together as a resume parsing process. The resume parsing process may be implemented by an extraction policy, which may specifically include: the NER model, regular expression templates, and the AC-TREE. Text splicing specifically refers to splicing across line breaks in the target text block: texts that have associated semantics but were broken into several lines because of writing habits or layout are spliced into one line. Experience sub-block division divides the experience represented by the target text block into a plurality of text sub-blocks according to the layout information. Key field extraction refers to extracting the key field data corresponding to the key fields of different field types. The extracted key field data can then be output as the education experience extraction result.
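The two branches of the flow in fig. 10 can be sketched as a small orchestration function. Every callable passed in is a hypothetical placeholder for a component described in the text (partitioning, full-text search, text splicing, sub-block division, key field extraction); only the control flow is illustrated.

```python
def extract_education(resume_text, find_target_block, full_text_search,
                      splice, split_sub_blocks, extract_fields):
    """Choose the extraction path depending on whether a target text block exists."""
    block = find_target_block(resume_text)
    if block is None:
        # No target text block: match key fields against the full resume text.
        return full_text_search(resume_text)
    # Target text block found: splice lines, divide into sub-blocks, extract fields.
    spliced = splice(block)
    return [extract_fields(sub_block) for sub_block in split_sub_blocks(spliced)]


# Toy stand-ins for the real components, used only to exercise both branches.
result_without_block = extract_education(
    "Skills: Python",
    find_target_block=lambda text: None,
    full_text_search=lambda text: [{"source": "full-text"}],
    splice=lambda block: block,
    split_sub_blocks=lambda text: [text],
    extract_fields=lambda sub: {"source": "parsed"},
)
result_with_block = extract_education(
    "Education\n2006.09 - 2010.06 School S1",
    find_target_block=lambda text: text,
    full_text_search=lambda text: [{"source": "full-text"}],
    splice=lambda block: block.replace("\n", " "),
    split_sub_blocks=lambda text: [text],
    extract_fields=lambda sub: {"source": "parsed", "text": sub},
)
```

Passing the components in as arguments keeps the branching logic independent of how each component is implemented.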

In the embodiment, different ways are adopted for extracting the education experiences aiming at different resume documents containing the target text block and the resume documents not containing the target text block, so that the content diversification of the resume documents can be dealt with, and the application range of the embodiment of the application is expanded. And, the extraction accuracy of the resume document containing the target text block can be improved.

In addition, similar to the above application to education experience extraction, the resume content extraction method provided by the embodiment of the application can be applied to the extraction of various types of experience content such as work experience, project experience, internship experience, activity experience and the like; the difference lies in the specific key fields and the extracted content. For the parts that are the same, which are not described in detail here, reference may be made to the above description of the education experience extraction embodiment of fig. 10.

Corresponding to the above method embodiment, the present application further provides an embodiment of a resume content extraction apparatus, and fig. 11 shows a schematic structural diagram of a resume content extraction apparatus provided in an embodiment of the present application. As shown in fig. 11, the apparatus includes:

a document acquisition module 1102 configured to acquire a resume document to be identified;

a text splicing module 1104 configured to perform semantic identification on the resume document, and splice a plurality of lines of texts with associated semantics in the resume document into one line to obtain a spliced document;

a content extraction module 1106 configured to identify key fields from the stitched document, and extract target resume content from the resume document according to the key fields.

In an optional implementation, the text splicing module 1104 is further configured to:

splicing a plurality of lines of texts with associated semantics in the resume document into one line by using a pre-trained text splicing model to obtain a spliced document, wherein the text splicing model is trained with resume sample documents, a resume sample document includes multi-line cut texts obtained by randomly cutting an original text and inserting line breaks, and cut texts belonging to the same line of the original text carry a tag indicating associated semantics.

In an optional implementation, the content extraction module 1106 is further configured to:

identifying key fields from the spliced documents, and determining the layout information of the resume documents according to the key fields;

dividing the resume document into a plurality of text sub-blocks according to the layout information;

and respectively extracting key field data from each text sub-block, and obtaining target resume content according to the key field data.

In an optional implementation, the content extraction module 1106 is further configured to:

extracting the key field data from each text sub-block by using a target extraction mode corresponding to the field type of the key field.

In an alternative embodiment, the field types include: a first field type, and/or a second field type;

accordingly, the content extraction module 1106 is further configured to:

respectively extracting key field data from each text subblock by using a named entity recognition NER model or an expression rule corresponding to the first field type;

and respectively extracting candidate key field data from each text subblock by using a named entity recognition NER model, extracting correction data from each text subblock by using an expression rule corresponding to the second field type, and correcting the corresponding candidate key field data by using the correction data to obtain the key field data.

In an alternative embodiment, the target resume content is content describing an experience; the apparatus further comprises a blocking module configured to:

extracting a resume text from the resume document by using a preset document character extraction tool, and partitioning the resume text to obtain a target text block including description experience;

the text splicing module 1104 is further configured to:

and carrying out semantic recognition on the target text block in the resume document, and splicing a plurality of lines of target texts with associated semantics in the target text block into a line to obtain a spliced document.

In an alternative embodiment, the text splicing module 1104 is further configured to:

and if the target text block is not obtained, identifying the key field from each block of the resume text, and extracting target resume content from the resume document according to the key field.

In an optional implementation, the content extraction module 1106 is further configured to:

determining the similarity between the key fields;

determining target key fields with similarity reaching similar conditions from the key fields;

and removing the duplication of the target key fields, and fusing the key fields except the target key fields in all the key fields to obtain the target resume content.

The above is a schematic scheme of a resume content extraction apparatus of the present embodiment. It should be noted that the technical solution of the resume content extraction apparatus and the technical solution of the resume content extraction method described above belong to the same concept, and details of the technical solution of the resume content extraction apparatus, which are not described in detail, can be referred to the description of the technical solution of the resume content extraction method described above. Further, the components in the device embodiment should be understood as functional blocks that must be created to implement the steps of the program flow or the steps of the method, and each functional block is not actually divided or separately defined. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.

Fig. 12 shows a block diagram of a computing device according to an embodiment of the present application. Components of the computing device 1200 include, but are not limited to, a memory 1210 and a processor 1220. Processor 1220 is coupled to memory 1210 via bus 1230, and database 1250 is used to store data.

The computing device 1200 also includes an access device 1240, the access device 1240 enabling the computing device 1200 to communicate via one or more networks 1260. Examples of such networks include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 1240 may include one or more of any type of network interface (e.g., a Network Interface Controller), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the application, the above components of the computing device 1200 and other components not shown in fig. 12 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device structure shown in FIG. 12 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.

Computing device 1200 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 1200 may also be a mobile or stationary server.

The processor 1220 is configured to execute computer-executable instructions that implement the steps of the resume content extraction method.
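As a rough illustration of what such instructions might look like, the following minimal sketch chains the three steps named in the abstract: acquiring the resume text, splicing semantically associated lines, and extracting content according to key fields. All names here (`extract_resume_content`, `toy_splice`, `toy_find_key_fields`) are hypothetical. The two toy functions are simple stand-ins chosen for illustration, not the semantic recognition or key field identification of the application, and taking the lines between two key fields as the extracted content is likewise an assumption.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces; the application does not define a programming API.
Splicer = Callable[[List[str]], List[str]]
FieldFinder = Callable[[List[str]], Dict[str, int]]


def extract_resume_content(
    lines: List[str],
    splice: Splicer,
    find_key_fields: FieldFinder,
) -> Dict[str, List[str]]:
    """Splice associated lines, locate key fields, then collect the text under each."""
    spliced = splice(lines)                # merge lines that belong together
    fields = find_key_fields(spliced)      # key field name -> line index
    ordered = sorted(fields.items(), key=lambda item: item[1])
    content: Dict[str, List[str]] = {}
    for pos, (name, start) in enumerate(ordered):
        # Assumption: a field's content runs until the next key field.
        end = ordered[pos + 1][1] if pos + 1 < len(ordered) else len(spliced)
        content[name] = spliced[start + 1:end]
    return content


def toy_splice(lines: List[str]) -> List[str]:
    """Stand-in splicer: a line starting in lower case continues the previous line."""
    out: List[str] = []
    for line in lines:
        if out and line[:1].islower():
            out[-1] = out[-1] + " " + line
        else:
            out.append(line)
    return out


def toy_find_key_fields(lines: List[str]) -> Dict[str, int]:
    """Stand-in key field finder: exact match against a small fixed set of headings."""
    known = {"Education", "Work Experience"}
    return {line: i for i, line in enumerate(lines) if line in known}


resume_lines = [
    "Education",
    "B.Sc. in Computer Science,",
    "graduated 2015",
    "Work Experience",
    "Software engineer at",
    "a search company",
]
result = extract_resume_content(resume_lines, toy_splice, toy_find_key_fields)
```

Passing the splicing and key field steps in as functions keeps the sketch neutral about how each step is realized, so a trained model could replace either stand-in without changing the surrounding flow.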

The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the resume content extraction method described above belong to the same concept, and for details that are not described in detail in the technical solution of the computing device, reference may be made to the description of the technical solution of the resume content extraction method described above.

An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the resume content extraction method.

The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the resume content extraction method described above belong to the same concept, and for details that are not described in detail in the technical solution of the storage medium, reference may be made to the description of the technical solution of the resume content extraction method described above.

An embodiment of the present application further provides a chip, in which a computer program is stored, and the computer program implements the steps of the resume content extraction method when executed by the chip.

The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc.

It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the present application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. The alternative embodiments do not describe every detail exhaustively, nor do they limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the teaching of this application. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, and thereby to enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims

1. A resume content extraction method is characterized by comprising the following steps:

acquiring a resume document to be identified;

2. The method according to claim 1, wherein the performing semantic recognition on the resume document and splicing a plurality of lines of text with associated semantics in the resume document into one line to obtain a spliced document comprises:

splicing the plurality of lines of text with associated semantics in the resume document into one line by using a pre-trained text splicing model to obtain the spliced document, wherein the text splicing model is trained using a resume sample document, the resume sample document comprises a plurality of lines of cut text obtained by randomly cutting an original text and inserting line breaks, and the lines of cut text belonging to the same line of the original text have tags representing associated semantics.
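The construction of such a resume sample document can be sketched as follows. This is a minimal illustration under stated assumptions, not the training procedure of the application: the function names (`cut_line`, `build_sample`, `rejoin`), the `max_cuts` parameter, and the binary tag scheme (0 for the first piece of an original line, 1 for a piece that continues the previous one) are all hypothetical choices made for the example.

```python
import random
from typing import List, Tuple


def cut_line(text: str, rng: random.Random, max_cuts: int = 2) -> List[str]:
    """Cut one original line at random positions, simulating forced line breaks."""
    if len(text) < 2:
        return [text]  # too short to cut
    n_cuts = rng.randint(1, min(max_cuts, len(text) - 1))
    points = sorted(rng.sample(range(1, len(text)), n_cuts))
    bounds = [0] + points + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def build_sample(original_lines: List[str], seed: int = 0) -> List[Tuple[str, int]]:
    """Return (cut_text, tag) pairs for every piece of every original line.

    Assumed tag scheme: 0 starts a new original line, 1 continues the previous piece.
    """
    rng = random.Random(seed)
    sample: List[Tuple[str, int]] = []
    for line in original_lines:
        for i, piece in enumerate(cut_line(line, rng)):
            sample.append((piece, 0 if i == 0 else 1))
    return sample


def rejoin(sample: List[Tuple[str, int]]) -> List[str]:
    """Splice tagged pieces back together, showing that the tags recover the original lines."""
    lines: List[str] = []
    for piece, tag in sample:
        if tag == 1 and lines:
            lines[-1] += piece
        else:
            lines.append(piece)
    return lines


original = [
    "Worked at Company A from 2015 to 2018",
    "Responsible for backend development",
]
sample = build_sample(original, seed=42)
```

Because the cut positions are generated rather than annotated by hand, the tags come for free, and `rejoin(sample)` returning the original lines is a quick check that the labels are consistent.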

3. The method of claim 1, wherein the identifying key fields from the spliced document, and extracting target resume content from the resume document according to the key fields comprises:

4. The method of claim 3, wherein the extracting of the key field data from each text sub-block respectively comprises:

5. The method of claim 4, wherein the field type comprises: a first field type, and/or a second field type;

the extracting the key field data from each text sub-block by using the target extracting mode corresponding to the field type of the key field comprises the following steps:

6. The method according to any one of claims 1 to 5, wherein the target resume content is content describing an experience;

before the semantic recognition is performed on the resume document, and a plurality of lines of target texts with associated semantics in the resume document are spliced into a line to obtain a spliced document, the method further includes:

the semantic recognition of the resume document and the splicing of multiple lines of target texts with associated semantics in the resume document into one line to obtain the spliced document comprise:

7. The method of claim 6, wherein after the chunking the resume text, the method further comprises:

8. The method according to any one of claims 1 to 5, wherein the extracting target resume content from the resume document according to the key field comprises:

determining the similarity between the key fields;

9. A resume content extraction apparatus, comprising:

10. A computing device, comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the resume content extraction method of any one of claims 1 to 8.

11. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the resume content extraction method of any of claims 1 to 8.