CN116521856A - Resume summary generation method and device, electronic device and storage medium - Google Patents

Resume summary generation method and device, electronic device and storage medium

Info

Publication number
CN116521856A
Authority
CN
China
Prior art keywords
resume
summary
target
word segmentation
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310282784.0A
Other languages
Chinese (zh)
Inventor
李羊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
ICBC Technology Co Ltd
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
ICBC Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC, ICBC Technology Co Ltd filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310282784.0A priority Critical patent/CN116521856A/en
Publication of CN116521856A publication Critical patent/CN116521856A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/10 Office automation; Time management
    • G06Q 10/105 Human resources
    • G06Q 10/1053 Employment or hiring
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a resume summary generation method and device, an electronic device and a storage medium, and relates to the technical field of artificial intelligence and related fields. The resume summary generation method comprises the following steps: receiving a target resume file, and preprocessing the target resume file to obtain resume sequence information; inputting the target resume file and the resume sequence information into a target language model, and receiving the resume summary text of the target resume file output by the target language model. The invention solves the technical problem in the related art that resume summaries generated from word frequency statistics alone are inaccurate.

Description

Resume summary generation method and device, electronic device and storage medium
Technical Field
The invention relates to the field of artificial intelligence and other related technical fields, and in particular to a resume summary generation method and device, an electronic device and a storage medium.
Background
As workforce mobility increases, companies and organizations of all kinds need to recruit staff for different positions, and similar text-processing needs arise in government-affairs scenarios. During recruitment, a common practice is to receive resumes through a designated recruitment platform, screen the candidates, and then select the staff to be hired. At present, resumes are mostly screened by browsing them manually, which takes a great deal of time and makes resume screening inefficient.
To improve resume screening efficiency, current recruitment platforms and the recruitment client software used by companies preprocess resumes to generate resume summaries, so that users can quickly review the key information of candidates they may want to recruit.
However, resume summary generation methods in the related art have several problems:
First, traditional summary generation relies on word-frequency statistics and has no capability to summarize sentence meaning. When a resume summary is generated, each sentence in the text is generally treated as a node, and if two sentences are similar, an undirected weighted edge is considered to exist between their corresponding nodes. The drawback is that this approach depends heavily on the training data and only performs statistics on word frequency, with no semantic information, so the generated resume summary is inaccurate and of low quality.
Second, reading a resume in full takes considerable time, and when the number of resumes is huge, the total time becomes very long.
In view of the above problems, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the invention provide a resume summary generation method and device, an electronic device and a storage medium, which at least solve the technical problem in the related art that resume summaries generated from word frequency statistics alone are inaccurate.
According to one aspect of the embodiments of the invention, a resume summary generation method is provided, comprising: receiving a target resume file, and preprocessing the target resume file to obtain resume sequence information; inputting the target resume file and the resume sequence information into a target language model, and receiving the resume summary text associated with the target resume file output by the target language model.
Optionally, the step of preprocessing the target resume file to obtain resume sequence information includes: segmenting the target resume file into paragraphs to obtain a paragraph set; tokenizing each resume paragraph in the paragraph set; and serializing all resume token information obtained from tokenization to obtain the resume sequence information.
Optionally, the target language model is trained in advance, and training the target language model includes: acquiring an initial resume set, wherein the initial resume set comprises N historical resume texts; preprocessing each historical resume text in the initial resume set to obtain a resume token set corresponding to each historical resume text; masking part of the resume tokens in the resume token set to obtain a masked resume token set; inputting the historical resume texts and the corresponding masked resume token sets into an initial pre-training model, which outputs historical summary information in a specified format; adjusting the historical summary information output by the initial pre-training model using the historical resume texts and historical resume summary results extracted in the past, and iteratively training the initial pre-training model; and, when the similarity between the historical summary information output by the initial pre-training model and the historical resume summary results is greater than a preset similarity threshold, confirming that model training is finished and obtaining the target language model.
Optionally, the step of preprocessing each historical resume text in the initial resume set to obtain the resume token set corresponding to each historical resume text includes: tokenizing the historical resume text with a preset tokenizer; and converting the initial token sequence obtained from tokenization into a symbol format usable by the initial pre-training model, obtaining the resume token set corresponding to each historical resume text.
Optionally, before the initial resume set is acquired, the method further includes: obtaining a tokenization vocabulary containing all sentence tokens; and training the initial tokenizer with the tokenization vocabulary and a plurality of resume formats.
Optionally, before the initial resume set is acquired, the method further includes: acquiring the natural language processing model BERT, and taking the natural language processing model BERT as the initial pre-training model; configuring the network parameters and network structures of M networks in the initial pre-training model, wherein the M networks comprise at least a backbone network and a language processing network, and M is a positive integer greater than 1; and, for each layer in the backbone network, controlling the attention scope of each sentence token through a mask matrix.
Optionally, after the target resume file and the resume sequence information are input into the target language model, the method further comprises: acquiring, by the target language model, the joint representation word vector, the position vector and the associated text-segment information of each resume token, wherein the joint representation word vector indicates the joint word vector contextually associated with the resume token; inputting the joint representation word vector, the position vector and the text-segment information into T language processing networks in the target language model, wherein T is smaller than M and T is a positive integer greater than or equal to 1; and performing, by the language processing networks using a self-attention mechanism, sentence-association analysis on each resume token in the resume sequence information to determine the text token vector corresponding to the resume token, wherein the text token vector is used to determine the resume summary information associated with specified resume attributes in the target resume file, obtaining the resume summary text.
According to another aspect of the embodiments of the invention, a resume summary generation device is also provided, comprising: a receiving unit for receiving a target resume file and preprocessing the target resume file to obtain resume sequence information; and a generation unit for inputting the target resume file and the resume sequence information into a target language model and receiving the resume summary text associated with the target resume file output by the target language model.
Optionally, the receiving unit includes: a paragraph segmentation module for segmenting the target resume file into paragraphs to obtain a paragraph set; a first tokenization module for tokenizing each resume paragraph in the paragraph set; and a serialization module for serializing all resume token information obtained from tokenization to obtain the resume sequence information.
Optionally, the target language model is trained in advance, and training the target language model includes: acquiring an initial resume set, wherein the initial resume set comprises N historical resume texts; preprocessing each historical resume text in the initial resume set to obtain a resume token set corresponding to each historical resume text; masking part of the resume tokens in the resume token set to obtain a masked resume token set; inputting the historical resume texts and the corresponding masked resume token sets into an initial pre-training model, which outputs historical summary information in a specified format; adjusting the historical summary information output by the initial pre-training model using the historical resume texts and historical resume summary results extracted in the past, and iteratively training the initial pre-training model; and, when the similarity between the historical summary information output by the initial pre-training model and the historical resume summary results is greater than a preset similarity threshold, confirming that model training is finished and obtaining the target language model.
Optionally, for preprocessing each historical resume text in the initial resume set to obtain the corresponding resume token set, the resume summary generation device includes: a second tokenization module for tokenizing the historical resume text with a preset tokenizer; and a format conversion module for converting the initial token sequence obtained from tokenization into a symbol format usable by the initial pre-training model, obtaining the resume token set corresponding to each historical resume text.
Optionally, the resume summary generation device further comprises: a first acquisition module for obtaining, before the initial resume set is acquired, a tokenization vocabulary containing all sentence tokens; and a training module for training the initial tokenizer with the tokenization vocabulary and a plurality of resume formats.
Optionally, the resume summary generation device further comprises: a second acquisition module for acquiring the natural language processing model BERT before the initial resume set is acquired and taking it as the initial pre-training model; a first configuration module for configuring the network parameters and network structures of M networks in the initial pre-training model, wherein the M networks comprise at least a backbone network and a language processing network, and M is a positive integer greater than 1; and a control module for controlling, for each layer in the backbone network, the attention scope of each sentence token through a mask matrix.
Optionally, the resume summary generation device further comprises: a third acquisition module for acquiring, by the target language model after the target resume file and the resume sequence information are input into the target language model, the joint representation word vector, the position vector and the associated text-segment information of each resume token, wherein the joint representation word vector indicates the joint word vector contextually associated with the resume token; an input module for inputting the joint representation word vector, the position vector and the text-segment information into T language processing networks in the target language model, wherein T is smaller than M and T is a positive integer greater than or equal to 1; and an analysis module for performing, by the language processing networks using a self-attention mechanism, sentence-association analysis on each resume token in the resume sequence information to determine the text token vector corresponding to the resume token, wherein the text token vector is used to determine the resume summary information associated with specified resume attributes in the target resume file, obtaining the resume summary text.
According to another aspect of the embodiments of the invention, a computer-readable storage medium is also provided, comprising a stored computer program, wherein when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute any one of the foregoing resume summary generation methods.
According to another aspect of the embodiments of the invention, an electronic device is also provided, comprising one or more processors and a memory for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement any one of the foregoing resume summary generation methods.
In the invention, when a resume summary is generated, a target resume file is received and preprocessed to obtain resume sequence information, the target resume file and the resume sequence information are input into a target language model, and the resume summary text associated with the target resume file is received from the target language model.
In the invention, after the resume file is preprocessed, resume summary generation can be performed on the resume file and the resume sequence by the pre-trained target language model; because the target language model has been trained on massive data, the accuracy of resume summary information extraction is improved, thereby solving the technical problem in the related art that resume summaries generated from word frequency statistics alone are inaccurate.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limiting it. In the drawings:
FIG. 1 is a flowchart of an alternative resume summary generation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative resume summary generation device according to an embodiment of the present invention;
FIG. 3 is a block diagram of the hardware structure of an electronic device (or mobile device) for the resume summary generation method according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above figures are used to distinguish similar objects and not necessarily to describe a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article or apparatus comprising a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
To facilitate understanding of the invention by those skilled in the art, some terms or nouns involved in the various embodiments of the invention are explained below:
Question generation (Question Generation, QG for short): generating key information, such as questions, from an article, where the key information can be obtained from the article itself.
Word embedding (Word Embedding): in natural language processing, token embedding is realized through a language model; a high-dimensional sparse space whose dimension equals the vocabulary size can be embedded into a low-dimensional dense vector space.
UNILM (Unified Language Model Pre-training for Natural Language Understanding and Generation): a unified pre-trained language model that can perform unidirectional, sequence-to-sequence and bidirectional tasks.
Natural language processing (Natural Language Processing, NLP for short): realizes intelligent text analysis, speech recognition and natural language generation.
Pre-trained model: a model that learns text representations on a massive corpus dataset; text summary tokens can then be predicted from the contextual text representation. Deep learning places high demands on data, particularly the quantity of labeled data; a pre-trained model can apply its strong representation capability to a variety of tasks, alleviating the lack of large amounts of labeled data in certain tasks.
It should be noted that the resume summary generation method and device of the present disclosure may be used in the field of artificial intelligence when a resume summary is generated, and may also be used in any field other than artificial intelligence; the application fields of the resume summary generation method and device of the present disclosure are not limited.
It should be noted that the relevant information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present disclosure are information and data authorized by the user or fully authorized by all parties. The collection, use and processing of such data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to choose to authorize or refuse. For example, an interface is provided between the system and the relevant user or institution: before acquiring the relevant information, the system sends an acquisition request to the user or institution through the interface, and acquires the relevant information after receiving the consent information fed back by the user or institution.
Text summarization can be classified in various ways according to different criteria; for example, according to whether the data are labeled, it can be divided into supervised and unsupervised approaches. The supervised approach requires labeled training data, that is, each reference text has a corresponding reference summary, and the model learns the mapping between reference text and reference summary. The unsupervised approach needs no labeled training data at all: it only searches the reference text for the parts that best represent its main information and uses them as the generated summary information.
In the resume summary generation process, labeled data is lacking, so an unsupervised algorithm is chosen to reduce construction cost. Conventional summary generation mainly uses word-frequency-based statistical algorithms such as TextRank, whose idea derives from the page ranking algorithm PageRank; it is a graph-based ranking algorithm for text. The text is divided into constituent units (sentences) and a node-connection graph is constructed, the similarity between sentences serves as the weight of the edges, the TextRank value of each sentence is computed by loop iteration, and finally the highest-ranked sentences are extracted and combined into the text summary, as in the sketch below. In automatic summarization, TextRank treats each sentence in the text as a node; if two sentences are similar, an undirected weighted edge exists between the nodes corresponding to the two sentences.
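As a concrete illustration of the algorithm just described, the following is a minimal TextRank sketch in Python. The word-overlap similarity is the measure from the original TextRank paper; the example sentences, damping factor and iteration count are illustrative, and production systems often substitute TF-IDF or embedding-based similarity.

```python
import math

def sentence_similarity(s1, s2):
    # Overlap similarity: shared words normalized by sentence lengths.
    w1, w2 = set(s1.split()), set(s2.split())
    if len(w1) <= 1 or len(w2) <= 1:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

def textrank(sentences, d=0.85, iters=50):
    # Each sentence is a node; similarities are undirected edge weights.
    n = len(sentences)
    w = [[sentence_similarity(a, b) for b in sentences] for a in sentences]
    scores = [1.0] * n
    for _ in range(iters):  # loop iteration toward convergence
        scores = [
            1 - d + d * sum(
                w[j][i] / (sum(w[j]) - w[j][j]) * scores[j]
                for j in range(n)
                if j != i and (sum(w[j]) - w[j][j]) > 0)
            for i in range(n)
        ]
    return scores

sents = ["the cat sat on the mat",
         "a dog sat on the mat",
         "stocks rallied sharply today"]
ranked = sorted(zip(textrank(sents), sents), reverse=True)
print(ranked[0][1])  # highest-ranked sentence forms the extractive summary
```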
However, this summary generation approach depends heavily on the training data, performs statistics only on word frequency and carries no semantic information, so the accuracy of the generated resume summary is low.
Therefore, a target language model is introduced. The target language model is a pre-trained model, and introducing the strong natural language understanding capability of pre-training achieves good results on summary generation tasks for resume text; across a variety of tasks, a pre-trained model combined with a downstream task generally performs better than conventional summary generation. Here this approach is applied to a Chinese vertical domain, where industrial research on resume summary generation is still scarce. Moreover, pre-trained models have achieved the best results on many NLP tasks in recent years; the representation capability of a model trained on massive data far exceeds that of traditional word vectors, so applying a pre-trained model to Chinese summary generation improves summary accuracy.
The embodiments of the invention can be applied to various recruitment platforms, government-affairs question generation platforms, systems, applications and devices that generate summaries for resume text, and in particular to recruitment platform servers or software that needs to use company resumes. In the invention, the NLG pre-trained model UNILM performs tasks such as recruitment resume summary generation and government-affairs question generation. Since there is no open-source Chinese UNILM model, an open-source Chinese BERT model is selected for initialization, and training and fine-tuning on recruitment resume summaries and self-built government-affairs data strengthen the model's representation of domain knowledge. In addition, summaries can be generated automatically in batches by applying an end-to-end structure.
The present invention will be described in detail with reference to the following examples.
Example 1
According to an embodiment of the present invention, an embodiment of a resume summary generation method is provided. It should be noted that the steps shown in the flowcharts of the figures may be executed in a computer system such as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that given here.
Fig. 1 is a flowchart of an alternative resume summary generation method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S101: receive a target resume file, and preprocess the target resume file to obtain resume sequence information;
Step S102: input the target resume file and the resume sequence information into the target language model, and receive the resume summary text associated with the target resume file output by the target language model.
Through the above steps, when a resume summary is generated, the target resume file is received and preprocessed to obtain resume sequence information; the target resume file and the resume sequence information are input into the target language model, and the resume summary text associated with the target resume file is received from the target language model. In this embodiment, after the resume file is preprocessed, resume summary generation can be performed on the resume file and the resume sequence by the pre-trained target language model; because the target language model has been trained on massive data, the accuracy of resume summary extraction is improved, thereby solving the technical problem in the related art that resume summaries generated from word frequency statistics alone are inaccurate.
Embodiments of the present invention will be described in detail with reference to the following steps.
The target language model used in embodiments of the invention may include, but is not limited to, the UNILM model, which contains a deep Transformer network. The pre-training process uses three unsupervised language-model objectives: a bidirectional language model, a unidirectional language model, and a sequence-to-sequence language model (sequence-to-sequence LM). The model uses a Transformer network with shared parameters, and also uses specific self-attention mask matrices (self-attention masks) to control the context information used in prediction.
When fine-tuning for downstream tasks, the UNILM model can be treated as a unidirectional encoder, a bidirectional encoder or a sequence-to-sequence model to adapt to different downstream tasks (natural language understanding and resume summary generation). In the question generation task, a sequence-to-sequence structure is used: the input is the text and the span of the question within the text, and the output is the answer.
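The mask matrices that let one shared Transformer serve all three objectives can be sketched as follows. This is an illustrative reconstruction, not UNILM reference code: 0 marks a position a token may attend to, and a large negative value is added to the attention logits to block attention.

```python
import numpy as np

NEG = -1e9  # added to attention logits to forbid attention

def bidirectional_mask(n):
    # Every token attends to every token (BERT-style cloze objective).
    return np.zeros((n, n))

def unidirectional_mask(n):
    # Token i attends only to tokens 0..i (left-to-right LM).
    return np.triu(np.full((n, n), NEG), k=1)

def seq2seq_mask(src_len, tgt_len):
    # Source tokens see the whole source; target tokens see the source
    # plus their own prefix, which is what summary generation uses.
    n = src_len + tgt_len
    m = np.full((n, n), NEG)
    m[:, :src_len] = 0.0
    for i in range(src_len, n):
        m[i, src_len:i + 1] = 0.0
    return m

print(seq2seq_mask(3, 2))
```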
The target language model must be trained before it is used in real time. Optionally, the target language model is trained in advance, and training it includes the following steps: acquiring an initial resume set, wherein the initial resume set comprises N historical resume texts; preprocessing each historical resume text in the initial resume set to obtain a resume token set corresponding to each historical resume text; masking part of the resume tokens in the resume token set to obtain a masked resume token set; inputting the historical resume texts and the corresponding masked resume token sets into an initial pre-training model, which outputs historical summary information in a specified format; adjusting the historical summary information output by the initial pre-training model using the historical resume texts and historical resume summary results extracted in the past, and iteratively training the initial pre-training model; and, when the similarity between the historical summary information output by the initial pre-training model and the historical resume summary results is greater than a preset similarity threshold, confirming that model training is finished and obtaining the target language model.
It should be noted that, when the initial resume set is acquired, historical resume texts can be downloaded from various open-source Chinese pre-training platforms; for example, the parameters of an open-source Chinese BERT pre-trained model can be downloaded as initialization parameters together with historical resume texts, the initialization parameters having strong representation capability and a consistent model structure.
After the initial resume set is obtained, data preprocessing is performed: a tokenizer (for example, a tokenization toolkit) tokenizes the input resume information, each token is converted into the corresponding token ID required by the initial pre-training model, and the token IDs are then converted into the input format required by the model. The pre-training tasks are the same as the model's own training tasks, with unidirectional, bidirectional and sequence-to-sequence objectives respectively.
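A sketch of this preprocessing step using the HuggingFace tokenizer API; the patent does not name a specific toolkit, and the bert-base-chinese checkpoint is only an example:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

resume_text = "2015-2019 就职于某银行，负责信贷系统开发。"
tokens = tokenizer.tokenize(resume_text)             # text -> tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # tokens -> vocabulary IDs

# Pack into the input format the model consumes ([CLS] ... [SEP], padding,
# attention mask), i.e. the "input format required by the model" above.
inputs = tokenizer(resume_text, return_tensors="pt",
                   max_length=64, padding="max_length", truncation=True)
print(tokens[:8], token_ids[:8])
print(inputs["input_ids"].shape)  # torch.Size([1, 64])
```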
During iterative training of the model, the historical summary information output by the initial pre-training model is adjusted. For example, open-source question-answer data (such as the DuReader dataset and an open-source corpus of encyclopedia document summaries) are used for parameter adjustment, and the model is fine-tuned on document (D) and summary (S) pairs, improving summary generation accuracy.
The adjustment process is similar to pre-training in that a self-attention mask is used, for example in sequence-to-sequence mode. The structure is "[SOS] S1 [EOS] S2 [EOS]", where S1 and S2 denote the source sequence and the target sequence respectively, and the input "[SOS] S1 [EOS] S2 [EOS]" is constructed. During fine-tuning, a certain proportion of the tokens in the target sequence are randomly masked so that the model learns to recover the masked words; the training objective is to maximize the likelihood of the masked tokens given their context. Model parameters are trained by randomly masking words in the question and letting the model learn to recover them.
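The following sketch makes the fine-tuning input concrete. Token strings stand in for vocabulary IDs, and the 15% mask probability is an assumed hyperparameter, not one stated here:

```python
import random

def build_seq2seq_example(src_tokens, tgt_tokens, mask_prob=0.15):
    # Build "[SOS] S1 [EOS] S2 [EOS]" and randomly mask target tokens.
    seq = ["[SOS]"] + src_tokens + ["[EOS]"] + tgt_tokens + ["[EOS]"]
    tgt_start = len(src_tokens) + 2    # index of the first S2 token
    labels = [None] * len(seq)         # None = no loss at this position
    for i in range(tgt_start, len(seq) - 1):
        if random.random() < mask_prob:
            labels[i] = seq[i]         # original token the model must recover
            seq[i] = "[MASK]"
    return seq, labels

src = ["五", "年", "银", "行", "开", "发", "经", "验"]  # source sequence S1
tgt = ["五", "年", "开", "发", "经", "验"]              # target summary S2
seq, labels = build_seq2seq_example(src, tgt)
print(seq)
print(labels)  # training maximizes the likelihood of these masked tokens
```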
In the embodiments of the invention, the end-to-end sequence-to-sequence structure is applied, and resume summary information can be generated automatically in batches.
During model pre-training, part of the words can be randomly masked (denoted by [MASK]); that is, some tokens are randomly covered, and the target task is to restore those words. The input X is a sequence of text fragments, and the sequence representation is the same as BERT's, comprising word vectors, position encodings and segment encodings (segment embeddings); with this representation the model can be trained in unidirectional, bidirectional and sequence-to-sequence modes.
When part of the input words is masked, the masking operation randomly replaces tokens with a predefined [MASK] token. The input is then passed through a Transformer network to compute the output vectors, and the output vectors are fed into a softmax classifier to predict the masked words.
The parameters of the UNILM model are learned by minimizing the cross-entropy loss between the predicted tokens and the original true tokens.
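A minimal PyTorch sketch of this objective, with toy sizes in place of the real model dimensions: the three embeddings are summed, a Transformer computes contextual vectors, a linear layer scores the vocabulary (the softmax is implicit in the cross-entropy), and the loss is taken only at the masked positions.

```python
import torch
import torch.nn as nn

vocab, hidden, max_len = 1000, 64, 32          # toy sizes, for illustration
word_emb = nn.Embedding(vocab, hidden)
pos_emb = nn.Embedding(max_len, hidden)
seg_emb = nn.Embedding(2, hidden)              # segment embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
    num_layers=2)
to_vocab = nn.Linear(hidden, vocab)

ids = torch.randint(0, vocab, (1, max_len))    # token IDs (random toy data)
pos = torch.arange(max_len).unsqueeze(0)
seg = torch.zeros(1, max_len, dtype=torch.long)

x = word_emb(ids) + pos_emb(pos) + seg_emb(seg)  # word + position + segment
h = encoder(x)                                   # contextual token vectors
logits = to_vocab(h)                             # scores over the vocabulary

masked = torch.tensor([3, 7])                    # positions replaced by [MASK]
loss = nn.functional.cross_entropy(logits[0, masked], ids[0, masked])
loss.backward()                                  # minimize cross-entropy loss
```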
The embodiments of the invention adopt the NLG question generation model UNILM for the resume summary generation task, so that the semantic information in the document is summarized in the generated summary.
Optionally, the step of preprocessing each historical resume text in the initial resume set to obtain the resume token set corresponding to each historical resume text includes: tokenizing the historical resume text with a preset tokenizer; and converting the initial token sequence obtained from tokenization into a symbol format usable by the initial pre-training model, obtaining the resume token set corresponding to each historical resume text.
Optionally, before the initial resume set is acquired, the method further includes: obtaining a tokenization vocabulary containing all sentence tokens; and training the initial tokenizer with the tokenization vocabulary and a plurality of resume formats.
Optionally, before the initial resume set is acquired, the method further includes: acquiring the natural language processing model BERT, and taking the natural language processing model BERT as the initial pre-training model; configuring the network parameters and network structures of M networks in the initial pre-training model, wherein the M networks comprise at least a backbone network and a language processing network, and M is a positive integer greater than 1; and, for each layer in the backbone network, controlling the attention scope of each sentence token through a mask matrix.
The number of layers in the backbone network is self-defined; for example, the backbone network consists of 24 Transformer layers, and the output of each layer is the input of the next. Each layer controls the attention scope of each sentence token through the mask matrix, which ensures joint training of multiple training objectives. Given the input sequence x = x1 ... xn (where x1 is a word), each sentence token obtains a vector representation with contextual information through the multi-layer Transformer network.
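The layer wiring described above can be sketched as follows; the hidden size and head count are illustrative, and only the shape of the computation matters: every layer consumes the previous layer's output and applies the same attention mask matrix.

```python
import torch
import torch.nn as nn

hidden, n_layers, seq_len = 64, 24, 16     # 24 Transformer layers
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
    for _ in range(n_layers))

x = torch.randn(1, seq_len, hidden)        # input sequence x1 ... xn
attn_mask = torch.zeros(seq_len, seq_len)  # 0 = attend; -inf entries would block
for layer in layers:
    x = layer(x, src_mask=attn_mask)       # each output feeds the next layer
# x now holds a contextual vector representation for every token
```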
The embodiments of the invention select an open-source Chinese BERT model for initialization, and training and fine-tuning on open-source Chinese datasets strengthen the model's capability to generate resume summary information.
Step S101: receive a target resume file, and preprocess the target resume file to obtain resume sequence information.
Optionally, the step of preprocessing the target resume file to obtain resume sequence information includes: segmenting the target resume file into paragraphs to obtain a paragraph set; tokenizing each resume paragraph in the paragraph set; and serializing all resume token information obtained from tokenization to obtain the resume sequence information.
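A hedged sketch of this preprocessing: the paragraph-splitting rule, the whitespace tokenizer and the [SEP] separator are all placeholders for whatever the deployed system actually uses.

```python
def preprocess_resume(resume_text):
    # Paragraph segmentation -> per-paragraph tokenization -> serialization.
    paragraphs = [p.strip() for p in resume_text.split("\n\n") if p.strip()]
    segmented = [p.split() for p in paragraphs]  # stand-in for a real tokenizer
    sequence = []
    for para_tokens in segmented:
        sequence.extend(para_tokens)
        sequence.append("[SEP]")                 # assumed paragraph separator
    return paragraphs, sequence

paras, seq = preprocess_resume(
    "Education: BSc, 2015-2019\n\nWork: credit-system developer at a bank")
print(paras)  # the paragraph set
print(seq)    # the resume sequence information fed to the model
```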
For example, when generating the summary of a recruitment resume, the generation process may use the model for decoding: the resume information is preprocessed, information is extracted, paragraphs are segmented and tokenized, the paragraph and question sections are input and encoded by the encoder, and decoding then generates the related fragments.
Step S102: input the target resume file and the resume sequence information into the target language model, and receive the resume summary text associated with the target resume file output by the target language model.
Optionally, after the target resume file and the resume sequence information are input into the target language model, the method further comprises: acquiring, by the target language model, the joint representation word vector, the position vector and the associated text-segment information of each resume token, wherein the joint representation word vector indicates the joint word vector contextually associated with the resume token; inputting the joint representation word vector, the position vector and the text-segment information into T language processing networks in the target language model, wherein T is smaller than M and T is a positive integer greater than or equal to 1; and performing, by the language processing networks using a self-attention mechanism, sentence-association analysis on each resume token in the resume sequence information to determine the text token vector corresponding to the resume token, wherein the text token vector is used to determine the resume summary information associated with specified resume attributes in the target resume file, obtaining the resume summary text.
For example, the joint representation word vector, position vector and text-segment information of the token sequence are input into the multi-layer Transformer network, which computes the representation of the text over the whole input sequence using the self-attention mechanism. The preprocessed resume content is encoded by the UNILM model, the model computation is performed, and decoding generates the resume summary information, obtaining the resume summary text.
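Putting step S102 together, a greedy decoding loop might look like the sketch below. Here `model` is assumed to be the fine-tuned UNILM-style network exposing HuggingFace-style `.logits`, and `[SEP]` is assumed to serve as the stop token; beam search is a common alternative to greedy decoding.

```python
import torch

def generate_summary(model, tokenizer, resume_ids, max_new_tokens=64):
    # Source sequence first, then generated summary tokens one at a time.
    ids = resume_ids + [tokenizer.sep_token_id]
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids])).logits  # seq2seq mask inside model
        next_id = int(logits[0, -1].argmax())       # greedy: most likely token
        if next_id == tokenizer.sep_token_id:       # stop token ends the summary
            break
        ids.append(next_id)
    return tokenizer.decode(ids[len(resume_ids) + 1:])
```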
The specified resume attributes mentioned above may refer to attributes associated with resume demand information, such as education background, degree, age, graduation school, major, working years and salary range. Extracting the specified resume attributes to determine the resume summary text facilitates the subsequent display of resume summary information, which is convenient for users such as clients and recruiters and improves client satisfaction.
Through this embodiment, summary information can be generated effectively and automatically from resume text, key information can be extracted from a large number of resumes, and the time and cost of subsequent resume screening are saved.
The following describes in detail another embodiment.
Example 2
The resume summary generation device provided in this embodiment includes a plurality of implementation units, each corresponding to an implementation step in the first embodiment.
Fig. 2 is a schematic diagram of an alternative resume summary generation device according to an embodiment of the present invention. As shown in Fig. 2, the device may include a receiving unit 20 and a generation unit 21, wherein:
the receiving unit 20 is configured to receive the target resume file and preprocess the target resume file to obtain resume sequence information;
the generation unit 21 is configured to input the target resume file and the resume sequence information into the target language model, and receive the resume summary text associated with the target resume file output by the target language model.
In this resume summary generation device, when a resume summary is generated, the receiving unit 20 receives the target resume file and preprocesses it to obtain resume sequence information, and the generation unit 21 inputs the target resume file and the resume sequence information into the target language model and receives the resume summary text associated with the target resume file output by the target language model. In this embodiment, after the resume file is preprocessed, resume summary generation can be performed on the resume file and the resume sequence by the pre-trained target language model; because the target language model has been trained on massive data, summary extraction is more accurate, thereby solving the technical problem in the related art that resume summaries generated from word frequency statistics alone are inaccurate.
Optionally, the receiving unit includes: a paragraph segmentation module for segmenting the target resume file into paragraphs to obtain a paragraph set; a first tokenization module for tokenizing each resume paragraph in the paragraph set; and a serialization module for serializing all resume token information obtained from tokenization to obtain the resume sequence information.
Optionally, the target language model is trained in advance, and training the target language model includes: acquiring an initial resume set, wherein the initial resume set comprises N historical resume texts; preprocessing each historical resume text in the initial resume set to obtain a resume token set corresponding to each historical resume text; masking part of the resume tokens in the resume token set to obtain a masked resume token set; inputting the historical resume texts and the corresponding masked resume token sets into an initial pre-training model, which outputs historical summary information in a specified format; adjusting the historical summary information output by the initial pre-training model using the historical resume texts and historical resume summary results extracted in the past, and iteratively training the initial pre-training model; and, when the similarity between the historical summary information output by the initial pre-training model and the historical resume summary results is greater than a preset similarity threshold, confirming that model training is finished and obtaining the target language model.
Optionally, for preprocessing each historical resume text in the initial resume set to obtain the corresponding resume token set, the resume summary generation device includes: a second tokenization module for tokenizing the historical resume text with a preset tokenizer; and a format conversion module for converting the initial token sequence obtained from tokenization into a symbol format usable by the initial pre-training model, obtaining the resume token set corresponding to each historical resume text.
Optionally, the resume summary generation device further comprises: a first acquisition module for obtaining, before the initial resume set is acquired, a tokenization vocabulary containing all sentence tokens; and a training module for training the initial tokenizer with the tokenization vocabulary and a plurality of resume formats.
Optionally, the resume summary generation device further comprises: a second acquisition module for acquiring the natural language processing model BERT before the initial resume set is acquired and taking it as the initial pre-training model; a first configuration module for configuring the network parameters and network structures of M networks in the initial pre-training model, wherein the M networks comprise at least a backbone network and a language processing network, and M is a positive integer greater than 1; and a control module for controlling, for each layer in the backbone network, the attention scope of each sentence token through a mask matrix.
Optionally, the resume summary generation device further comprises: a third acquisition module for acquiring, by the target language model after the target resume file and the resume sequence information are input into the target language model, the joint representation word vector, the position vector and the associated text-segment information of each resume token, wherein the joint representation word vector indicates the joint word vector contextually associated with the resume token; an input module for inputting the joint representation word vector, the position vector and the text-segment information into T language processing networks in the target language model, wherein T is smaller than M and T is a positive integer greater than or equal to 1; and an analysis module for performing, by the language processing networks using a self-attention mechanism, sentence-association analysis on each resume token in the resume sequence information to determine the text token vector corresponding to the resume token, wherein the text token vector is used to determine the resume summary information associated with specified resume attributes in the target resume file, obtaining the resume summary text.
The resume summary generation device may further include a processor and a memory; the receiving unit 20, the generation unit 21 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels can be set, and the resume summary information of the resume text is generated by adjusting kernel parameters.
The memory may include volatile memory on a computer-readable medium, random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: receiving a target resume file, and preprocessing the target resume file to obtain resume sequence information; inputting the target resume file and the resume sequence information into a target language model, and receiving the resume summary text associated with the target resume file output by the target language model.
According to another aspect of the embodiments of the invention, a computer-readable storage medium is also provided, comprising a stored computer program, wherein when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute any one of the foregoing resume summary generation methods.
According to another aspect of the embodiments of the invention, an electronic device is also provided, comprising one or more processors and a memory for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement any one of the foregoing resume summary generation methods.
Fig. 3 is a block diagram of the hardware structure of an electronic device (or mobile device) for the resume summary generation method according to an embodiment of the present invention. As shown in Fig. 3, the electronic device may include one or more processors 302 (shown in Fig. 3 as 302a, 302b, ..., 302n; the processors 302 may include, but are not limited to, processing means such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 304 for storing data. In addition, it may further include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a keyboard, a power supply and/or a camera. Those of ordinary skill in the art will appreciate that the configuration shown in Fig. 3 is merely illustrative and does not limit the configuration of the electronic device described above; for example, the electronic device may include more or fewer components than shown in Fig. 3, or have a different configuration from that shown in Fig. 3.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of the units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the couplings, direct couplings or communication connections shown or discussed may be realized through interfaces, and the indirect couplings or communication connections between units or modules may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented as software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disk.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also fall within the scope of the present invention.

Claims (10)

1. A method for generating a resume abstract, characterized by comprising the following steps:
receiving a target resume file, and preprocessing the target resume file to obtain resume sequence information;
inputting the target resume file and the resume sequence information into a target language model, and receiving resume abstract text that is output by the target language model and is related to the target resume file.
2. The generation method according to claim 1, wherein the step of preprocessing the target resume file to obtain the resume sequence information comprises:
performing paragraph segmentation on the target resume file to obtain a paragraph set;
performing word segmentation processing on each resume paragraph in the paragraph set;
and serializing all resume word information obtained after the word segmentation processing to obtain the resume sequence information.
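As an illustration only and not part of the claims: a minimal Python sketch of the preprocessing of claims 1 and 2. The blank-line paragraph heuristic, the whitespace tokenizer, and all function names are assumptions; the claims do not prescribe any particular segmenter or serialization format.

```python
from typing import List

def segment_paragraphs(resume_text: str) -> List[str]:
    # Paragraph segmentation: split on blank lines (an assumed heuristic).
    return [p.strip() for p in resume_text.split("\n\n") if p.strip()]

def segment_words(paragraph: str) -> List[str]:
    # Word segmentation: whitespace splitting stands in for a real word
    # segmenter; Chinese resumes would need a proper segmenter.
    return paragraph.split()

def serialize(resume_text: str) -> List[List[str]]:
    # Serialization: per-paragraph token lists in document order, i.e. the
    # "resume sequence information" passed to the target language model.
    return [segment_words(p) for p in segment_paragraphs(resume_text)]

resume = "Education\nB.Sc. Computer Science, 2015-2019\n\nExperience\nEngineer at Example Corp, 2019-2023"
print(serialize(resume))
```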
3. The generation method according to claim 1, wherein the target language model is trained in advance, and training the target language model comprises:
acquiring an initial resume set, wherein the initial resume set comprises N historical resume texts;
preprocessing each historical resume text in the initial resume set to obtain a resume word segmentation set corresponding to each historical resume text;
masking part of the resume word segmentations in the resume word segmentation set to obtain a masked resume word segmentation set;
inputting the historical resume text and the corresponding masked resume word segmentation set into an initial pre-training model, and outputting historical abstract information in a specified format;
adjusting the historical abstract information output by the initial pre-training model by using the historical resume text and the historical resume abstract result extracted in the historical process, and iteratively training the initial pre-training model;
and confirming that model training is finished and obtaining the target language model when the similarity between the historical abstract information output by the initial pre-training model and the historical resume abstract result is greater than a preset similarity threshold.
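As an illustration only and not part of the claims: a sketch of the training loop of claim 3, assuming BERT-style random masking and a generic string-similarity measure. The 15% masking rate, the similarity function, the threshold, and the model interface (`training_step`, `summarize`) are all assumptions.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens: List[str], rate: float = 0.15) -> List[str]:
    # Mask a random subset of resume tokens; 0.15 is the conventional
    # BERT rate, not a figure given in the claim.
    return [MASK_TOKEN if random.random() < rate else t for t in tokens]

def similarity(generated: str, reference: str) -> float:
    # Stand-in similarity measure; the patent does not name one.
    return SequenceMatcher(None, generated, reference).ratio()

def train(model, resumes: List[str], reference_abstracts: List[str],
          tokenize: Callable[[str], List[str]],
          threshold: float = 0.8, max_epochs: int = 50) -> None:
    # Iterate until every generated abstract is similar enough to the
    # historically extracted abstract, the claim's stopping condition.
    for _ in range(max_epochs):
        for text, reference in zip(resumes, reference_abstracts):
            masked = mask_tokens(tokenize(text))
            model.training_step(text, masked, reference)  # hypothetical API
        scores = [similarity(model.summarize(t), r)       # hypothetical API
                  for t, r in zip(resumes, reference_abstracts)]
        if min(scores) > threshold:
            break
```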
4. The generation method according to claim 3, wherein the step of preprocessing each historical resume text in the initial resume set to obtain the resume word segmentation set corresponding to each historical resume text comprises:
performing word segmentation processing on the historical resume text by using a preset word segmenter;
and converting the initial word segmentation sequence obtained after the word segmentation processing into a symbol format usable by the initial pre-training model, to obtain the resume word segmentation set corresponding to each historical resume text.
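As an illustration only and not part of the claims: claim 4's conversion into a model-usable symbol format corresponds to what, for example, the HuggingFace `transformers` tokenizer performs. The library and the `bert-base-chinese` checkpoint are assumptions; the patent names neither.

```python
from transformers import BertTokenizer

# An assumed Chinese BERT vocabulary; the claim only requires a preset
# word segmenter whose output the pre-training model can consume.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

resume_text = "2015年至2020年在某公司担任软件工程师"
tokens = tokenizer.tokenize(resume_text)             # initial word segmentation sequence
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # model-usable symbol format
print(tokens)
print(token_ids)
```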
5. The generation method according to claim 4, further comprising, before acquiring the initial resume set:
acquiring a word segmentation vocabulary containing all sentence word segmentations;
and training with the word segmentation vocabulary and a plurality of resume formats to obtain the preset word segmenter.
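As an illustration only and not part of the claims: one plausible reading of claim 5 is training a WordPiece-style segmenter over corpora drawn from several resume formats. A sketch using the HuggingFace `tokenizers` library; the file names and special tokens are placeholders, not details from the patent.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Placeholder files standing in for text dumps of the plural resume
# formats mentioned in the claim.
tokenizer.train(["resumes_format_a.txt", "resumes_format_b.txt"], trainer)
tokenizer.save("resume_wordpiece.json")
```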
6. The generation method according to claim 3, further comprising, before acquiring the initial resume set:
acquiring a natural language processing model BERT, and using the natural language processing model BERT as the initial pre-training model;
configuring network parameters and network structures of M networks in the initial pre-training model, wherein the M networks at least comprise a backbone network and a language processing network, and M is a positive integer greater than 1;
and controlling, through a mask matrix, the attention scope of each sentence word segmentation for each layer in the backbone network.
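As an illustration only and not part of the claims: controlling each token's attention scope through a per-layer mask matrix resembles UniLM-style sequence-to-sequence masking over BERT. A PyTorch sketch under that assumption; the bidirectional-source / left-to-right-target split is not stated in the claim.

```python
import torch

def build_attention_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    # 1 = may attend, 0 = masked out. Resume (source) tokens attend to all
    # source tokens; abstract (target) tokens attend to the source and to
    # earlier target tokens only. Applying the same matrix at every layer
    # of the backbone restricts each token's attention scope.
    n = src_len + tgt_len
    mask = torch.zeros(n, n)
    mask[:, :src_len] = 1                                   # all tokens see the source
    mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len))
    mask[:src_len, src_len:] = 0                            # source never sees the target
    return mask

print(build_attention_mask(src_len=3, tgt_len=2))
```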
7. The generation method according to claim 1, further comprising, after inputting the target resume file and the resume sequence information into the target language model:
acquiring, by the target language model, a characterization joint word vector, a position vector, and affiliated text segment information of a resume word, wherein the characterization joint word vector is used for indicating a joint word vector that has a contextual association with the resume word;
inputting the characterization joint word vector, the position vector, and the text segment information into T language processing networks in the target language model, wherein T is smaller than M and T is a positive integer greater than or equal to 1;
and performing, by the language processing network using a self-attention mechanism, sentence association analysis on each resume word in the resume sequence information, and determining a text word segmentation vector corresponding to the resume word, wherein the text word segmentation vector is used for determining resume abstract information associated with a specified resume attribute in the target resume file, so as to obtain the resume abstract text.
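As an illustration only and not part of the claims: claim 7 combines, per token, a context-characterizing word vector, a position vector, and text-segment information, then applies self-attention. A compact PyTorch sketch of that combination; every dimension and the single-head simplification are arbitrary choices.

```python
import torch
import torch.nn.functional as F

d_model, seq_len, vocab_size, n_segments = 64, 10, 1000, 2
tok_emb = torch.nn.Embedding(vocab_size, d_model)   # joint word vectors
pos_emb = torch.nn.Embedding(seq_len, d_model)      # position vectors
seg_emb = torch.nn.Embedding(n_segments, d_model)   # text segment information

token_ids = torch.randint(0, vocab_size, (seq_len,))
positions = torch.arange(seq_len)
segments = torch.zeros(seq_len, dtype=torch.long)   # all tokens in segment 0

# BERT-style input embeddings: the three vectors are summed per token.
x = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segments)

# Single-head self-attention: each resume token is related to every other,
# yielding the per-token "text word segmentation vectors" of the claim.
scores = (x @ x.T) / d_model ** 0.5
text_word_vectors = F.softmax(scores, dim=-1) @ x
print(text_word_vectors.shape)   # torch.Size([10, 64])
```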
8. A device for generating a resume abstract, characterized by comprising:
a receiving unit, configured to receive a target resume file and preprocess the target resume file to obtain resume sequence information;
and a generation unit, configured to input the target resume file and the resume sequence information into a target language model, and receive resume abstract text that is output by the target language model and is related to the target resume file.
9. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program, when run, controls a device on which the computer readable storage medium is located to perform the method for generating a resume abstract according to any one of claims 1 to 7.
10. An electronic device, comprising one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for generating a resume abstract according to any one of claims 1 to 7.
CN202310282784.0A 2023-03-21 2023-03-21 Resume abstract generation method and device, electronic equipment and storage medium Pending CN116521856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310282784.0A CN116521856A (en) 2023-03-21 2023-03-21 Resume abstract generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116521856A (en) 2023-08-01

Family

ID=87400152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310282784.0A Pending CN116521856A (en) 2023-03-21 2023-03-21 Resume abstract generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116521856A (en)

Similar Documents

Publication Publication Date Title
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN108334489B (en) Text core word recognition method and device
CN110472255B (en) Neural network machine translation method, model, electronic terminal, and storage medium
CN111931517A (en) Text translation method and device, electronic equipment and storage medium
CN114676234A (en) Model training method and related equipment
CN112329476A (en) Text error correction method and device, equipment and storage medium
CN111767697B (en) Text processing method and device, computer equipment and storage medium
CN112188312A (en) Method and apparatus for determining video material of news
CN108509539B (en) Information processing method and electronic device
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN116050352A (en) Text encoding method and device, computer equipment and storage medium
CN113239668B (en) Keyword intelligent extraction method and device, computer equipment and storage medium
CN114416981A (en) Long text classification method, device, equipment and storage medium
CN113449081A (en) Text feature extraction method and device, computer equipment and storage medium
CN113342932B (en) Target word vector determining method and device, storage medium and electronic device
CN115587184A (en) Method and device for training key information extraction model and storage medium thereof
CN116521856A (en) Resume abstract generation method and device, electronic equipment and storage medium
CN115617959A (en) Question answering method and device
CN115114917A (en) Military named entity recognition method and device based on vocabulary enhancement
CN110688487A (en) Text classification method and device
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN114818644B (en) Text template generation method, device, equipment and storage medium
CN113591493B (en) Translation model training method and translation model device
CN117876940B (en) Video language task execution and model training method, device, equipment and medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination