CN111967252A - Entity word representation learning method and device, computer equipment and storage medium - Google Patents

Entity word representation learning method and device, computer equipment and storage medium

Info

Publication number
CN111967252A
Authority
CN
China
Prior art keywords
model
vector
training
entity word
dimension
Prior art date
Legal status
Pending
Application number
CN202010890302.6A
Other languages
Chinese (zh)
Inventor
李夏昕
孙璨
张永平
Current Assignee
Shenzhen Bailao Intelligent Co ltd
Original Assignee
Shenzhen Bailao Intelligent Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Bailao Intelligent Co ltd filed Critical Shenzhen Bailao Intelligent Co ltd
Priority to CN202010890302.6A
Publication of CN111967252A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G06F40/35 - Discourse or dialogue representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an entity word representation learning method and apparatus, a computer device and a storage medium. The method comprises: crawling JD data published on the Internet to obtain a target entity word t; representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t; training a tf-idf model with the constructed documents; performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation; sampling based on the dimension values of the generated vectors to construct an embedding training corpus; and training a traditional embedding model on the constructed corpus to obtain a representation model.

Description

Entity word representation learning method and device, computer equipment and storage medium
Technical Field
The present application relates to the technical field of language processing, and in particular to an entity word representation learning method and apparatus for the human resources field based on optimized weight sampling, as well as a computer device and a storage medium.
Background
In the prior art, an entity word is usually represented directly by the context that co-occurs with the target entity word T in the text. For example, when T is a jobtitle in a JD (job description), there are two common practices:
1. Build a vector space model from the entity words in the JD text, compute each dimension value of the vector with the tf-idf value or one of its variant algorithms, and represent the jobtitle as a vector in that vector space.
2. Combine the target entity word T and its context words into a sentence, build an embedding corpus, train an embedding model with word2vec, GloVe, fastText or similar embedding methods, and finally represent the jobtitle as a vector in the embedding space.
The final goal of both methods is to represent an entity word as a vector in a fixed space, so that the similarity of two entity words can be computed from their vectors, or, given an entity word, the top-n entity words most semantically similar to it can be returned.
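Both use cases come down to simple vector arithmetic. As a purely illustrative sketch (the vectors below are placeholders, not outputs of either method), the similarity of two representation vectors can be computed with cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Example: similarity between two toy entity-word vectors.
print(cosine_similarity([0.1, 0.3, 0.0], [0.2, 0.1, 0.4]))
```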
Both the vector space model and the embedding model have inherent defects.
The vectors produced by the vector space model are usually very high-dimensional and very sparse, so similarity computation between two vectors is relatively slow; in practical application scenarios the number of dimensions is therefore usually limited to the order of a few hundred to meet the efficiency requirements of an online system. However, limiting the dimensions means discarding useful information and harming the representation of the entity words. Meanwhile, entity word vectors in a vector space model often have high variance in any given dimension, so if two vectors differ in the values of only a few dimensions, their similarity becomes low. This does not meet the application-level requirements in the human resources field. Moreover, when the differences between entity words are computed with vectors in this space, the resulting curve oscillates strongly and is not linear enough.
The vectors produced by an embedding model are generally low-dimensional and dense, so computation is efficient, and the differences in similarity between different entities are linear, smooth and comparable. The rationale behind the various embedding models is that "the semantics of a word are described by the context words near it in the text". This rationale implies that if the original text contains long-tail low-frequency words or misspellings, and such a word co-occurs with the target word in the same context window every time it appears, it will end up with a very high similarity to the target word. This causes semantic drift in the whole embedding space, thereby degrading the semantic accuracy of the target word's representation vector.
Disclosure of Invention
The present application provides an entity word representation learning method, aiming to solve the above problems.
In a first aspect, the present application provides a method for learning representation of an entity word in the field of human resources, the method including:
crawling JD data published on the Internet to obtain a target entity word t;
representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t;
training a tf-idf model with the constructed documents;
performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation;
sampling based on the dimension values of the generated vectors to construct an embedding training corpus;
and training a traditional embedding model on the constructed corpus to obtain a representation model.
In a second aspect, the present application further provides an entity word representing apparatus, including:
a data acquisition unit, configured to crawl JD data published on the Internet to obtain a target entity word t;
a document composition unit, configured to represent the target entity word t as a document consisting of entity words that co-occur with the target entity word t;
a tf-idf model construction unit, configured to train a tf-idf model with the constructed documents;
a normalization unit, configured to perform L1 normalization on the vector representation of the document under the tf-idf model, and to perform dimension sorting and dimension truncation;
a corpus unit, configured to sample based on the dimension values of the generated vectors to construct an embedding training corpus;
and a model training unit, configured to train a traditional embedding model on the constructed corpus to obtain the representation model.
In a third aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the entity word representation learning method as described above when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the entity word representation learning method as described above.
The application discloses an entity word representation learning method, apparatus, device and storage medium. The method obtains a target entity word t by crawling JD data published on the Internet; represents the target entity word t as a document consisting of entity words that co-occur with it; trains a tf-idf model with the constructed documents; performs L1 normalization on the vector representation of each document under the tf-idf model, followed by dimension sorting and dimension truncation; samples based on the dimension values of the resulting vectors to construct an embedding training corpus; and trains a traditional embedding model on the constructed corpus to obtain a representation model. The method represents entity words more accurately and helps the system better understand the semantic matching degree of entity words in JDs and CVs, so that results better matching the expectations of system users can be recalled.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart diagram of a method for entity word representation learning provided by an embodiment of the present application;
FIG. 2 is a flow diagram illustrating sub-steps of a method for entity word representation learning provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of an entity word representation apparatus according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The embodiments of the present application provide a human resources field entity word representation learning method and apparatus based on optimized weight sampling, a computer device and a storage medium. The method can be applied on a terminal or a server to represent entity words more accurately and to help a system better understand the semantic matching degree of entity words in JDs and CVs, so that results better matching the expectations of system users can be recalled.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of an entity word representation learning method according to an embodiment of the present application.
As shown in FIG. 1, the entity word representation learning method is used for training an entity word representation model based on optimized weight sampling, to be applied in the human resources field. The learning method includes steps S101 to S106.
S101, crawling JD data published on the Internet to obtain a target entity word t.
Specifically, a large amount of JD data published on the Internet is crawled, entity recognition is performed on the JD descriptions, and the entity words of interest at the business level are obtained. Thus jd = {jt, n1, n2, n3, ..., nn}, where jd is a JD corpus, jt is the jobtitle in the JD, and n1 to nn are entity words in the JD, usually words of the tool, technology, process, certificate or product type. jd ∈ D, where D is the set of all JD corpora on which entity recognition has been performed.
S102, representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t.
Specifically, each unique jt is used as a document name, and all entity words that co-occur with jt (i.e., that appear in the same jd) are collected into the jt document. That is, jt_doc = [n1, n2, n3, ..., nn], where the entity words n come from all jd texts whose jobtitle is jt (they may come from many different jd texts, as long as the jobtitle is the same jt). An entity word n may appear multiple times inside jt_doc. jt_doc ∈ M, where M is the set of all jt_doc documents.
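As an illustration only (the toy corpus, function and variable names below are hypothetical and not part of the application), this grouping step can be sketched as follows, assuming each JD has already been reduced to its jobtitle and a list of recognized entity words:

```python
from collections import defaultdict

# Each JD is assumed to be pre-processed into (jobtitle, [entity words]).
jd_corpus = [
    ("data engineer", ["python", "spark", "sql", "airflow"]),
    ("data engineer", ["python", "hadoop", "sql"]),
    ("backend developer", ["java", "mysql", "redis", "docker"]),
]

def build_jt_docs(jds):
    """Merge all entity words that co-occur with the same jobtitle into one document."""
    jt_docs = defaultdict(list)
    for jt, entities in jds:
        jt_docs[jt].extend(entities)  # repeated occurrences are kept on purpose
    return dict(jt_docs)

M = build_jt_docs(jd_corpus)
# M["data engineer"] -> ["python", "spark", "sql", "airflow", "python", "hadoop", "sql"]
```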
S103, training a tf-idf model by using the constructed document.
Specifically, a tf-idf model is trained with all jt_doc documents as input.
In an alternative embodiment, the tf values in model training are not calculated from raw counts, but in a normalized manner. For example, term frequency tf(t, d) = f(t, d) / S, where f(t, d) is the number of times the entity word t occurs in jt_doc d, and S is the total number of entity words in jt_doc d. The tf value can also be calculated by log normalization, i.e., tf(t, d) = log(1 + f(t, d)). Both formulas act as normalization and smoothing, and can offset the semantic bias that may be caused by an uneven distribution of the number of documents for different jts in D. The dimension value of a particular entity word used to represent jt may also be given an appropriate extra weight. The idf value in model training can use the standard formulation, i.e., inverse document frequency idf(t) = log(N / nt), or the smoothed formulation, i.e., idf(t) = log(N / (1 + nt)) + 1, where N is the number of elements in the set M and nt is the number of documents in M containing the entity word t. The idf value down-weights high-frequency words, so that even if different jts co-occur with the same high-frequency word, they do not become highly similar merely because of that word.
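The normalized tf and smoothed idf formulas above can be written out directly. The following sketch is illustrative only, assumes the jt_docs mapping produced in the previous snippet, and the helper names are hypothetical:

```python
import math
from collections import Counter

def tf(term, doc_terms, log_normalized=False):
    """Normalized term frequency: count / document length, or log(1 + count)."""
    count = Counter(doc_terms)[term]
    return math.log(1 + count) if log_normalized else count / len(doc_terms)

def idf(term, all_docs, smooth=True):
    """Inverse document frequency over the jt_doc collection, optionally smoothed."""
    N = len(all_docs)
    nt = sum(1 for terms in all_docs.values() if term in terms)
    return math.log(N / (1 + nt)) + 1 if smooth else math.log(N / nt)

# Example: weight of "python" in the "data engineer" document from the previous sketch.
# tf("python", M["data engineer"]) * idf("python", M)
```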
S104, performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation.
An embedding-model training corpus (sentences) is then constructed by sampling, for each jt_doc (representing a jobtitle entity word jt) in the set M.
In an alternative embodiment, referring to FIG. 2, the operation of S104 may include the following steps.
S1041, for each jt_doc in M, calculating its tf-idf vector v in the vector space by using the tf-idf model trained in S103.
Specifically, each dimension of the vector v is the tf value multiplied by the idf value from S103. v ∈ V, where V is the set of all tf-idf vectors and each vector v corresponds to a particular entity word jt.
S1042, performing L1 normalization on each vector in V, so that all dimension values of a vector v are greater than or equal to 0.0 and sum to 1.0; that is, the vector v becomes a probability distribution.
S1043, sorting the dimensions of each vector in V in descending order of their corresponding tf-idf values.
Specifically, after sorting, the higher the tf-idf value, the earlier the dimension's position; the entity word corresponding to an earlier dimension is more relevant to the entity word jt corresponding to v (considering both co-occurrence and rarity).
S1044, truncating each descending-sorted vector v in V from its original length L1 to a uniform new length L2.
Specifically, the value of L2 needs to be selected by inspecting the vector data. L2 is chosen so that, after truncation, the entity words corresponding to the first L2 dimensions are all related to the entity word jt, and the sum of the values of the first L2 dimensions is sufficiently large (i.e., greater than a certain value r ∈ (0.0, 1.0)). The value of r is a probability threshold and needs to be chosen empirically. The two parameters L2 and r can be tuned jointly according to the performance of downstream tasks.
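Steps S1041 to S1044 (computing the tf-idf vector, L1-normalizing it, sorting the dimensions by weight and truncating to length L2) can be sketched as below; this reuses the hypothetical tf/idf helpers above, and the cut-off L2 is an assumed tuning parameter rather than a value prescribed by the application:

```python
def truncated_distribution(jt, jt_docs, L2=50):
    """Top-L2 (entity word, weight) pairs for jt, as an L1-normalized tf-idf distribution."""
    doc = jt_docs[jt]
    weights = {term: tf(term, doc) * idf(term, jt_docs) for term in set(doc)}
    total = sum(weights.values())
    normalized = {term: w / total for term, w in weights.items()}  # L1: values sum to 1.0
    ranked = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:L2]  # dimension truncation to the uniform length L2
```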
S105, sampling based on the dimension values of the generated vectors to construct an embedding training corpus.
Specifically, each v in V, now of dimension L2, corresponds to a specific entity word jt; each dimension of v corresponds to an entity word, and the same dimension index in different vectors may correspond to different entity words. For each jt, the entity word n corresponding to each dimension of its vector v is sampled according to that dimension's value. There are many possible sampling methods, but the basic principle is to decide whether an entity word is sampled based on a probability value: in the simplest form, if a dimension has value q and corresponds to entity word n, then n is kept with probability q. All dimensions of the v corresponding to a jt are traversed and sampled; assuming the sampling yields the entity words n1, n2, n3, n4, n50, n53 and n81, the sentence n1, n2, n3, n4, jt, n50, n53, n81 is generated. Note that the jt word is placed in the middle of the sentence, which facilitates setting the context-window parameter of the subsequent embedding algorithm. For each jt, sampling is repeated to generate multiple sentences, each containing the jt word in the middle position; in principle the same number of sentences is generated for every jt. The number p of sentences generated for a jt is set by heuristic rules; for example, generating L2 sentences per jt makes it probable that the entity word corresponding to each dimension appears in at least one sentence. The sentences constructed in this way form the sample data.
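A minimal sampling sketch following the description above: each dimension's entity word is kept with probability equal to its weight, the jt word is placed in the middle of the sentence, and L2 sentences are generated per jt. The function and parameter names are illustrative assumptions:

```python
import random

def sample_sentences(jt, dist, num_sentences):
    """dist: truncated (entity word, weight) list; each weight acts as a keep-probability."""
    sentences = []
    for _ in range(num_sentences):
        kept = [term for term, weight in dist if random.random() < weight]
        middle = len(kept) // 2
        sentences.append(kept[:middle] + [jt] + kept[middle:])  # jt goes in the middle
    return sentences

# corpus = []
# for jt in M:
#     dist = truncated_distribution(jt, M, L2=50)
#     corpus.extend(sample_sentences(jt, dist, num_sentences=len(dist)))  # L2 sentences per jt
```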
S106, training a traditional embedding model on the constructed corpus to obtain a representation model.
The training corpus T for the traditional embedding algorithm is obtained through S105, and an embedding model is then trained with a mainstream word embedding algorithm to obtain vector representations of the entity words in T. The newly obtained vectors capture the semantic information of the entity words in T more accurately, and give better results on both tasks: computing the similarity of two entity words, and recalling the top-n entity words most similar to a given entity word.
In this embodiment, the traditional embedding model is trained with a common mainstream word embedding algorithm; the traditional embedding algorithm may be any one of word2vec, GloVe and fastText.
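As a final illustration, the sampled sentences can be fed to any off-the-shelf word-embedding trainer. The gensim word2vec call below is one plausible choice, not the implementation prescribed by the application, and the hyperparameters are assumptions:

```python
from gensim.models import Word2Vec

# `corpus` is assumed to be the list of sampled sentences from the previous step.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# Vector for a jobtitle entity word, and its top-10 most similar entity words.
vector = model.wv["data engineer"]
print(model.wv.most_similar("data engineer", topn=10))
```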
Referring to FIG. 3, FIG. 3 is a schematic block diagram of an entity word representation apparatus according to an embodiment of the present application. The apparatus may be configured in a server for executing the aforementioned learning method.
As shown in FIG. 3, the entity word representation apparatus 200 includes: a data acquisition unit 201, a document composition unit 202, a tf-idf model construction unit 203, a normalization unit 204, a corpus unit 205 and a model training unit 206.
The data acquisition unit 201 crawls JD data published on the Internet to obtain a target entity word t.
The document composition unit 202 represents the target entity word t as a document consisting of entity words that co-occur with the target entity word t.
The tf-idf model construction unit 203 trains the tf-idf model with the constructed documents.
The normalization unit 204 performs L1 normalization on the vector representation of the document under the tf-idf model, and performs dimension sorting and dimension truncation.
The corpus unit 205 samples based on the dimension values of the generated vectors to construct an embedding training corpus.
The model training unit 206 trains the traditional embedding model on the constructed corpus to obtain the representation model.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 4, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of the human resources field entity word representation learning methods based on optimized weight sampling.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of the computer program on the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any one of the human resources field entity word representation learning methods based on optimized weight sampling.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
crawling JD data published on the Internet to obtain a target entity word t; representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t; training a tf-idf model with the constructed documents; performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation; sampling based on the dimension values of the generated vectors to construct an embedding training corpus; and training a traditional embedding model on the constructed corpus to obtain a representation model.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. An entity word representation learning method is characterized by comprising the following steps:
crawling JD data published on the Internet to obtain a target entity word t;
representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t;
training a tf-idf model with the constructed documents;
performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation;
sampling based on the dimension values of the generated vectors to construct an embedding training corpus;
and training a traditional embedding model on the constructed corpus to obtain a representation model.
2. The learning method of claim 1, wherein the training of the tf-idf model with the constructed documents comprises calculation of tf values; the calculation of the tf value comprises: term frequency tf(t, d) = f(t, d) / S, or term frequency tf(t, d) = log(1 + f(t, d)); where f(t, d) is the number of times the entity word t appears in jt_doc d, and S is the total number of entity words in jt_doc d.
3. The learning method of claim 1, wherein the training of tf-idf models with constructed documents further comprises the calculation of idf values, and the tf values are calculated in a normalized manner.
4. The learning method according to claim 2, wherein the calculation of the idf value comprises: inverse document frequency idf(t) = log(N / nt), or smoothed inverse document frequency idf(t) = log(N / (1 + nt)) + 1; where N is the number of elements in the set M and nt is the number of documents in the set M containing the entity word t.
5. The learning method according to claim 1, wherein the performing L1 normalization on the vector representations of the documents under the tf-idf model and performing dimension sorting and dimension truncation comprises:
calculating, with the trained tf-idf model, the tf-idf vector v of each jt_doc in M in the vector space;
performing L1 normalization on each vector in V so that all dimension values of a vector v are greater than or equal to 0.0 and sum to 1.0;
sorting the dimensions of each vector in V in descending order of their corresponding tf-idf values;
and truncating each descending-sorted vector v in V from its original length L1 to a uniform new length L2.
6. The learning method of claim 2, wherein the traditional embedding model comprises: any one of word2vec, GloVe and fastText.
7. An entity word representation apparatus, comprising:
a data acquisition unit, configured to crawl JD data published on the Internet to obtain a target entity word t;
a document composition unit, configured to represent the target entity word t as a document consisting of entity words that co-occur with the target entity word t;
a tf-idf model construction unit, configured to train a tf-idf model with the constructed documents;
a normalization unit, configured to perform L1 normalization on the vector representation of the document under the tf-idf model, and to perform dimension sorting and dimension truncation;
a corpus unit, configured to sample based on the dimension values of the generated vectors to construct an embedding training corpus;
and a model training unit, configured to train a traditional embedding model on the constructed corpus to obtain the representation model.
8. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor for executing the computer program and implementing the learning method of any one of claims 1 to 7 when executing the computer program.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the learning method according to any one of claims 1 to 7.
CN202010890302.6A 2020-08-29 2020-08-29 Entity word representation learning method and device, computer equipment and storage medium Pending CN111967252A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010890302.6A CN111967252A (en) 2020-08-29 2020-08-29 Entity word representation learning method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010890302.6A CN111967252A (en) 2020-08-29 2020-08-29 Entity word representation learning method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111967252A true CN111967252A (en) 2020-11-20

Family

ID=73400686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010890302.6A Pending CN111967252A (en) 2020-08-29 2020-08-29 Entity word representation learning method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111967252A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807211A (en) * 2010-04-30 2010-08-18 南开大学 XML-based retrieval method oriented to constraint on integrated paths of large amount of small-size XML documents
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
US20190130025A1 (en) * 2017-10-30 2019-05-02 International Business Machines Corporation Ranking of documents based on their semantic richness
CN107832306A (en) * 2017-11-28 2018-03-23 武汉大学 A kind of similar entities method for digging based on Doc2vec
CN110781393A (en) * 2019-10-23 2020-02-11 中南大学 Traffic event factor extraction algorithm based on graph model and expansion convolution neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Fangfang; Cao Xingchao: "Intelligent passage ranking based on literal and semantic relevance matching", Journal of Shandong University (Natural Science), No. 03 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392647A (en) * 2020-11-25 2021-09-14 腾讯科技(深圳)有限公司 Corpus generation method, related device, computer equipment and storage medium
CN113392647B (en) * 2020-11-25 2024-04-26 腾讯科技(深圳)有限公司 Corpus generation method, related device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination