CN111967252A - Entity word representation learning method and device, computer equipment and storage medium - Google Patents

Entity word representation learning method and device, computer equipment and storage medium

Info

Publication number
CN111967252A
Authority
CN
China
Prior art keywords
model
vector
training
entity word
dimension
Prior art date
Legal status
Pending
Application number
CN202010890302.6A
Other languages
Chinese (zh)
Inventor
李夏昕
孙璨
张永平
Current Assignee
Shenzhen Bailao Intelligent Co ltd
Original Assignee
Shenzhen Bailao Intelligent Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Bailao Intelligent Co ltd filed Critical Shenzhen Bailao Intelligent Co ltd
Priority to CN202010890302.6A
Publication of CN111967252A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G06F40/35 - Discourse or dialogue representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an entity word representation learning method and apparatus, a computer device and a storage medium. The method comprises: crawling JD data published on the Internet to obtain a target entity word t; representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t; training a tf-idf model with the constructed documents; performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation; sampling based on the dimension values of the generated vectors to construct an embedding training corpus; and training a traditional embedding model on the constructed corpus to obtain a representation model.

Description

Entity word representation learning method and device, computer equipment and storage medium
Technical Field
The present application relates to the technical field of language processing, and in particular to an entity word representation learning method and apparatus for the human resources field based on optimized weight sampling, as well as a computer device and a storage medium.
Background
In the prior art, an entity word is usually represented directly by the context that co-occurs with the target entity word T in the text. For example, when T is a jobtitle in a JD (job description), there are two common practices:
1. Build a vector space model from the entity words in the JD text, compute each dimension value of the vector with the tf-idf value or one of its variant algorithms, and represent the jobtitle as a vector in that vector space.
2. Combine the target entity word T and its context words into a sentence, build an embedding corpus, train an embedding model with word2vec, GloVe, fastText or similar embedding methods, and finally represent the jobtitle as a vector in the embedding space.
The final goal of both methods is to represent an entity word as a vector in a fixed space, so that the similarity of two entity words can be computed from their vectors, or, given an entity word, the top-n entity words most semantically similar to it can be returned.
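Both use cases come down to simple vector arithmetic. As a purely illustrative sketch (the vectors below are placeholders, not outputs of either method), the similarity of two representation vectors can be computed with cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Example: similarity between two toy entity-word vectors.
print(cosine_similarity([0.1, 0.3, 0.0], [0.2, 0.1, 0.4]))
```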
Both the vector space model and the embedding model have inherent defects.
The vectors produced by the vector space model are usually very high-dimensional and very sparse, so similarity computation between two vectors is relatively slow; in practical application scenarios the number of dimensions is therefore usually limited to the order of a few hundred to meet the efficiency requirements of an online system. However, limiting the dimensions means discarding useful information and harming the representation of the entity words. Meanwhile, entity word vectors in a vector space model often have high variance in any given dimension, so if two vectors differ in the values of only a few dimensions, their similarity becomes low. This does not meet the application-level requirements in the human resources field. Moreover, when the differences between entity words are computed with vectors in this space, the resulting curve oscillates strongly and is not linear enough.
The vectors produced by an embedding model are generally low-dimensional and dense, so computation is efficient, and the differences in similarity between different entities are linear, smooth and comparable. The rationale behind the various embedding models is that "the semantics of a word are described by the context words near it in the text". This rationale implies that if the original text contains long-tail low-frequency words or misspellings, and such a word co-occurs with the target word in the same context window every time it appears, it will end up with a very high similarity to the target word. This causes semantic drift in the whole embedding space, thereby degrading the semantic accuracy of the target word's representation vector.
Disclosure of Invention
The present application provides an entity word representation learning method, aiming to solve the above problems.
In a first aspect, the present application provides a method for learning representation of an entity word in the field of human resources, the method including:
crawling JD data published on the Internet to obtain a target entity word t;
representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t;
training a tf-idf model with the constructed documents;
performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation;
sampling based on the dimension values of the generated vectors to construct an embedding training corpus;
and training a traditional embedding model on the constructed corpus to obtain a representation model.
In a second aspect, the present application further provides an entity word representing apparatus, including:
a data acquisition unit, configured to crawl JD data published on the Internet to obtain a target entity word t;
a document composition unit, configured to represent the target entity word t as a document consisting of entity words that co-occur with the target entity word t;
a tf-idf model construction unit, configured to train a tf-idf model with the constructed documents;
a normalization unit, configured to perform L1 normalization on the vector representation of the document under the tf-idf model, and to perform dimension sorting and dimension truncation;
a corpus unit, configured to sample based on the dimension values of the generated vectors to construct an embedding training corpus;
and a model training unit, configured to train a traditional embedding model on the constructed corpus to obtain the representation model.
In a third aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the entity word representation learning method as described above when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the entity word representation learning method as described above.
The application discloses an entity word representation learning method, apparatus, device and storage medium. The method obtains a target entity word t by crawling JD data published on the Internet; represents the target entity word t as a document consisting of entity words that co-occur with it; trains a tf-idf model with the constructed documents; performs L1 normalization on the vector representation of each document under the tf-idf model, followed by dimension sorting and dimension truncation; samples based on the dimension values of the resulting vectors to construct an embedding training corpus; and trains a traditional embedding model on the constructed corpus to obtain a representation model. The method represents entity words more accurately and helps the system better understand the semantic matching degree of entity words in JDs and CVs, so that results better matching the expectations of system users can be recalled.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart diagram of a method for entity word representation learning provided by an embodiment of the present application;
FIG. 2 is a flow diagram illustrating sub-steps of a method for entity word representation learning provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of an entity word representation apparatus according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The embodiments of the present application provide a human resources field entity word representation learning method and apparatus based on optimized weight sampling, a computer device and a storage medium. The method can be applied on a terminal or a server to represent entity words more accurately and to help a system better understand the semantic matching degree of entity words in JDs and CVs, so that results better matching the expectations of system users can be recalled.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of an entity word representation learning method according to an embodiment of the present application.
As shown in FIG. 1, the entity word representation learning method is used for training an entity word representation model based on optimized weight sampling, to be applied in the human resources field. The learning method includes steps S101 to S106.
S101, crawling JD data published on the Internet to obtain a target entity word t.
Specifically, a large amount of JD data published on the Internet is crawled, entity recognition is performed on the JD descriptions, and the entity words of interest at the business level are obtained. Thus jd = {jt, n1, n2, n3, ..., nn}, where jd is a JD corpus, jt is the jobtitle in the JD, and n1 to nn are entity words in the JD, usually words of the tool, technology, process, certificate or product type. jd ∈ D, where D is the set of all JD corpora on which entity recognition has been performed.
S102, representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t.
Specifically, each unique jt is used as a document name, and all entity words that co-occur with jt (i.e., that appear in the same jd) are collected into the jt document. That is, jt_doc = [n1, n2, n3, ..., nn], where the entity words n come from all jd texts whose jobtitle is jt (they may come from many different jd texts, as long as the jobtitle is the same jt). An entity word n may appear multiple times inside jt_doc. jt_doc ∈ M, where M is the set of all jt_doc documents.
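As an illustration only (the toy corpus, function and variable names below are hypothetical and not part of the application), this grouping step can be sketched as follows, assuming each JD has already been reduced to its jobtitle and a list of recognized entity words:

```python
from collections import defaultdict

# Each JD is assumed to be pre-processed into (jobtitle, [entity words]).
jd_corpus = [
    ("data engineer", ["python", "spark", "sql", "airflow"]),
    ("data engineer", ["python", "hadoop", "sql"]),
    ("backend developer", ["java", "mysql", "redis", "docker"]),
]

def build_jt_docs(jds):
    """Merge all entity words that co-occur with the same jobtitle into one document."""
    jt_docs = defaultdict(list)
    for jt, entities in jds:
        jt_docs[jt].extend(entities)  # repeated occurrences are kept on purpose
    return dict(jt_docs)

M = build_jt_docs(jd_corpus)
# M["data engineer"] -> ["python", "spark", "sql", "airflow", "python", "hadoop", "sql"]
```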
S103, training a tf-idf model by using the constructed document.
Specifically, a tf-idf model is trained with all jt_doc documents as input.
In an alternative embodiment, the tf values in model training are not calculated from raw counts, but in a normalized manner. For example, term frequency tf(t, d) = f(t, d) / S, where f(t, d) is the number of times the entity word t occurs in jt_doc d, and S is the total number of entity words in jt_doc d. The tf value can also be calculated by log normalization, i.e., tf(t, d) = log(1 + f(t, d)). Both formulas act as normalization and smoothing, and can offset the semantic bias that may be caused by an uneven distribution of the number of documents for different jts in D. The dimension value of a particular entity word used to represent jt may also be given an appropriate extra weight. The idf value in model training can use the standard formulation, i.e., inverse document frequency idf(t) = log(N / nt), or the smoothed formulation, i.e., idf(t) = log(N / (1 + nt)) + 1, where N is the number of elements in the set M and nt is the number of documents in M containing the entity word t. The idf value down-weights high-frequency words, so that even if different jts co-occur with the same high-frequency word, they do not become highly similar merely because of that word.
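The normalized tf and smoothed idf formulas above can be written out directly. The following sketch is illustrative only, assumes the jt_docs mapping produced in the previous snippet, and the helper names are hypothetical:

```python
import math
from collections import Counter

def tf(term, doc_terms, log_normalized=False):
    """Normalized term frequency: count / document length, or log(1 + count)."""
    count = Counter(doc_terms)[term]
    return math.log(1 + count) if log_normalized else count / len(doc_terms)

def idf(term, all_docs, smooth=True):
    """Inverse document frequency over the jt_doc collection, optionally smoothed."""
    N = len(all_docs)
    nt = sum(1 for terms in all_docs.values() if term in terms)
    return math.log(N / (1 + nt)) + 1 if smooth else math.log(N / nt)

# Example: weight of "python" in the "data engineer" document from the previous sketch.
# tf("python", M["data engineer"]) * idf("python", M)
```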
S104, performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation.
An embedding-model training corpus (sentences) is then constructed by sampling, for each jt_doc (representing a jobtitle entity word jt) in the set M.
In an alternative embodiment, referring to FIG. 2, the operation of S104 may include the following steps.
S1041, for each jt_doc in M, calculating its tf-idf vector v in the vector space by using the tf-idf model trained in S103.
Specifically, each dimension of the vector v is the tf value multiplied by the idf value from S103. v ∈ V, where V is the set of all tf-idf vectors and each vector v corresponds to a particular entity word jt.
S1042, performing L1 normalization on each vector in V, so that all dimension values of a vector v are greater than or equal to 0.0 and sum to 1.0; that is, the vector v becomes a probability distribution.
S1043, sorting the dimensions of each vector in V in descending order of their corresponding tf-idf values.
Specifically, after sorting, the higher the tf-idf value, the earlier the dimension's position; the entity word corresponding to an earlier dimension is more relevant to the entity word jt corresponding to v (considering both co-occurrence and rarity).
S1044, truncating each descending-sorted vector v in V from its original length L1 to a uniform new length L2.
Specifically, the value of L2 needs to be selected by inspecting the vector data. L2 is chosen so that, after truncation, the entity words corresponding to the first L2 dimensions are all related to the entity word jt, and the sum of the values of the first L2 dimensions is sufficiently large (i.e., greater than a certain value r ∈ (0.0, 1.0)). The value of r is a probability threshold and needs to be chosen empirically. The two parameters L2 and r can be tuned jointly according to the performance of downstream tasks.
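Steps S1041 to S1044 (computing the tf-idf vector, L1-normalizing it, sorting the dimensions by weight and truncating to length L2) can be sketched as below; this reuses the hypothetical tf/idf helpers above, and the cut-off L2 is an assumed tuning parameter rather than a value prescribed by the application:

```python
def truncated_distribution(jt, jt_docs, L2=50):
    """Top-L2 (entity word, weight) pairs for jt, as an L1-normalized tf-idf distribution."""
    doc = jt_docs[jt]
    weights = {term: tf(term, doc) * idf(term, jt_docs) for term in set(doc)}
    total = sum(weights.values())
    normalized = {term: w / total for term, w in weights.items()}  # L1: values sum to 1.0
    ranked = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:L2]  # dimension truncation to the uniform length L2
```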
S105, sampling based on the dimension values of the generated vectors to construct an embedding training corpus.
Specifically, each v in V, now of dimension L2, corresponds to a specific entity word jt; each dimension of v corresponds to an entity word, and the same dimension index in different vectors may correspond to different entity words. For each jt, the entity word n corresponding to each dimension of its vector v is sampled according to that dimension's value. There are many possible sampling methods, but the basic principle is to decide whether an entity word is sampled based on a probability value: in the simplest form, if a dimension has value q and corresponds to entity word n, then n is kept with probability q. All dimensions of the v corresponding to a jt are traversed and sampled; assuming the sampling yields the entity words n1, n2, n3, n4, n50, n53 and n81, the sentence n1, n2, n3, n4, jt, n50, n53, n81 is generated. Note that the jt word is placed in the middle of the sentence, which facilitates setting the context-window parameter of the subsequent embedding algorithm. For each jt, sampling is repeated to generate multiple sentences, each containing the jt word in the middle position; in principle the same number of sentences is generated for every jt. The number p of sentences generated for a jt is set by heuristic rules; for example, generating L2 sentences per jt makes it probable that the entity word corresponding to each dimension appears in at least one sentence. The sentences constructed in this way form the sample data.
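A minimal sampling sketch following the description above: each dimension's entity word is kept with probability equal to its weight, the jt word is placed in the middle of the sentence, and L2 sentences are generated per jt. The function and parameter names are illustrative assumptions:

```python
import random

def sample_sentences(jt, dist, num_sentences):
    """dist: truncated (entity word, weight) list; each weight acts as a keep-probability."""
    sentences = []
    for _ in range(num_sentences):
        kept = [term for term, weight in dist if random.random() < weight]
        middle = len(kept) // 2
        sentences.append(kept[:middle] + [jt] + kept[middle:])  # jt goes in the middle
    return sentences

# corpus = []
# for jt in M:
#     dist = truncated_distribution(jt, M, L2=50)
#     corpus.extend(sample_sentences(jt, dist, num_sentences=len(dist)))  # L2 sentences per jt
```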
S106, training a traditional embedding model on the constructed corpus to obtain a representation model.
The training corpus T for the traditional embedding algorithm is obtained through S105, and an embedding model is then trained with a mainstream word embedding algorithm to obtain vector representations of the entity words in T. The newly obtained vectors capture the semantic information of the entity words in T more accurately, and give better results on both tasks: computing the similarity of two entity words, and recalling the top-n entity words most similar to a given entity word.
In this embodiment, the traditional embedding model is trained with a common mainstream word embedding algorithm; the traditional embedding algorithm may be any one of word2vec, GloVe and fastText.
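As a final illustration, the sampled sentences can be fed to any off-the-shelf word-embedding trainer. The gensim word2vec call below is one plausible choice, not the implementation prescribed by the application, and the hyperparameters are assumptions:

```python
from gensim.models import Word2Vec

# `corpus` is assumed to be the list of sampled sentences from the previous step.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# Vector for a jobtitle entity word, and its top-10 most similar entity words.
vector = model.wv["data engineer"]
print(model.wv.most_similar("data engineer", topn=10))
```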
Referring to FIG. 3, FIG. 3 is a schematic block diagram of an entity word representation apparatus according to an embodiment of the present application. The apparatus may be configured in a server for executing the aforementioned learning method.
As shown in FIG. 3, the entity word representation apparatus 200 includes: a data acquisition unit 201, a document composition unit 202, a tf-idf model construction unit 203, a normalization unit 204, a corpus unit 205 and a model training unit 206.
The data acquisition unit 201 crawls JD data published on the Internet to obtain a target entity word t.
The document composition unit 202 represents the target entity word t as a document consisting of entity words that co-occur with the target entity word t.
The tf-idf model construction unit 203 trains the tf-idf model with the constructed documents.
The normalization unit 204 performs L1 normalization on the vector representation of the document under the tf-idf model, and performs dimension sorting and dimension truncation.
The corpus unit 205 samples based on the dimension values of the generated vectors to construct an embedding training corpus.
The model training unit 206 trains the traditional embedding model on the constructed corpus to obtain the representation model.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 4, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of the human resources field entity word representation learning methods based on optimized weight sampling.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of the computer program on the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any one of the human resources field entity word representation learning methods based on optimized weight sampling.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
crawling JD data published on the Internet to obtain a target entity word t; representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t; training a tf-idf model with the constructed documents; performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation; sampling based on the dimension values of the generated vectors to construct an embedding training corpus; and training a traditional embedding model on the constructed corpus to obtain a representation model.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. An entity word representation learning method is characterized by comprising the following steps:
crawling JD data published on the Internet to obtain a target entity word t;
representing the target entity word t as a document consisting of entity words that co-occur with the target entity word t;
training a tf-idf model with the constructed documents;
performing L1 normalization on the vector representation of the document under the tf-idf model, and performing dimension sorting and dimension truncation;
sampling based on the dimension values of the generated vectors to construct an embedding training corpus;
and training a traditional embedding model on the constructed corpus to obtain a representation model.
2. The learning method of claim 1, wherein the training of the tf-idf model with the constructed documents comprises calculation of tf values; the calculation of the tf value comprises: term frequency tf(t, d) = f(t, d) / S, or term frequency tf(t, d) = log(1 + f(t, d)); where f(t, d) is the number of times the entity word t appears in jt_doc d, and S is the total number of entity words in jt_doc d.
3. The learning method of claim 1, wherein the training of tf-idf models with constructed documents further comprises the calculation of idf values, and the tf values are calculated in a normalized manner.
4. The learning method according to claim 2, wherein the calculation of the idf value comprises: inverse document frequency idf(t) = log(N / nt), or smoothed inverse document frequency idf(t) = log(N / (1 + nt)) + 1; where N is the number of elements in the set M and nt is the number of documents in the set M containing the entity word t.
5. The learning method according to claim 1, wherein the performing L1 normalization on the vector representations of the documents under the tf-idf model and performing dimension sorting and dimension truncation comprises:
calculating, with the trained tf-idf model, the tf-idf vector v of each jt_doc in M in the vector space;
performing L1 normalization on each vector in V so that all dimension values of a vector v are greater than or equal to 0.0 and sum to 1.0;
sorting the dimensions of each vector in V in descending order of their corresponding tf-idf values;
and truncating each descending-sorted vector v in V from its original length L1 to a uniform new length L2.
6. The learning method of claim 2, wherein the traditional embedding model comprises: any one of word2vec, GloVe and fastText.
7. An entity word representation apparatus, comprising:
a data acquisition unit, configured to crawl JD data published on the Internet to obtain a target entity word t;
a document composition unit, configured to represent the target entity word t as a document consisting of entity words that co-occur with the target entity word t;
a tf-idf model construction unit, configured to train a tf-idf model with the constructed documents;
a normalization unit, configured to perform L1 normalization on the vector representation of the document under the tf-idf model, and to perform dimension sorting and dimension truncation;
a corpus unit, configured to sample based on the dimension values of the generated vectors to construct an embedding training corpus;
and a model training unit, configured to train a traditional embedding model on the constructed corpus to obtain the representation model.
8. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor for executing the computer program and implementing the learning method of any one of claims 1 to 7 when executing the computer program.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the learning method according to any one of claims 1 to 7.
CN202010890302.6A 2020-08-29 2020-08-29 Entity word representation learning method and device, computer equipment and storage medium Pending CN111967252A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010890302.6A CN111967252A (en) 2020-08-29 2020-08-29 Entity word representation learning method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010890302.6A CN111967252A (en) 2020-08-29 2020-08-29 Entity word representation learning method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111967252A true CN111967252A (en) 2020-11-20

Family

ID=73400686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010890302.6A Pending CN111967252A (en) 2020-08-29 2020-08-29 Entity word representation learning method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111967252A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807211A (en) * 2010-04-30 2010-08-18 南开大学 XML-based retrieval method oriented to constraint on integrated paths of large amount of small-size XML documents
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
US20190130025A1 (en) * 2017-10-30 2019-05-02 International Business Machines Corporation Ranking of documents based on their semantic richness
CN107832306A (en) * 2017-11-28 2018-03-23 武汉大学 A kind of similar entities method for digging based on Doc2vec
CN110781393A (en) * 2019-10-23 2020-02-11 中南大学 Traffic event factor extraction algorithm based on graph model and expansion convolution neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Fangfang; Cao Xingchao: "Intelligent passage ranking based on literal and semantic relevance matching", Journal of Shandong University (Natural Science), No. 03 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392647A (en) * 2020-11-25 2021-09-14 腾讯科技(深圳)有限公司 Corpus generation method, related device, computer equipment and storage medium
CN113392647B (en) * 2020-11-25 2024-04-26 腾讯科技(深圳)有限公司 Corpus generation method, related device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination