CN107894979B - Compound word processing method, device and equipment for semantic mining - Google Patents


Publication number
CN107894979B
Authority
CN
China
Prior art keywords
dimensional
word
compound
words
participles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711163429.2A
Other languages
Chinese (zh)
Other versions
CN107894979A (en)
Inventor
陈徐屹
冯仕堃
朱志凡
何径舟
朱丹翔
曹宇慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201711163429.2A
Publication of CN107894979A
Application granted
Publication of CN107894979B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis

Abstract

The invention provides a compound word processing method, device and equipment for semantic mining. The method comprises: determining M word segments (participles) of each sentence in a training corpus; selecting N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M; performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1; and screening out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words, and inputting the N-dimensional target compound words to a bag-of-words model for semantic mining. More semantic features of larger granularity are thereby introduced into the bag-of-words model, further improving its effect.

Description

Compound word processing method, device and equipment for semantic mining
Technical Field
The invention relates to the technical field of information processing, in particular to a method, a device and equipment for processing compound words for semantic mining.
Background
Artificial Intelligence, abbreviated AI, is a new technical science that researches and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. It is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Research in the field includes robotics, speech recognition, image recognition, natural language processing and expert systems.
At present, the Bag of Words model is widely applied to text semantic relevance matching tasks. In the related art, Bigrams are used to count the occurrence probability of two adjacent words in the training corpus, and the word pairs are ranked by their T-statistics to obtain the probability that the two words appear together. A compound word obtained by binding two words that have a high probability of appearing together is then embedded into the word vector space as a new semantic feature and input to the bag-of-words model.
However, for each new batch of training corpus, the T-statistics of its Bigrams must be recounted to form a Bigram vocabulary before training of the bag-of-words model can start, which causes a large training overhead. Moreover, embedding only compound words obtained by binding two words into the word vector space as new semantic features limits the effect of the bag-of-words model.
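The Bigram ranking described for the related art can be illustrated with a short sketch. The text only names "T-statistics"; the formula below is the common collocation t-score (observed pair probability minus the probability expected under independence, divided by an approximate standard error), which is an assumption here and not a formula given in this document. The function name `bigram_t_scores` is hypothetical.

```python
import math
from collections import Counter

def bigram_t_scores(sentences):
    """Return {(w1, w2): t-score} for every adjacent word pair.

    Assumed formula (not taken from the patent text):
        t = (observed - expected) / sqrt(observed / n)
    where observed = count(w1 w2) / n and expected = P(w1) * P(w2).
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))  # adjacent pairs only
    n = sum(unigrams.values())  # total number of tokens in the corpus
    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        observed = pair_count / n
        expected = (unigrams[w1] / n) * (unigrams[w2] / n)
        # The variance of the pair indicator is approximated by its mean.
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed / n)
    return scores

corpus = [["new", "york", "is", "big"], ["new", "york", "city"]]
scores = bigram_t_scores(corpus)
best_pair = max(scores, key=scores.get)
```

Because the counts depend on the whole corpus, every new batch of training data changes the scores, which is the recounting overhead the background section refers to.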
Disclosure of Invention
The present invention has been made to solve at least one of the technical problems of the related art to some extent.
Therefore, a first objective of the present invention is to provide a compound word processing method for semantic mining that solves the following problems in the prior art: the corpus training cost is high; improving the effect of the bag-of-words model requires introducing more two-word bound words, which affects memory performance; and only compound words obtained by binding two words are embedded into the word vector space as new semantic features and input to the bag-of-words model, which limits the effect of the bag-of-words model.
The second purpose of the invention is to provide a compound word processing device for semantic mining.
A third object of the invention is to propose a computer device.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the invention is to propose a computer program product.
In order to achieve the above object, a first embodiment of the present invention provides a compound word processing method for semantic mining, where the method includes the following steps: determining M word segments of each sentence in a training corpus; selecting N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M; performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1; and screening out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words, and inputting the N-dimensional target compound words to a bag-of-words model for semantic mining.
The compound word processing method for semantic mining of this embodiment determines M word segments of each sentence in a training corpus, selects N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, performs K hash operations on the character string of the N-dimensional compound word, queries a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, generates a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, screens out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words, and inputs the N-dimensional target compound words to a bag-of-words model for semantic mining. Therefore, training can be performed directly on the corpus, which reduces the corpus training cost; N-dimensional target compound words can be obtained and input to the bag-of-words model for semantic mining, so that more semantic features of larger granularity are introduced into the bag-of-words model without affecting memory performance, further improving the effect of the bag-of-words model.
In order to achieve the above object, a second embodiment of the present invention provides a compound word processing apparatus for semantic mining, the apparatus including: a determining module, used for determining M word segments of each sentence in a training corpus; a first generation module, used for selecting N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M; a first processing module, used for performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1; a screening module, used for screening out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words; and a mining module, used for inputting the N-dimensional target compound words to a bag-of-words model for semantic mining.
The compound word processing apparatus for semantic mining of this embodiment determines M word segments of each sentence in a training corpus, selects N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, performs K hash operations on the character string of the N-dimensional compound word, queries a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, generates a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, screens out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words, and inputs the N-dimensional target compound words to a bag-of-words model for semantic mining. Therefore, training can be performed directly on the corpus, which reduces the corpus training cost; N-dimensional target compound words can be obtained and input to the bag-of-words model for semantic mining, so that more semantic features of larger granularity are introduced into the bag-of-words model without affecting memory performance, further improving the effect of the bag-of-words model.
To achieve the above object, a third aspect of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements a compound word processing method for semantic mining, the method including: determining M word segments of each sentence in a training corpus; selecting N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M; performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1; and screening out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words, and inputting the N-dimensional target compound words to a bag-of-words model for semantic mining.
To achieve the above object, a fourth aspect of the present invention provides a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor, enable execution of a compound word processing method for semantic mining, the method including: determining M word segments of each sentence in a training corpus; selecting N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M; performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1; and screening out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words, and inputting the N-dimensional target compound words to a bag-of-words model for semantic mining.
To achieve the above object, a fifth aspect of the present invention provides a computer program product, where instructions in the computer program product, when executed by a processor, perform a compound word processing method for semantic mining, the method including: determining M word segments of each sentence in a training corpus; selecting N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M; performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1; and screening out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words, and inputting the N-dimensional target compound words to a bag-of-words model for semantic mining.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow diagram of a compound word processing method for semantic mining according to one embodiment of the invention;
FIG. 2 is an exemplary diagram of a hierarchical feature extraction approach according to one embodiment of the present invention;
FIG. 3 is an exemplary diagram of a random hash dictionary space, according to one embodiment of the present invention;
FIG. 4 is a flow diagram of a compound word processing method for semantic mining according to another embodiment of the present invention;
FIG. 5 is an exemplary diagram of a random hash dictionary that may coexist with an original word vector dictionary in accordance with one embodiment of the invention;
FIG. 6 is an exemplary diagram of an application customization layer in a compound word processing method for semantic mining in accordance with one embodiment of the present invention;
FIG. 7 is an exemplary diagram of a linear regression model screening, according to one embodiment of the invention;
FIG. 8 is a schematic structural diagram of a compound word processing apparatus for semantic mining according to another embodiment of the present invention; and
FIG. 9 is a schematic diagram of a computer device according to one embodiment of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
The following describes a compound word processing method, device and equipment for semantic mining according to an embodiment of the present invention with reference to the accompanying drawings.
The embodiment of the invention provides a compound word processing method for semantic mining that extends the statistical approach of Bigram features to any N adjacent words and combines them into Ngram phrases. The newly generated compound words do not require statistics of word frequency or other parameters and are introduced directly into training as text features, which reduces the corpus training cost. N-dimensional target compound words can be obtained and input to a bag-of-words model for semantic mining, so that more semantic features of larger granularity are introduced into the bag-of-words model without affecting memory performance, further improving the effect of the bag-of-words model. The specific steps are as follows:
FIG. 1 is a flow diagram of a compound word processing method for semantic mining according to one embodiment of the invention. As shown in fig. 1, the compound word processing method for semantic mining includes:
Step 101, determining M word segments of each sentence in a corpus.
Step 102, selecting N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M.
In practical applications, the corpus contains many sentences, and the M word segments of each sentence must be determined first. For example, the sentence "中华人民共和国" (the People's Republic of China) is determined to have 7 word segments: "中", "华", "人", "民", "共", "和" and "国".
N word segments can then be selected according to the order in which the M word segments appear to generate an N-dimensional compound word. Taking the sentence "中华人民共和国" as an example, the 7 word segments appear in the order "中", "华", "人", "民", "共", "和", "国". The 2 word segments "中" and "华" can be selected to generate a 2-dimensional compound word; the 4 word segments "中", "华", "人" and "民" can be selected to generate a 4-dimensional compound word; and the 5 word segments "人", "民", "共", "和" and "国" can be selected to generate a 5-dimensional compound word. That is, two or more word segments can be selected to generate a multidimensional compound word according to actual application requirements.
It is understood that combining the i-th and (i+1)-th adjacent word segments yields a unique compound word representation.
To make the specific process of generating the N-dimensional compound word clearer to those skilled in the art, it is described below with reference to fig. 2:
As shown in fig. 2, for the 6 word segments "A", "B", "C", "D", "E" and "F", N word segments may be selected according to the order in which the M word segments appear to generate an N-dimensional compound word. For example, the adjacent word segments "A", "B" and "C" may be combined to form "ABC", the adjacent word segments "B", "C" and "D" may be combined to form "BCD", and so on, yielding three-dimensional compound word representations.
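The hierarchical extraction of fig. 2 can be sketched as a sliding window over the word segments. This is a minimal illustration, not the patented implementation; the function names `compound_words` and `hierarchical_compounds` and the `separator` parameter are hypothetical.

```python
def compound_words(segments, n, separator=""):
    """Return every run of n adjacent segments, in order of appearance."""
    if not 2 <= n <= len(segments):
        raise ValueError("n must satisfy 2 <= n <= len(segments)")
    return [separator.join(segments[i:i + n])
            for i in range(len(segments) - n + 1)]

def hierarchical_compounds(segments, max_n, separator=""):
    """Map each dimension n (2..max_n) to its list of compound words."""
    return {n: compound_words(segments, n, separator)
            for n in range(2, max_n + 1)}

segments = ["A", "B", "C", "D", "E", "F"]
three_dimensional = compound_words(segments, 3)
layers = hierarchical_compounds(segments, 3)
```

For the six segments of fig. 2, `three_dimensional` contains "ABC", "BCD", "CDE" and "DEF". Passing `separator="-"` over term IDs would give strings of the "1-2-3" form.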
Step 103, performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1.
The random hash dictionary space is used to store the newly generated semantic fragments, which effectively solves the problem of vocabulary explosion. The implementation of the random hash dictionary space is shown in fig. 3.
Specifically, the unique feature expression of each layer of compound words is obtained through the hierarchical feature extraction shown in fig. 2, and is generally a termID character string of the form "1-2-3" in fig. 3. Performing a hash operation on this character string yields the unique position in the random hash dictionary that corresponds to the hash operation result, and the value at that position is taken as one component of the K-dimensional word vector of the N-dimensional compound word. The process is repeated until all components of the word vector of the N-dimensional compound word are obtained. The vocabulary size of the random hash dictionary space is independent of the number of newly added semantic features and can be configured freely according to online requirements.
More specifically, each hash operation on the character string of the N-dimensional compound word yields a position uniquely corresponding to the hash operation result, and the floating point number at that position contributes to the word vector of the N-dimensional compound word. Performing K hash operations yields K positions uniquely corresponding to the K hash operation results, so that a K-dimensional word vector of the N-dimensional compound word can be generated from the floating point numbers at the K positions, where K is an integer greater than 1.
It should be noted that extracting K consecutive floating point numbers from each position obtained by the hash operation to serve as the K-dimensional word vector of the N-dimensional compound word reduces the number of hash operations and improves computational performance without affecting precision. The hash function must be chosen so that the mapping is both random and reproducible.
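The lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not name a hash function, so salting SHA-256 with the hash index `k` is an assumed way of obtaining K different, reproducible hash results; the class name `RandomHashDictionary`, the table initialisation range and the wrap-around in `consecutive_vector` are likewise hypothetical.

```python
import hashlib
import random

class RandomHashDictionary:
    """Fixed-size table of floats addressed by hashing a compound word."""

    def __init__(self, size, seed=0):
        rng = random.Random(seed)
        # Table size is chosen up front and does not grow with new features.
        self.table = [rng.uniform(-1.0, 1.0) for _ in range(size)]

    def _position(self, text, k):
        # Assumption: the k-th hash is SHA-256 of the string salted with k.
        digest = hashlib.sha256(f"{k}:{text}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.table)

    def vector(self, text, k_dims):
        """K hash operations, one float per hashed position."""
        return [self.table[self._position(text, k)] for k in range(k_dims)]

    def consecutive_vector(self, text, k_dims):
        """One hash operation, then K consecutive floats (wrapping around)."""
        start = self._position(text, 0)
        return [self.table[(start + i) % len(self.table)]
                for i in range(k_dims)]

dictionary = RandomHashDictionary(size=1000, seed=42)
word_vector = dictionary.vector("1-2-3", 8)
fast_vector = dictionary.consecutive_vector("1-2-3", 8)
```

Two different compound words can land on the same position in this sketch, since the table is smaller than the space of possible strings; the position is unique per hash result, not per word.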
Step 104, screening out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words, and inputting the N-dimensional target compound words to the bag-of-words model for semantic mining.
Specifically, the bag-of-words model may project vectors of different text features to a low-dimensional space in a simple manner such as summation, and then perform semantic similarity matching. In order to further improve the effect of the bag-of-words model, N-dimensional target compound words meeting preset conditions need to be screened out, so that the N-dimensional target compound words are input to the bag-of-words model for semantic mining.
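The summation step can be sketched as below. The text says only that vectors are combined "in a simple manner such as summation" before semantic similarity matching; the use of cosine similarity for the matching step is an assumption, and both function names are hypothetical.

```python
import math

def bag_of_words_vector(word_vectors):
    """Sum the feature vectors of a text, dimension by dimension."""
    dims = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) for d in range(dims)]

def cosine_similarity(a, b):
    """Assumed matching measure; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = bag_of_words_vector([[1.0, 2.0], [3.0, 4.0]])
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because summation ignores word order, adding compound-word vectors to the sum is what lets the model see order-dependent fragments of larger granularity.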
As a possible implementation manner, the K-dimensional word vector of each N-dimensional compound word is input into a preset linear regression model to obtain a weight representing the importance degree of each N-dimensional compound word; the K-dimensional weighted word vector of each N-dimensional compound word is obtained according to its K-dimensional word vector and the corresponding weight; and N-dimensional target compound words that meet a preset condition are screened out according to the K-dimensional weighted word vectors of all the N-dimensional compound words.
The K-dimensional weighted word vector of each N-dimensional compound word may be obtained by calculating the product of the K-dimensional word vector of that compound word and the corresponding weight.
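The weighting and screening can be sketched as follows. The text does not state the preset condition or how the linear model is trained; keeping compound words whose weight reaches a threshold is an assumed condition, and the coefficients are taken as given. All names are hypothetical.

```python
def importance_weight(vector, coefficients, bias=0.0):
    """Linear model output used as the importance weight of a compound word."""
    return sum(c * x for c, x in zip(coefficients, vector)) + bias

def weighted_vector(vector, weight):
    """Product of the K-dimensional word vector and its scalar weight."""
    return [weight * x for x in vector]

def select_target_compounds(vectors, coefficients, threshold, bias=0.0):
    """Keep compound words whose weight is at least `threshold` (assumed rule)."""
    selected = {}
    for word, vec in vectors.items():
        weight = importance_weight(vec, coefficients, bias)
        if weight >= threshold:
            selected[word] = weighted_vector(vec, weight)
    return selected

vectors = {"AB": [1.0, 0.0], "BC": [0.0, 1.0]}
targets = select_target_compounds(vectors, coefficients=[2.0, 0.5], threshold=1.0)
```

Here "AB" receives weight 2.0 and is kept with the weighted vector [2.0, 0.0], while "BC" receives weight 0.5 and is dropped.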
As another possible implementation manner, the K-dimensional word vector of each N-dimensional compound word is input into a preset algorithm for processing, so as to directly obtain an N-dimensional target compound word satisfying a preset condition.
Therefore, each word in each sentence is mapped to a low-dimensional word vector space as a feature vector, with each vector representing one word; adding representations of language segments of larger granularity on this basis significantly improves the effect of the bag-of-words model.
Further, the N-dimensional target compound words are input to the bag-of-words model for semantic mining. As an example, semantic detection is performed on a text by applying the N-dimensional target compound words in the bag-of-words model, and compound words that do not match the text semantics are filtered out according to the detection result.
In summary, the compound word processing method for semantic mining according to the embodiment of the present invention determines M word segments of each sentence in a training corpus, selects N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, performs K hash operations on the character string of the N-dimensional compound word, queries a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, generates a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, and finally screens out N-dimensional target compound words that meet a preset condition according to the K-dimensional word vectors of all the N-dimensional compound words and inputs them to a bag-of-words model for semantic mining. Therefore, training can be performed directly on the corpus, which reduces the corpus training cost; N-dimensional target compound words can be obtained and input to the bag-of-words model for semantic mining, so that more semantic features of larger granularity are introduced into the bag-of-words model without affecting memory performance, further improving the effect of the bag-of-words model.
Based on the above embodiments, it can be appreciated that a random hash dictionary can coexist with the original word vector dictionary, further expanding the representation of semantic fragments while preserving bigrams. The following is specifically described with reference to fig. 4:
FIG. 4 is a flowchart illustrating a compound word processing method for semantic mining according to another embodiment of the present invention. The model to which the method of this embodiment is applied is shown in fig. 5, where a random hash dictionary and the original word vector dictionary coexist, and the representation of semantic fragments is further expanded while bigrams are retained. As shown in fig. 4, the compound word processing method for semantic mining includes:
Step 201, determining M word segments of each sentence in the corpus.
It should be noted that step 201 corresponds to step 101 above; refer to the description of step 101, which is not repeated here.
Step 202, selecting 2 word segments according to the order in which the M word segments appear to generate a two-dimensional compound word.
Step 203, performing a calculation on the character string of the two-dimensional compound word to obtain a calculation result, querying the original word vector dictionary space to obtain the unique position corresponding to the calculation result, and generating a K-dimensional word vector of the two-dimensional compound word from the numbers at that position, where K is an integer greater than 1.
Specifically, 2 word segments can be selected according to the order in which the M word segments appear to generate a two-dimensional compound word. Taking the sentence "中华人民共和国" (the People's Republic of China) as an example, the 7 word segments appear in the order "中", "华", "人", "民", "共", "和", "国". The 2 word segments "中" and "华" can be selected to generate a two-dimensional compound word; the 2 word segments "人" and "民" can be selected to generate a two-dimensional compound word; and the 2 word segments "共" and "和" can also be selected to generate a two-dimensional compound word. That is, two word segments can be selected to generate a two-dimensional compound word according to actual application requirements.
Step 204, selecting N word segments according to the order in which the M word segments appear to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M.
Step 205, performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1.
It should be noted that steps 204 and 205 correspond to steps 102 and 103 above; refer to the descriptions of steps 102 and 103, which are not repeated here.
Step 206, summing the K-dimensional word vectors of the two-dimensional compound words and the K-dimensional weighted word vectors of all the N-dimensional compound words, and screening out N-dimensional target compound words that meet a preset condition according to the summation result.
Step 207, performing semantic detection on the text by applying the N-dimensional target compound words in the bag-of-words model.
Step 208, filtering out compound words that do not match the text semantics according to the detection result.
Specifically, the word vectors can be screened through a Logistic Regression model, as shown in fig. 6. The K-dimensional word vectors of all the N-dimensional compound words are input into the Logistic Regression model to obtain a score representing the importance degree of each word, and each original word vector is multiplied by its score, used as a weight, to obtain a feature vector weighted according to the importance degree. Finally, the weighted feature vectors and the K-dimensional word vectors of the two-dimensional compound words are summed, and the N-dimensional target compound words meeting the preset conditions are screened out according to the summation result.
Further, semantic detection is performed on the text by applying the N-dimensional target compound words in the bag-of-words model, and compound words that do not conform to the text semantics are screened out according to the detection result.
Therefore, the newly added model structure can coexist with the original structure, and the model performance is further improved.
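The weighting and summation described above might look like the following sketch. The coefficients `coef` and `bias` stand in for a trained Logistic Regression model and are hypothetical, as are the function names; the preset screening condition is not specified in the text, so it is not modeled here.

```python
import math


def importance(vector, coef, bias):
    """Logistic-regression score in (0, 1), used as the importance weight."""
    z = sum(c * x for c, x in zip(coef, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(ngram_vectors, bigram_vectors, coef, bias):
    """Add the bigram vectors and the importance-weighted N-gram vectors."""
    k = len(coef)
    total = [0.0] * k
    for v in bigram_vectors:          # two-dimensional compound words, unweighted
        for i in range(k):
            total[i] += v[i]
    for v in ngram_vectors:           # N-dimensional compound words, weighted
        w = importance(v, coef, bias)
        for i in range(k):
            total[i] += w * v[i]
    return total


# Toy values (assumptions): zero coefficients give every compound a weight of 0.5.
result = weighted_sum([[2.0, 4.0]], [[1.0, 1.0]], coef=[0.0, 0.0], bias=0.0)
print(result)
```

Because the two-dimensional vectors enter the sum unchanged, the weighted branch can be added alongside an existing bigram feature path, which is one way to read the statement that the new structure coexists with the original one.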
In order to implement the above embodiments, the present invention further provides a compound word processing apparatus for semantic mining, and fig. 7 is a schematic structural diagram of a compound word processing apparatus for semantic mining according to an embodiment of the present invention. As shown in fig. 7, the compound word processing apparatus for semantic mining includes: a determination module 11, a first generation module 12, a first processing module 13, a screening module 14 and a mining module 15.
The determining module 11 is configured to determine M participles of each sentence in the corpus.
The first generating module 12 is configured to select N participles according to an appearance order of the M participles to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M.
The first processing module 13 is configured to perform hash operation on the character string of the N-dimensional compound word for K times, query a pre-established random hash dictionary space to obtain a unique corresponding position to each hash operation result, and generate a K-dimensional word vector of the N-dimensional compound word according to floating point numbers of K positions corresponding to the K-time hash operation results, where K is an integer greater than 1.
The screening module 14 is configured to screen out N-dimensional target compound words meeting preset conditions according to the K-dimensional word vectors of all the N-dimensional compound words.
The mining module 15 is configured to input the N-dimensional target compound words into the bag-of-words model for semantic mining.
In an embodiment of the present invention, the screening module 14 is specifically configured to: inputting the K-dimensional word vector of each N-dimensional compound word into a preset linear regression model, and acquiring a weight representing the importance degree of each N-dimensional compound word; acquiring a K-dimensional weighted word vector of each N-dimensional compound word according to the K-dimensional word vector and the corresponding weight of each N-dimensional compound word; and screening out the N-dimensional target compound words meeting the preset conditions according to the K-dimensional weighted word vectors of all the N-dimensional compound words.
Obtaining the K-dimensional weighted word vector of each N-dimensional compound word according to the K-dimensional word vector and the corresponding weight of each N-dimensional compound word comprises: calculating the product of the K-dimensional word vector of each N-dimensional compound word and the corresponding weight to obtain the K-dimensional weighted word vector of each N-dimensional compound word.
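In symbols, the weighting above can be written as follows. The notation is introduced here for illustration only and does not appear in the original: $\mathbf{v}_i$ is the K-dimensional word vector of the i-th N-dimensional compound word, $g$ is the preset regression model, $w_i$ is the resulting weight, and $\tilde{\mathbf{v}}_i$ is the K-dimensional weighted word vector.

```latex
w_i = g(\mathbf{v}_i), \qquad
\tilde{\mathbf{v}}_i = w_i \, \mathbf{v}_i, \qquad
\mathbf{v}_i,\ \tilde{\mathbf{v}}_i \in \mathbb{R}^{K}
```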
In an embodiment of the present invention, the mining module 15 is specifically configured to: perform semantic detection on the text by applying the N-dimensional target compound words in the bag-of-words model; and screen out compound words that do not conform to the text semantics according to the detection result.
It should be noted that the foregoing explanation on the embodiment of the compound word processing method for semantic mining is also applicable to the compound word processing apparatus for semantic mining in this embodiment, and is not repeated here.
In summary, the compound word processing apparatus for semantic mining according to the embodiment of the present invention determines M participles of each sentence in a training corpus, and selects N participles according to the appearance order of the M participles to generate an N-dimensional compound word. It then performs K hash operations on the character string of the N-dimensional compound word, queries a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generates a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results. Finally, it screens out the N-dimensional target compound words meeting preset conditions according to the K-dimensional word vectors of all the N-dimensional compound words, and inputs the N-dimensional target compound words into the bag-of-words model for semantic mining. Therefore, training can be performed directly on the corpus, which reduces the corpus training cost, and the N-dimensional target compound words can be obtained and input into the bag-of-words model for semantic mining, so that more semantic features of larger granularity are introduced into the bag-of-words model without affecting memory performance, further improving the effect of the bag-of-words model.
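The memory argument can be made concrete with a back-of-the-envelope sketch. It assumes contiguous compound words, a dictionary space of fixed size shared by all compound words, and 4-byte floats; every number below is illustrative and not taken from the original.

```python
def contiguous_compound_count(m):
    """Number of contiguous spans of N participles with 2 <= N <= M in one sentence."""
    return m * (m - 1) // 2


def shared_table_bytes(table_size, bytes_per_float=4):
    """Memory of one shared random hash dictionary space of floats."""
    return table_size * bytes_per_float


def explicit_table_bytes(vocab_size, k, bytes_per_float=4):
    """Memory if every distinct compound word stored its own K-dimensional vector."""
    return vocab_size * k * bytes_per_float


# A 7-participle sentence already yields 21 candidate compound words.
print(contiguous_compound_count(7))

# Hypothetical sizes: 2**20 table slots versus 10 million compounds with K = 64.
print(shared_table_bytes(1 << 20))
print(explicit_table_bytes(10_000_000, 64))
```

Under these assumptions the shared table stays the same size however many distinct compound words the corpus produces, whereas a per-word table grows with the vocabulary.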
Fig. 8 is a schematic structural diagram of a compound word processing apparatus for semantic mining according to another embodiment of the present invention. As shown in fig. 8, on the basis of fig. 7, the apparatus further includes: a second generation module 16 and a second processing module 17.
The second generating module 16 is configured to select 2 participles according to the occurrence order of the M participles to generate a two-dimensional compound word.
The second processing module 17 is configured to calculate a character string of the two-dimensional compound word to obtain a calculation result, query the original word vector dictionary space, obtain a unique position corresponding to the calculation result, and generate a K-dimensional word vector of the two-dimensional compound word by applying a number corresponding to the position, where K is an integer greater than 1.
The screening module 14 is further specifically configured to: add the K-dimensional word vectors of the two-dimensional compound words and the K-dimensional weighted word vectors of all the N-dimensional compound words, and screen out the N-dimensional target compound words meeting the preset conditions according to the addition result.
Therefore, the newly added model structure can coexist with the original structure, and the model performance is further improved.
The invention further provides a computer device, and fig. 9 is a schematic structural diagram of the computer device according to an embodiment of the invention. As shown in fig. 9, the computer device includes a memory 21, a processor 22, and a computer program stored on the memory 21 and executable on the processor 22.
The processor 22, when executing the program, implements the compound word processing method for semantic mining provided in the above-described embodiments.
Further, the computer device further comprises:
a communication interface 23 for communication between the memory 21 and the processor 22.
A memory 21 for storing a computer program operable on the processor 22.
The memory 21 may comprise a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
And the processor 22 is configured to implement the compound word processing method for semantic mining according to the foregoing embodiment when executing the program.
If the memory 21, the processor 22 and the communication interface 23 are implemented independently, the communication interface 23, the memory 21 and the processor 22 may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 21, the processor 22 and the communication interface 23 are integrated on a chip, the memory 21, the processor 22 and the communication interface 23 may complete mutual communication through an internal interface.
The processor 22 may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention.
To achieve the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium, in which instructions, when executed by a processor, enable execution of a compound word processing method for semantic mining, the method comprising: determining M participles of each sentence in a training corpus; selecting N participles according to the appearance order of the M participles to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M; performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1; and screening out N-dimensional target compound words meeting preset conditions according to the K-dimensional word vectors of all the N-dimensional compound words, and inputting the N-dimensional target compound words into the bag-of-words model for semantic mining.
To achieve the above embodiments, the present invention further provides a computer program product, in which instructions, when executed by a processor, perform a compound word processing method for semantic mining, the method comprising: determining M participles of each sentence in a training corpus; selecting N participles according to the appearance order of the M participles to generate an N-dimensional compound word, where M is greater than or equal to 2, and N is greater than or equal to 2 and less than or equal to M; performing K hash operations on the character string of the N-dimensional compound word, querying a pre-established random hash dictionary space to obtain the position uniquely corresponding to each hash operation result, and generating a K-dimensional word vector of the N-dimensional compound word according to the floating point numbers at the K positions corresponding to the K hash operation results, where K is an integer greater than 1; and screening out N-dimensional target compound words meeting preset conditions according to the K-dimensional word vectors of all the N-dimensional compound words, and inputting the N-dimensional target compound words into the bag-of-words model for semantic mining.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (9)

1. A compound word processing method for semantic mining is characterized by comprising the following steps:
determining M participles of each sentence in a training corpus;
selecting N participles according to the appearance sequence of the M participles to generate an N-dimensional compound word, wherein M is more than or equal to 2, and N is more than or equal to 2 and less than or equal to M;
performing Hash operation on the character string of the N-dimensional compound word for K times, inquiring a pre-established random Hash dictionary space to obtain a position uniquely corresponding to the Hash operation result of each time, and generating a K-dimensional word vector of the N-dimensional compound word according to floating point numbers of K positions corresponding to the Hash operation results of the K times, wherein K is an integer greater than 1;
screening out N-dimensional target compound words meeting preset conditions according to K-dimensional word vectors of all the N-dimensional compound words, and inputting the N-dimensional target compound words to a word bag model for semantic mining;
the method for screening the N-dimensional target compound words meeting the preset conditions according to the K-dimensional word vectors of all the N-dimensional compound words comprises the following steps:
and inputting the K-dimensional word vector of each N-dimensional compound word into a preset algorithm for processing to obtain the N-dimensional target compound word meeting the preset condition.
2. The method of claim 1, wherein the screening out N-dimensional target compound words satisfying a preset condition according to K-dimensional word vectors of all N-dimensional compound words comprises:
inputting the K-dimensional word vector of each N-dimensional compound word into a preset linear regression model, and acquiring a weight representing the importance degree of each N-dimensional compound word;
acquiring a K-dimensional weighted word vector of each N-dimensional compound word according to the K-dimensional word vector and the corresponding weight of each N-dimensional compound word;
and screening out the N-dimensional target compound words meeting the preset conditions according to the K-dimensional weighted word vectors of all the N-dimensional compound words.
3. The method of claim 2, wherein obtaining a K-dimensional weighted word vector for each N-dimensional compound word from the K-dimensional word vector and corresponding weight for each N-dimensional compound word comprises:
and calculating the product of the K-dimensional word vector of each N-dimensional compound word and the corresponding weight to obtain the K-dimensional weighted word vector of each N-dimensional compound word.
4. The method of claim 2, wherein after said determining M participles of each sentence in the training corpus, further comprising:
selecting 2 participles according to the appearance sequence of the M participles to generate a two-dimensional compound word;
calculating the character string of the two-dimensional compound word to obtain a calculation result, inquiring an original word vector dictionary space, obtaining a unique position corresponding to the calculation result, and generating a K-dimensional word vector of the two-dimensional compound word by applying a number corresponding to the position, wherein K is an integer greater than 1;
the method for screening out the N-dimensional target compound words meeting the preset conditions according to the K-dimensional weighted word vectors of all the N-dimensional compound words comprises the following steps:
adding the K-dimensional word vectors of the two-dimensional compound words and the K-dimensional weighted word vectors of all the N-dimensional compound words, and screening out N-dimensional target compound words meeting preset conditions according to the addition result;
the method for screening the N-dimensional target compound words meeting the preset conditions according to the K-dimensional word vectors of all the N-dimensional compound words comprises the following steps:
and inputting the K-dimensional word vector of each N-dimensional compound word into a preset algorithm for processing to obtain the N-dimensional target compound word meeting the preset condition.
5. The method of any one of claims 1-4, wherein said entering said N-dimensional target compound word into a bag of words model for semantic mining comprises:
performing semantic detection on a text by applying the N-dimensional target compound words in the bag-of-words model;
and screening out compound words which do not meet the text semantics according to the detection result.
6. A compound word processing apparatus for semantic mining, comprising:
the determining module is used for determining M participles of each sentence in the training corpus;
the first generation module is used for selecting N participles according to the occurrence sequence of the M participles to generate an N-dimensional compound word, wherein M is more than or equal to 2, and N is more than or equal to 2 and less than or equal to M;
the first processing module is used for performing Hash operation on the character string of the N-dimensional compound word for K times, inquiring a pre-established random Hash dictionary space to obtain a position uniquely corresponding to the Hash operation result of each time, and generating a K-dimensional word vector of the N-dimensional compound word according to floating point numbers of K positions corresponding to the Hash operation results of the K times, wherein K is an integer greater than 1;
the screening module is used for screening out N-dimensional target compound words meeting preset conditions according to the K-dimensional word vectors of all the N-dimensional compound words;
the mining module is used for inputting the N-dimensional target compound words into a word bag model for semantic mining;
the method for screening the N-dimensional target compound words meeting the preset conditions according to the K-dimensional word vectors of all the N-dimensional compound words comprises the following steps:
and inputting the K-dimensional word vector of each N-dimensional compound word into a preset algorithm for processing to obtain the N-dimensional target compound word meeting the preset condition.
7. The apparatus of claim 6, further comprising:
the second generation module is used for selecting 2 participles according to the appearance sequence of the M participles to generate a two-dimensional compound word;
the second processing module is used for calculating the character string of the two-dimensional compound word to obtain a calculation result, inquiring an original word vector dictionary space, obtaining a unique position corresponding to the calculation result, and generating a K-dimensional word vector of the two-dimensional compound word by applying a number corresponding to the position, wherein K is an integer greater than 1;
the screening module is specifically configured to: and adding the K-dimensional word vectors of the two-dimensional compound words and the K-dimensional weighted word vectors of all the N-dimensional compound words, and screening out the N-dimensional target compound words meeting preset conditions according to the addition result.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a compound word processing method for semantic mining according to any one of claims 1-5 when executing the program.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the compound word processing method for semantic mining according to any one of claims 1 to 5.
CN201711163429.2A 2017-11-21 2017-11-21 Compound word processing method, device and equipment for semantic mining Active CN107894979B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711163429.2A CN107894979B (en) 2017-11-21 2017-11-21 Compound word processing method, device and equipment for semantic mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711163429.2A CN107894979B (en) 2017-11-21 2017-11-21 Compound word processing method, device and equipment for semantic mining

Publications (2)

Publication Number Publication Date
CN107894979A CN107894979A (en) 2018-04-10
CN107894979B true CN107894979B (en) 2021-09-17

Family

ID=61805758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711163429.2A Active CN107894979B (en) 2017-11-21 2017-11-21 Compound word processing method, device and equipment for semantic mining

Country Status (1)

Country Link
CN (1) CN107894979B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569498B (en) * 2018-12-26 2022-12-09 东软集团股份有限公司 Compound word recognition method and related device
CN110059183B (en) * 2019-03-22 2022-08-23 重庆邮电大学 Automobile industry user viewpoint emotion classification method based on big data
CN110457692B (en) * 2019-07-26 2021-02-26 清华大学 Method and device for learning compound word representation
CN114548115B (en) * 2022-02-23 2023-01-06 北京三快在线科技有限公司 Method and device for explaining compound nouns and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007219778A (en) * 2006-02-16 2007-08-30 Murata Mach Ltd Document processor
CN103646080A (en) * 2013-12-12 2014-03-19 北京京东尚科信息技术有限公司 Microblog duplication-eliminating method and system based on reverse-order index
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
CN105843960A (en) * 2016-04-18 2016-08-10 上海泥娃通信科技有限公司 Semantic tree based indexing method and system
CN107193802A (en) * 2017-05-25 2017-09-22 上海耐相智能科技有限公司 A kind of smart field concept auto acquisition system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8046212B1 (en) * 2003-10-31 2011-10-25 Access Innovations Identification of chemical names in text-containing documents
JP4236057B2 (en) * 2006-03-24 2009-03-11 インターナショナル・ビジネス・マシーンズ・コーポレーション A system to extract new compound words
KR101744861B1 (en) * 2010-02-12 2017-06-08 구글 인코포레이티드 Compound splitting
CN102200984A (en) * 2010-03-24 2011-09-28 深圳市腾讯计算机系统有限公司 Search method based on compound words and search engine server
US10210246B2 (en) * 2014-09-26 2019-02-19 Oracle International Corporation Techniques for similarity analysis and data enrichment using knowledge sources

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Functional Hashing for Compressing Neural Networks;Lei Shi等;《https://arxiv.org/pdf/1605.06560.pdf》;20160520;1-10 *
Hash Embeddings for Efficient Word Representations;Dan Svenstrup等;《https://arxiv.org/abs/1709.03933?context=cs.CL》;20170912;1-9 *
基于位置标签与词性结合的组合词抽取方法;欧阳柳波等;《计算机应用研究》;20150929;第33卷(第4期);1062-1065 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant