CN107423284B - Method and system for constructing sentence representation fusing internal structure information of Chinese words - Google Patents
Method and system for constructing sentence representation fusing internal structure information of Chinese words
- Publication number: CN107423284B (application CN201710449875.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- training
- corpus
- vector
- vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
Abstract
The invention relates to the technical field of natural language processing, and in particular to a method and a system for constructing sentence representations that fuse the internal structural information of Chinese words, aiming at solving the problem of low utilization of this internal structural information. The construction method comprises the following steps: performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a plurality of word corpora; pre-training each word corpus to obtain pre-trained character vectors and pre-trained word vectors; integrating all the pre-trained character vectors and the pre-trained word vector in each word corpus to obtain a combined word vector corresponding to that word corpus; determining a final word vector of each word corpus according to the pre-trained word vector and the combined word vector, wherein the final word vector represents the internal structural information of the word; and integrating the final word vectors of the word corpora in the sentence to be processed to obtain the representation vector of that sentence. The invention improves the utilization of the internal structural information of words.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method and a system for constructing sentence representations that fuse the internal structural information of Chinese words.
Background
Sentence representation maps a natural language sentence into a high-dimensional space such that semantically similar sentences lie closer together in that space. It is a fundamental task of natural language processing and directly affects the performance of the entire language processing system. Much effort has therefore been devoted to designing sentence representation methods suited to specific tasks in order to improve the performance of language processing systems.
Traditional sentence representation methods use large numbers of manually designed features to represent the meaning of a sentence, and have achieved good results in various natural language processing tasks. However, this approach requires considerable manpower and expert knowledge, and features often need to be selected separately for each task, leading to poor model generalization and difficult feature engineering. In recent years it has been found that neural-network-based models can automatically extract the semantic features of sentences from large-scale text, greatly improving the quality of sentence semantic representation.
However, most sentence representation research targets English sentences, designing different neural network structures at the word granularity to encode sentence semantics. Unlike English, Chinese words are composed of characters, and these characters carry rich semantic information that reflects the meaning of the word. Researchers have noticed this and have improved word vector learning by using the characters within Chinese words, but these methods do not fully exploit the internal information of Chinese words, such as the relationships between characters, and they are limited to the task of word vector learning rather than being explored for sentence representation. How to fully exploit the internal structural information of words to learn a better sentence representation model is therefore a topic worth studying.
Disclosure of Invention
To solve the problem in the prior art, namely the low utilization of the internal structural information of words, the invention provides a method and a system for constructing sentence representations that fuse the internal structural information of Chinese words.
In order to solve the technical problems, the invention provides the following scheme:
A method for constructing sentence representations fusing the internal structural information of Chinese words comprises the following steps:
performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a plurality of word corpora;
pre-training each word corpus to obtain pre-trained character vectors and pre-trained word vectors;
integrating all the pre-trained character vectors and the pre-trained word vector in each word corpus to obtain a combined word vector corresponding to that word corpus;
determining a final word vector of each word corpus according to the pre-trained word vector and the combined word vector, wherein the final word vector represents the internal structural information of the word;
and integrating the final word vectors of the word corpora in the sentence to be processed to obtain the representation vector of the sentence to be processed.
Optionally, pre-training each word corpus specifically comprises:
splitting each word corpus into characters to obtain a character corpus;
splicing the character corpus and the word corpus so that character vectors and word vectors can be obtained;
and pre-training the character vectors and the word vectors using an open-source model to obtain the corresponding pre-trained character vectors and pre-trained word vectors.
Optionally, integrating all the pre-trained character vectors and the pre-trained word vector in each word corpus specifically comprises:
splicing each pre-trained character vector of the word corpus with its pre-trained word vector to obtain the spliced vector corresponding to that character vector;
inputting the spliced vector into a feedforward neural network and applying a nonlinear transformation to obtain the mask vector corresponding to the character vector;
and determining the combined word vector of each word corpus according to all the pre-trained character vectors and their corresponding mask vectors.
Optionally, inputting the spliced vector into a feedforward neural network and applying a nonlinear transformation specifically comprises:
determining the mask vector v_ij according to the following formula:
v_ij = tanh(W · [c_ij; x_i])
where tanh() denotes the hyperbolic tangent function, W is the parameter matrix of the feedforward neural network, and c_ij is the j-th pre-trained character vector of the i-th word corpus x_i.
Optionally, determining the combined word vector of each word corpus according to all the pre-trained character vectors and corresponding mask vectors specifically comprises:
summing the element-wise products of all the pre-trained character vectors in each word corpus with their corresponding mask vectors, according to the following formula, to obtain the combined word vector x̂_i of the word corpus:
x̂_i = Σ_{j=1..m} (v_ij ⊙ c_ij)
where c_ij is the j-th pre-trained character vector of the i-th word corpus x_i, v_ij is the mask vector of c_ij, ⊙ denotes the element-wise product, and m is the total number of pre-trained character vectors of the i-th word corpus.
Optionally, determining the final word vector of each word corpus according to the pre-trained word vector and the combined word vector specifically comprises:
taking, by max pooling, the maximum value in each dimension over the pre-trained word vector and the combined word vector as the final word vector, according to the following formula:
x_i^final[k] = max(x_i[k], x̂_i[k]), k = 1, …, d
where x_i[k] denotes the k-th dimension of the pre-trained word vector of the i-th word corpus, x̂_i[k] denotes the k-th dimension of its combined word vector, d is the dimensionality of the vectors, and max() returns the larger value.
Optionally, integrating the final word vectors of the word corpora in the sentence to be processed to obtain the representation vector of that sentence specifically comprises:
and integrating the final word vectors into a representation vector of the sentence to be processed through a sentence combination function.
Optionally, the sentence combination function includes at least one of an Average model function, a Matrix model function, a Dan model function, an RNN model function, and an LSTM model function.
Optionally, the training corpus is a Chinese text corpus crawled from Baidu Encyclopedia.
According to the embodiment of the invention, the invention discloses the following technical effects:
the method for constructing the sentence representation fusing the internal structure information of the Chinese words integrates the final word vectors representing the internal structure information of the words so as to accurately determine the representation vector of the sentence to be processed and improve the utilization rate of the internal structure information of the words by segmenting the training corpus, pre-training the word corpus, integrating the pre-training word vectors and determining the final word vectors.
In order to solve the technical problems, the invention also provides the following scheme:
a construction system for fusing sentence representations of internal structural information of chinese words, the construction system comprising:
the word segmentation unit, used for performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a plurality of word corpora;
the pre-training unit, used for pre-training each word corpus to obtain pre-trained character vectors and pre-trained word vectors;
the first integration unit, used for integrating all the pre-trained character vectors and the pre-trained word vector in each word corpus to obtain a combined word vector corresponding to that word corpus;
the determining unit, used for determining a final word vector of each word corpus according to the pre-trained word vector and the combined word vector, the final word vector representing the internal structural information of the word;
and the second integration unit, used for integrating the final word vectors of the word corpora in the sentence to be processed to obtain the representation vector of the sentence to be processed.
According to the embodiment of the invention, the invention discloses the following technical effects:
The construction system for sentence representations fusing the internal structural information of Chinese words is provided with the word segmentation unit, the pre-training unit, the first integration unit, the determining unit, and the second integration unit. It can segment the training corpus, pre-train the word corpora, integrate the pre-trained character vectors, and determine the final word vectors, thereby integrating a plurality of final word vectors that represent the internal structural information of words, accurately determining the representation vector of the sentence to be processed, and improving the utilization of that information.
Drawings
FIG. 1 is a flow chart of a method of constructing a sentence representation incorporating internal structural information of Chinese words according to the present invention;
FIG. 2 is a schematic diagram of a modular structure of a sentence expression construction system for fusing internal structural information of Chinese words according to the present invention.
Description of the symbols:
word segmentation unit-1, pre-training unit-2, first integration unit-3, determination unit-4, second integration unit-5.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention provides a method for constructing sentence representations fused with the internal structural information of Chinese words. By performing word segmentation on the training corpus, pre-training the word corpora, integrating the pre-trained character vectors, and determining the final word vectors, the method integrates a plurality of final word vectors that represent this internal structural information, so that the representation vector of the sentence to be processed is determined accurately and the utilization of the internal structural information of words is improved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the method for constructing sentence expression by fusing internal structure information of Chinese words in the invention comprises:
Step 100: perform word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a plurality of word corpora;
Step 200: pre-train each word corpus to obtain pre-trained character vectors and pre-trained word vectors;
Step 300: integrate all the pre-trained character vectors and the pre-trained word vector in each word corpus to obtain a combined word vector corresponding to that word corpus;
Step 400: determine a final word vector of each word corpus according to the pre-trained word vector and the combined word vector, the final word vector representing the internal structural information of the word;
Step 500: integrate the final word vectors of the word corpora in the sentence to be processed to obtain the representation vector of the sentence to be processed.
In step 100, the training corpus is a Chinese text corpus crawled from Baidu Encyclopedia.
There are many ways to segment chinese sentences. In the embodiment, the Chinese sentences are segmented by using an open-source segmentation tool.
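The segmentation step can be illustrated with a toy forward-maximum-matching segmenter; the dictionary, sentence, and maximum word length below are illustrative assumptions only, and a real system would use the open-source segmentation tool mentioned above.

```python
# Toy forward-maximum-matching segmenter. The dictionary and the example
# sentence are hypothetical; the patent itself just uses an open-source tool.
VOCAB = {"日本", "首都", "东京", "是", "的"}
MAX_LEN = 2  # length of the longest dictionary entry, in characters

def segment(sentence: str) -> list[str]:
    """Greedily match the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for span in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + span]
            if span == 1 or piece in VOCAB:
                tokens.append(piece)
                i += span
                break
    return tokens

print(segment("东京是日本的首都"))  # -> ['东京', '是', '日本', '的', '首都']
```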
For example, after word segmentation, a Chinese paraphrase sentence pair is expressed as two sequences of words.
In step 200, pre-training each word corpus specifically comprises:
Step 201: split each word corpus into characters to obtain a character corpus.
Step 202: splice the character corpus and the word corpus, so that character vectors and word vectors can be obtained.
Step 203: pre-train the character vectors and the word vectors using an open-source model to obtain the corresponding pre-trained character vectors and pre-trained word vectors.
In this embodiment, the open source model is a skip-gram model, but not limited thereto.
Taking the word "日本" ("Japan") as an example, the obtained 300-dimensional word vector and character vector begin:
"日本: -0.243430 0.294420 0.188458 -0.092921 0.139286 0.186599 0.011289 -0.218883 -0.181062 0.152754 …";
"日: -0.384900 0.214493 0.187968 -0.038464 0.057521 0.069445 -0.218115 -0.035687 -0.126120 -0.419776 -0.312976 …".
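One plausible reading of steps 201–203 is that each word's characters are spliced with the word token into a single joint corpus, which is then fed to an open-source skip-gram trainer so that character vectors and word vectors are learned in the same space. The helper below is an illustrative sketch of that splicing, not the patent's exact procedure:

```python
# Illustrative sketch of steps 201-202: for every word, emit its characters
# (the character corpus) followed by the word itself (the spliced corpus).
# A skip-gram model trained on this joint corpus would then yield both
# pre-trained character vectors and pre-trained word vectors.
def build_joint_corpus(word_sentences):
    joint = []
    for sentence in word_sentences:
        line = []
        for word in sentence:
            line.extend(word)   # individual characters, e.g. "日", "本"
            line.append(word)   # the whole word, e.g. "日本"
        joint.append(line)
    return joint

print(build_joint_corpus([["日本", "首都"]]))
# -> [['日', '本', '日本', '首', '都', '首都']]
```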
In step 300, integrating all the pre-trained character vectors and the pre-trained word vector in each word corpus specifically comprises:
Step 301: splice each pre-trained character vector of the word corpus with its pre-trained word vector to obtain the spliced vector corresponding to that character vector.
Taking "日本" as an example, the pre-trained character vectors of "日" and "本" in the word corpus "日本" are each spliced with its pre-trained word vector, yielding two 600-dimensional spliced vectors.
Step 302: input each spliced vector into a feedforward neural network and apply a nonlinear transformation to obtain the mask vector corresponding to the pre-trained character vector.
The mask vector v_ij is determined as shown in formula (1):
v_ij = tanh(W · [c_ij; x_i]) (1)
where tanh() denotes the hyperbolic tangent function, W is the parameter matrix of the feedforward neural network, and c_ij is the j-th pre-trained character vector of the i-th word corpus x_i. The mask vector v_ij controls the contribution of the j-th character vector to the meaning of the i-th word corpus x_i.
Step 303: determine the combined word vector of each word corpus according to all the pre-trained character vectors and their corresponding mask vectors.
Specifically, the element-wise products of all the pre-trained character vectors in each word corpus with their corresponding mask vectors are summed to obtain the combined word vector x̂_i of the word corpus, as shown in formula (2):
x̂_i = Σ_{j=1..m} (v_ij ⊙ c_ij) (2)
where c_ij is the j-th pre-trained character vector of the i-th word corpus x_i, v_ij is the mask vector of c_ij, ⊙ denotes the element-wise product, and m is the total number of pre-trained character vectors of the i-th word corpus.
In step 400, determining the final word vector of each word corpus according to the pre-trained word vector and the combined word vector specifically comprises:
taking, by max pooling, the maximum value in each dimension over the pre-trained word vector and the combined word vector as the final word vector, as shown in formula (3):
x_i^final[k] = max(x_i[k], x̂_i[k]), k = 1, …, d (3)
where x_i[k] denotes the k-th dimension of the pre-trained word vector of the i-th word corpus, x̂_i[k] denotes the k-th dimension of its combined word vector, and d is the dimensionality of the vectors.
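Formulas (1)–(3) can be sketched with NumPy on toy dimensions; the random values are illustrative, and d is 4 here instead of the 300 used in the embodiment.

```python
import numpy as np

# Sketch of formulas (1)-(3). Names (W, c_ij, x_i) follow the patent; the
# values and the toy dimensionality d=4 are assumptions for illustration.
rng = np.random.default_rng(0)
d = 4
x_i = rng.standard_normal(d)          # pre-trained word vector, e.g. "日本"
c_i = rng.standard_normal((2, d))     # pre-trained character vectors "日", "本"
W = rng.standard_normal((d, 2 * d))   # feedforward network parameter matrix

# (1) mask vectors v_ij = tanh(W · [c_ij; x_i]), one per character
spliced = np.concatenate([c_i, np.tile(x_i, (2, 1))], axis=1)  # 2 x 2d
v = np.tanh(spliced @ W.T)                                     # 2 x d

# (2) combined word vector: sum of element-wise products of character
# vectors with their mask vectors
x_hat = (v * c_i).sum(axis=0)

# (3) max pooling: dimension-wise maximum of x_i and x_hat
x_final = np.maximum(x_i, x_hat)

print(x_final.shape)  # -> (4,)
```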
In step 500, the integrating the final word vector of each word corpus in the sentence to be processed to obtain the representation vector of the sentence to be processed specifically includes:
and integrating the final word vectors into a representation vector of the sentence to be processed through a sentence combination function.
The sentence combination function includes at least one of the Average model function, the Matrix model function, the Dan (deep averaging network) model function, the RNN (recurrent neural network) model function, and the LSTM (long short-term memory) model function.
The Average model function averages the vector representations of all the words in a sentence to obtain the final sentence representation R_sentence, as shown in formula (4):
R_sentence = (1/n) Σ_{i=1..n} x_i (4)
where n is the number of words in the sentence.
The Matrix model function first obtains the sentence vector with the Average model function, then multiplies it by a matrix and applies a nonlinear transformation to obtain the final sentence representation, as shown in formula (5):
R_sentence = tanh(W_m · Average(x) + b) (5)
where W_m is a parameter matrix and b a bias vector.
The Dan model function first obtains the sentence vector with the Average model function, then transforms it with a multilayer feedforward network to obtain the final sentence representation, as shown in formula (6):
R_sentence = tanh(W_2 · tanh(W_1 · Average(x) + b_1) + b_2) (6)
The RNN model function combines the word representations in a sentence sequentially to form the final sentence representation, as shown in formula (7):
R_sentence = RNN(x) = f(W_x · x_i + W_h · h_{i-1} + b) (7)
where h_{i-1} is the hidden state after the (i-1)-th word and f is a nonlinear activation function.
The LSTM model function likewise combines the word representations in a sentence, using the standard LSTM gating recurrence, to form the final sentence representation, as shown in formula (8).
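As a concrete illustration of the simplest of these combination functions, the Average model of formula (4) reduces to a mean over the final word vectors; the toy 3-dimensional vectors below are assumptions (real vectors are 300-dimensional).

```python
import numpy as np

# Average model, formula (4): the sentence representation is the mean of the
# final word vectors of the sentence. Values are hypothetical.
final_word_vectors = np.array([
    [0.2, -0.1, 0.4],   # hypothetical final vector of word 1
    [0.0,  0.3, 0.2],   # hypothetical final vector of word 2
])
R_sentence = final_word_vectors.mean(axis=0)
print(R_sentence)  # the dimension-wise mean, i.e. [0.1, 0.1, 0.3]
```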
After the representation vector of each sentence in a sentence pair is obtained, the model parameters are learned with a max-margin objective that maximizes the distance between positive and negative examples, as shown in formula (9):
L = max(0, δ − sim(R_x1, R_x2) + sim(R_t1, R_t2)) (9)
where (x1, x2) is a positive example, i.e., a sentence pair with similar meaning; (t1, t2) is a negative example formed from randomly combined sentence pairs; R_x denotes the representation vector of sentence x; sim() is a similarity function; and δ is the margin.
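The max-margin objective can be sketched as a hinge loss over pair similarities; the choice of cosine similarity and the margin value below are assumptions, since the text does not spell them out.

```python
import numpy as np

# Hedged sketch of a max-margin objective: push the similarity of a
# paraphrase (positive) pair above that of a random (negative) pair by a
# margin. cosine() and margin=1.0 are illustrative assumptions.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_margin_loss(x1, x2, t1, t2, margin=1.0):
    """Hinge loss: max(0, margin - sim(x1,x2) + sim(t1,t2))."""
    return max(0.0, margin - cosine(x1, x2) + cosine(t1, t2))

pos_a = np.array([1.0, 0.0]); pos_b = np.array([1.0, 0.1])  # similar pair
neg_a = np.array([1.0, 0.0]); neg_b = np.array([0.0, 1.0])  # random pair
loss = max_margin_loss(pos_a, pos_b, neg_a, neg_b)
print(loss)  # small: the positive pair is already far more similar
```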
Table 1 compares the present invention with the character-based model, the word-based model, and the word-average model on three sets of test data (the big-data set, the Baidu set, and Total, i.e., the two combined). The training data comprise 30,846 sentence pairs. From Table 1 it can be found that, measured by the Pearson correlation between model predictions and the gold scores, the invention improves on the word-based model by 2.00% on average, and on the word-average model by 1.52%. This fully demonstrates the effectiveness and superiority of the proposed construction method for sentence representations fusing the internal structural information of Chinese words.
TABLE 1 Pearson correlation on the different sentence-similarity test sets
In addition, Table 2 compares the present invention with the character-based model, the word-based model, and the word-average model on a word-similarity test set. It can be seen directly that the method also effectively improves the quality of the word representations.
TABLE 2 Pearson correlation on the word-similarity test set
The construction method for sentence representations fusing the internal structural information of Chinese words has the following positive effects. Chinese words are formed of characters, and for most words the meanings of the characters strongly influence the meaning of the word they compose; a small portion of Chinese words, however, are non-compositional, and their meaning is independent of the meanings of their constituent characters. By modeling the internal structure of Chinese words, the invention effectively improves the quality of word representations and, to a certain extent, automatically identifies non-compositional words. The invention uses a mask-gate mechanism to control the contribution of each character in a word to the word's semantics, uses max pooling to select whether the meaning of a word is taken as a whole or composed from its character meanings, and learns the weighting between the two automatically.
Experiments on a Chinese sentence-similarity task show that, compared with a word-based sentence representation model, the method improves the average Pearson correlation by 2.00%; compared with a sentence representation model based on word averaging, it improves the Pearson correlation by 1.52% on average. This fully demonstrates the effectiveness and superiority of fusing the internal structure of words.
In addition, the invention also provides a construction system for sentence representation by fusing the internal structure information of the Chinese words. As shown in FIG. 2, the system for constructing sentence expression fused with internal structure information of Chinese words according to the present invention includes a segmentation unit 1, a pre-training unit 2, a first integration unit 3, a determination unit 4, and a second integration unit 5.
The word segmentation unit 1 is used for performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a plurality of word corpora; the pre-training unit 2 is used for pre-training each word corpus to obtain pre-trained character vectors and pre-trained word vectors; the first integration unit 3 is used for integrating all the pre-trained character vectors and the pre-trained word vector in each word corpus to obtain a combined word vector corresponding to that word corpus; the determining unit 4 is used for determining a final word vector of each word corpus according to the pre-trained word vector and the combined word vector, the final word vector representing the internal structural information of the word; the second integration unit 5 is used for integrating the final word vectors of the word corpora in the sentence to be processed to obtain the representation vector of the sentence to be processed.
Compared with the prior art, the construction system for sentence representations fusing the internal structural information of Chinese words has the same beneficial effects as the construction method described above, which are not repeated here.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
Claims (8)
1. A method for constructing sentence representations fusing the internal structural information of Chinese words, characterized by comprising the following steps:
performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a plurality of word corpora;
pre-training each word corpus to obtain pre-trained character vectors and pre-trained word vectors;
integrating all the pre-trained character vectors and the pre-trained word vector in each word corpus to obtain a combined word vector corresponding to that word corpus;
determining a final word vector of each word corpus according to the pre-trained word vector and the combined word vector, wherein the final word vector represents the internal structural information of the word;
and integrating the final word vectors of the word corpora in the sentence to be processed to obtain the representation vector of the sentence to be processed;
wherein integrating all the pre-trained character vectors and the pre-trained word vector in each word corpus specifically comprises:
splicing each pre-trained character vector of the word corpus with its pre-trained word vector to obtain the spliced vector corresponding to that character vector;
inputting the spliced vector into a feedforward neural network and applying a nonlinear transformation to obtain the mask vector corresponding to the character vector;
and determining the combined word vector of each word corpus according to all the pre-trained character vectors and their corresponding mask vectors.
2. The method for constructing sentence representations fusing the internal structural information of Chinese words according to claim 1, wherein pre-training each word corpus specifically comprises:
splitting each word corpus into characters to obtain a character corpus;
splicing the character corpus and the word corpus so that character vectors and word vectors can be obtained;
and pre-training the character vectors and the word vectors using an open-source model to obtain the corresponding pre-trained character vectors and pre-trained word vectors.
3. The method for constructing sentence representations fusing the internal structural information of Chinese words according to claim 1, wherein inputting the spliced vector into a feedforward neural network and applying a nonlinear transformation specifically comprises:
determining the mask vector v_ij according to the following formula:
v_ij = tanh(W · [c_ij; x_i])
where tanh() denotes the hyperbolic tangent function, W is the parameter matrix of the feedforward neural network, and c_ij is the j-th pre-trained character vector of the i-th word corpus x_i.
4. The method according to claim 1, wherein determining the combined word vector of each word corpus according to all the pre-trained character vectors and corresponding mask vectors specifically comprises:
summing the element-wise products of all the pre-trained character vectors in each word corpus with their corresponding mask vectors, according to the following formula, to obtain the combined word vector x̂_i of the word corpus:
x̂_i = Σ_{j=1..m} (v_ij ⊙ c_ij)
where c_ij is the j-th pre-trained character vector of the i-th word corpus x_i, v_ij is the mask vector of c_ij, ⊙ denotes the element-wise product, and m is the total number of pre-trained character vectors of the i-th word corpus.
5. The method according to claim 1, wherein determining the final word vector of each word corpus according to the pre-training word vector and the combined word vector specifically comprises:
taking, based on a max-pooling method and according to the following formula, the maximum value of the pre-training word vector and the combined word vector in each dimension as the final word vector z_i:

z_i^(k) = max(x_i^(k), x̃_i^(k)), k = 1, …, d

where x_i^(k) represents the k-th dimension of the pre-training word vector of the i-th word corpus, x̃_i^(k) represents the k-th dimension of the combined word vector of the i-th word corpus, d represents the dimensionality of the vectors, and max() represents the maximum-value function.
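The dimension-wise max pooling of claim 5 is a one-liner in numpy; this is an illustrative sketch with hypothetical names, not the patent's code:

```python
import numpy as np

def final_word_vector(x_i, x_tilde_i):
    """Sketch of claim 5: dimension-wise max pooling between the
    pre-training word vector x_i and the combined word vector x̃_i."""
    return np.maximum(x_i, x_tilde_i)  # max over each dimension k = 1..d
```

Each dimension of the final vector thus keeps whichever signal is stronger: the word-level embedding or the character-composed one.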
6. The method according to claim 1, wherein the step of integrating the final word vectors of the word corpora in the sentence to be processed to obtain the representation vector of the sentence to be processed specifically comprises:
and integrating the final word vectors into a representation vector of the sentence to be processed through a sentence combination function.
7. The method of claim 6, wherein the sentence combination function comprises at least one of an Average model function, a Matrix model function, a Dan model function, an RNN model function, and an LSTM model function.
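Of the combination functions listed in claim 7, the Average model is the simplest; a minimal numpy sketch (assuming the final word vectors are 1-D arrays of equal dimension, function name hypothetical) is:

```python
import numpy as np

def average_sentence_vector(final_word_vectors):
    """Sketch of the Average model from claim 7: the sentence
    representation vector is the mean of the final word vectors
    of the words in the sentence to be processed."""
    return np.mean(np.stack(final_word_vectors), axis=0)
```

The other listed options (Matrix, Dan, RNN, LSTM) replace this mean with a parameterized composition over the same final word vectors.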
8. The method for constructing sentence representations fusing internal structural information of Chinese words according to any one of claims 1-7, wherein the training corpus is a Chinese text corpus crawled from Baidu Encyclopedia.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710449875.3A CN107423284B (en) | 2017-06-14 | 2017-06-14 | Method and system for constructing sentence representation fusing internal structure information of Chinese words |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107423284A CN107423284A (en) | 2017-12-01 |
CN107423284B true CN107423284B (en) | 2020-03-06 |
Family
ID=60428673
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710449875.3A Active CN107423284B (en) | 2017-06-14 | 2017-06-14 | Method and system for constructing sentence representation fusing internal structure information of Chinese words |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107423284B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595416A (en) * | 2018-03-27 | 2018-09-28 | 义语智能科技(上海)有限公司 | Character string processing method and equipment |
CN108717406B (en) * | 2018-05-10 | 2021-08-24 | 平安科技(深圳)有限公司 | Text emotion analysis method and device and storage medium |
CN108986797B (en) * | 2018-08-06 | 2021-07-06 | 中国科学技术大学 | Voice theme recognition method and system |
CN111382249B (en) * | 2018-12-29 | 2023-10-10 | 深圳市优必选科技有限公司 | Chat corpus cleaning method and device, computer equipment and storage medium |
CN111538817B (en) * | 2019-01-18 | 2024-06-18 | 北京京东尚科信息技术有限公司 | Man-machine interaction method and device |
CN109992788B (en) * | 2019-04-10 | 2023-08-29 | 鼎富智能科技有限公司 | Deep text matching method and device based on unregistered word processing |
CN110263323B (en) * | 2019-05-08 | 2020-08-28 | 清华大学 | Keyword extraction method and system based on barrier type long-time memory neural network |
CN110245353B (en) * | 2019-06-20 | 2022-10-28 | 腾讯科技(深圳)有限公司 | Natural language expression method, device, equipment and storage medium |
CN112825109B (en) * | 2019-11-20 | 2024-02-23 | 南京贝湾信息科技有限公司 | Sentence alignment method and computing device |
CN112906370B (en) * | 2019-12-04 | 2022-12-20 | 马上消费金融股份有限公司 | Intention recognition model training method, intention recognition method and related device |
CN111259148B (en) * | 2020-01-19 | 2024-03-26 | 北京小米松果电子有限公司 | Information processing method, device and storage medium |
CN111581335B (en) * | 2020-05-14 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Text representation method and device |
CN111507099A (en) * | 2020-06-19 | 2020-08-07 | 平安科技(深圳)有限公司 | Text classification method and device, computer equipment and storage medium |
CN111832301A (en) * | 2020-07-28 | 2020-10-27 | 电子科技大学 | Chinese word vector generation method based on adaptive component n-tuple |
CN112733520B (en) * | 2020-12-30 | 2023-07-18 | 望海康信(北京)科技股份公司 | Text similarity calculation method, system, corresponding equipment and storage medium |
CN112765325A (en) * | 2021-01-27 | 2021-05-07 | 语联网(武汉)信息技术有限公司 | Vertical field corpus data screening method and system |
CN113158624B (en) * | 2021-04-09 | 2023-12-08 | 中国人民解放军国防科技大学 | Method and system for fine tuning pre-training language model by fusing language information in event extraction |
CN113379032A (en) * | 2021-06-08 | 2021-09-10 | 全球能源互联网研究院有限公司 | Layered bidirectional LSTM sequence model training method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105653671A (en) * | 2015-12-29 | 2016-06-08 | 畅捷通信息技术股份有限公司 | Similar information recommendation method and system |
CN106227721A (en) * | 2016-08-08 | 2016-12-14 | 中国科学院自动化研究所 | Chinese Prosodic Hierarchy prognoses system |
CN106383816A (en) * | 2016-09-26 | 2017-02-08 | 大连民族大学 | Chinese minority region name identification method based on deep learning |
CN106547735A (en) * | 2016-10-25 | 2017-03-29 | 复旦大学 | The structure and using method of the dynamic word or word vector based on the context-aware of deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107423284B (en) | Method and system for constructing sentence representation fusing internal structure information of Chinese words | |
CN110287494A (en) | A method of the short text Similarity matching based on deep learning BERT algorithm | |
Palangi et al. | Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval | |
Chen et al. | Research on text sentiment analysis based on CNNs and SVM | |
CN110134954B (en) | Named entity recognition method based on Attention mechanism | |
Li et al. | A method of emotional analysis of movie based on convolution neural network and bi-directional LSTM RNN | |
CN109086269B (en) | Semantic bilingual recognition method based on semantic resource word representation and collocation relationship | |
Cai et al. | Intelligent question answering in restricted domains using deep learning and question pair matching | |
CN110489554B (en) | Attribute-level emotion classification method based on location-aware mutual attention network model | |
Diao et al. | A multi-dimension question answering network for sarcasm detection | |
Hu et al. | Considering optimization of English grammar error correction based on neural network | |
CN112784602A (en) | News emotion entity extraction method based on remote supervision | |
CN110297986A (en) | A kind of Sentiment orientation analysis method of hot microblog topic | |
CN111159405B (en) | Irony detection method based on background knowledge | |
Wu et al. | An effective approach of named entity recognition for cyber threat intelligence | |
CN115357719A (en) | Power audit text classification method and device based on improved BERT model | |
El Desouki et al. | Exploring the recent trends of paraphrase detection | |
CN114417851A (en) | Emotion analysis method based on keyword weighted information | |
CN114722176A (en) | Intelligent question answering method, device, medium and electronic equipment | |
Wu et al. | Machine translation of English speech: Comparison of multiple algorithms | |
Zheng et al. | A novel hierarchical convolutional neural network for question answering over paragraphs | |
CN114970557A (en) | Knowledge enhancement-based cross-language structured emotion analysis method | |
Ji et al. | Research on semantic similarity calculation methods in Chinese financial intelligent customer service | |
CN115878752A (en) | Text emotion analysis method, device, equipment, medium and program product | |
Xu et al. | Research on multi-feature fusion entity relation extraction based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||