CN107423284B - Method and system for constructing sentence representation fusing internal structure information of Chinese words - Google Patents

Method and system for constructing sentence representation fusing internal structure information of Chinese words

Info

Publication number
CN107423284B
Authority
CN
China
Prior art keywords
word
training
corpus
vector
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710449875.3A
Other languages
Chinese (zh)
Other versions
CN107423284A (en)
Inventor
王少楠 (Shaonan Wang)
张家俊 (Jiajun Zhang)
宗成庆 (Chengqing Zong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201710449875.3A
Publication of CN107423284A
Application granted
Publication of CN107423284B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to a method and a system for constructing sentence representations that fuse the internal structure information of Chinese words, aiming to solve the problem that this internal structure information is poorly utilized. The construction method comprises the following steps: performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a corpus of words; pre-training to obtain a pre-training character vector for each character and a pre-training word vector for each word; integrating all the pre-training character vectors of each word with the word's pre-training word vector to obtain a combined word vector for that word; determining a final word vector for each word according to its pre-training word vector and its combined word vector, the final word vector representing the word's internal structure information; and integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed. The invention improves the utilization of the internal structure information of words.

Description

Method and system for constructing sentence representation fusing internal structure information of Chinese words
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method and a system for constructing sentence representations that fuse the internal structure information of Chinese words.
Background
Sentence representation maps a natural language sentence into a high-dimensional space such that semantically similar sentences lie closer together in that space. It is a fundamental task in natural language processing and directly affects the performance of the entire language processing system. Much effort has therefore been devoted to designing sentence representation methods suited to specific tasks in order to improve the performance of language processing systems.
Traditional sentence representation methods use large numbers of manually designed features to represent the meaning of a sentence and have achieved good results in various natural language processing tasks. However, they demand substantial manpower and domain expertise, and features often must be selected anew for each task, leading to poor model generalization and difficult feature design. In recent years it has been found that neural-network-based models can automatically extract the semantic features of sentences from large-scale text, greatly improving the quality of sentence representations.
However, most research on sentence representation targets English sentences, designing different neural network architectures at the word granularity to encode sentence semantics. Unlike English, Chinese words are composed of characters, and these characters carry rich semantic information that reflects the meaning of the word. Researchers have noticed this and improved word vector learning by exploiting the characters inside Chinese words, but these methods do not fully use the internal information of Chinese words, such as the relationships between characters, and they are limited to word vector learning without being explored for sentence representation. How to fully exploit the internal structure information of words to learn a better sentence representation model is therefore a topic worth studying.
Disclosure of Invention
In order to solve the problem in the prior art, namely the poor utilization of the internal structure information of words, the invention provides a method and a system for constructing sentence representations that fuse the internal structure information of Chinese words.
In order to solve the technical problems, the invention provides the following scheme:
a construction method for sentence representation fusing internal structure information of Chinese words comprises the following steps:
performing word segmentation on all Chinese repeated statement sentence pairs in the training corpus to obtain a plurality of word corpuses;
pre-training each word corpus to obtain pre-training word vectors and pre-training word vectors;
integrating all pre-training word vectors and pre-training word vectors in each word corpus to obtain a combined word vector corresponding to the word corpus;
determining a final word vector of each word corpus according to the pre-training word vector and the combined word vector in each word corpus, wherein the final word vector represents word internal structure information;
and integrating the final word vector of each word corpus in the sentence to be processed to obtain the expression vector of the sentence to be processed.
Optionally, the pre-training specifically includes:
splitting each word into characters to obtain a character corpus;
concatenating the character corpus and the word corpus;
and pre-training the character vectors and word vectors with an open-source model to obtain the corresponding pre-training character vectors and pre-training word vectors.
Optionally, integrating all the pre-training character vectors of each word with the word's pre-training word vector specifically includes:
concatenating each pre-training character vector of the word with the word's pre-training word vector to obtain a concatenated vector corresponding to that character vector;
inputting the concatenated vector into a feedforward neural network and applying a nonlinear transformation to obtain a mask vector corresponding to that character vector;
and determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors.
Optionally, inputting the concatenated vector into a feedforward neural network and applying a nonlinear transformation specifically includes:
determining the mask vector v_ij according to the following formula:

v_ij = tanh(W · [c_ij; x_i])

where tanh(·) is the hyperbolic tangent function, W is a parameter matrix of the feedforward neural network, and c_ij is the j-th pre-training character vector of the i-th word x_i.
Optionally, determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors specifically includes:
summing the element-wise products of all the pre-training character vectors in the word with their corresponding mask vectors to obtain the combined word vector x̂_i of the word, according to the following formula:

x̂_i = Σ_{j=1}^{m} v_ij ⊙ c_ij

where c_ij is the j-th pre-training character vector of the i-th word x_i, v_ij is the mask vector corresponding to c_ij, ⊙ denotes the element-wise product, and m is the number of characters in the i-th word.
Optionally, determining the final word vector of each word according to its pre-training word vector and its combined word vector specifically includes:
based on a max-pooling method, taking the maximum of the pre-training word vector and the combined word vector in each dimension as the final word vector x̃_i, according to the following formula:

x̃_i^(k) = max(x_i^(k), x̂_i^(k)),  k = 1, …, d

where x_i^(k) is the k-th dimension of the pre-training word vector of the i-th word, x̂_i^(k) is the k-th dimension of its combined word vector, d is the dimensionality of the word vectors, and max(·) takes the maximum value.
Optionally, integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed specifically includes:
integrating the final word vectors into the representation vector of the sentence to be processed through a sentence combination function.
Optionally, the sentence combination function includes at least one of an Average model function, a Matrix model function, a Dan model function, an RNN model function, and an LSTM model function.
Optionally, the training corpus is a Chinese text corpus crawled from Baidu Encyclopedia.
Embodiments of the invention provide the following technical effects:
the method for constructing the sentence representation fusing the internal structure information of the Chinese words integrates the final word vectors representing the internal structure information of the words so as to accurately determine the representation vector of the sentence to be processed and improve the utilization rate of the internal structure information of the words by segmenting the training corpus, pre-training the word corpus, integrating the pre-training word vectors and determining the final word vectors.
In order to solve the technical problems, the invention also provides the following scheme:
a construction system for fusing sentence representations of internal structural information of chinese words, the construction system comprising:
the word segmentation unit is used for performing word segmentation on all Chinese repeated statement sentence pairs in the training corpus to obtain a plurality of word corpuses;
the pre-training unit is used for pre-training each word corpus to obtain pre-training word vectors and pre-training word vectors;
the first integration unit is used for integrating all pre-training word vectors and pre-training word vectors in each word corpus to obtain a combined word vector corresponding to the word corpus;
the determining unit is used for determining a final word vector of each word corpus according to a pre-training word vector and the combined word vector in each word corpus, and the final word vector represents word internal structure information;
and the second integration unit is used for integrating the final word vector of each word corpus in the sentence to be processed to obtain the expression vector of the sentence to be processed.
Embodiments of the invention provide the following technical effects:
the building system for sentence representation of Chinese word internal structure information is provided with the word segmentation unit, the pre-training unit, the first integration unit, the determination unit and the second integration unit, can perform word segmentation processing on a training corpus, pre-train a word corpus, integrate a pre-training word vector and determine a final word vector, thereby integrating a plurality of final word vectors representing the word internal structure information to accurately determine the representation vector of the sentence to be processed and improving the utilization rate of the word internal structure information.
Drawings
FIG. 1 is a flow chart of the method for constructing sentence representations fusing the internal structure information of Chinese words according to the present invention;
FIG. 2 is a schematic diagram of the module structure of the system for constructing sentence representations fusing the internal structure information of Chinese words according to the present invention.
Description of the symbols:
word segmentation unit-1, pre-training unit-2, first integration unit-3, determination unit-4, second integration unit-5.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention provides a method for constructing sentence representations fused with the internal structure information of Chinese words. By performing word segmentation on the training corpus, pre-training character and word vectors, integrating the pre-training character vectors with the word vectors, and determining final word vectors, the method integrates final word vectors that represent the internal structure information of words, so as to accurately determine the representation vector of the sentence to be processed and improve the utilization of that information.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the method for constructing sentence representations fusing the internal structure information of Chinese words comprises:
Step 100: performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a corpus of words;
Step 200: pre-training to obtain a pre-training character vector for each character and a pre-training word vector for each word;
Step 300: integrating all the pre-training character vectors of each word with the word's pre-training word vector to obtain a combined word vector for that word;
Step 400: determining a final word vector for each word according to its pre-training word vector and its combined word vector, the final word vector representing the word's internal structure information;
Step 500: integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed.
In step 100, the training corpus is a Chinese text corpus crawled from Baidu Encyclopedia.
There are many ways to segment Chinese sentences into words. In this embodiment, Chinese sentences are segmented with an open-source word segmentation tool.
Given a Chinese paraphrase sentence pair (the example pair appears as an image in the original document), word segmentation expresses each sentence of the pair as a sequence of words.
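The patent does not name its segmentation tool. As a minimal sketch, the open-source jieba tokenizer can play this role; the sample sentence below is illustrative, not the pair from the original figure:

```python
# Word-segmentation sketch using the open-source jieba tokenizer.
# jieba is an assumption: the patent only says "an open-source tool".
import jieba

sentence = "日本的首都是东京"  # illustrative sentence, not from the patent
words = list(jieba.cut(sentence))
print(words)  # e.g. ['日本', '的', '首都', '是', '东京']
```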
in step 200, the pre-training of each word corpus specifically includes:
step 201: and splitting each word corpus according to characters to obtain a word corpus.
Step 202: and splicing the word linguistic data and the word linguistic data to obtain a word vector and a word vector.
Step 203: and pre-training the character vectors and the word vectors by utilizing an open source model to obtain corresponding pre-training character vectors and pre-training word vectors.
In this embodiment, the open-source model is the skip-gram model, but the method is not limited thereto.
Taking "japan" as an example, the obtained 300-dimensional word vector and word vector are:
"Japanese-0.2434300.2944200.188458-0.0929210.1392860.1865990.011289-0.218883-0.1810620.152754 …";
"Ri-0.3849000.2144930.187968-0.0384640.0575210.069445-0.218115-0.035687-0.126120-0.419776-0.312976 …".
In step 300, integrating all the pre-training character vectors of each word with the word's pre-training word vector specifically includes:
Step 301: concatenating each pre-training character vector of the word with the word's pre-training word vector to obtain a concatenated vector corresponding to that character vector.
Taking "japan" as an example, all the pre-training word vectors "day", "this", and the pre-training word vector "japan" in one word corpus "japan" are spliced to obtain two 600-dimensional spliced vectors.
Step 302: inputting each concatenated vector into a feedforward neural network and applying a nonlinear transformation to obtain the mask vector corresponding to that character vector.
This specifically includes determining the mask vector v_ij, as shown in equation (1):

v_ij = tanh(W · [c_ij; x_i])    (1)

where tanh(·) is the hyperbolic tangent function, W is a parameter matrix of the feedforward neural network, and c_ij is the j-th pre-training character vector of the i-th word x_i. The mask vector v_ij controls the contribution of the j-th character vector to the meaning of the i-th word x_i.
Step 303: determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors.
This specifically includes summing the element-wise products of all the pre-training character vectors in the word with their corresponding mask vectors to obtain the combined word vector x̂_i of the word, as shown in equation (2):

x̂_i = Σ_{j=1}^{m} v_ij ⊙ c_ij    (2)

where c_ij is the j-th pre-training character vector of the i-th word x_i, v_ij is the mask vector corresponding to c_ij, ⊙ denotes the element-wise product, and m is the number of characters in the i-th word.
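A numpy sketch of equation (2); in real use the mask vectors come from equation (1), while here random placeholders stand in for both the character vectors and their masks:

```python
# Sketch of equation (2): combined word vector as a mask-gated sum over characters.
import numpy as np

d, m = 300, 2                          # dimensionality; m = characters in the word
rng = np.random.default_rng(0)
C = rng.normal(size=(m, d))            # character vectors c_i1 .. c_im (placeholders)
V = np.tanh(rng.normal(size=(m, d)))   # mask vectors v_i1 .. v_im (placeholders)

x_hat_i = (V * C).sum(axis=0)          # element-wise products, summed over characters
```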
In step 400, determining the final word vector of each word according to its pre-training word vector and its combined word vector specifically includes:
based on a max-pooling method, taking the maximum of the pre-training word vector and the combined word vector in each dimension as the final word vector x̃_i, as shown in equation (3):

x̃_i^(k) = max(x_i^(k), x̂_i^(k)),  k = 1, …, d    (3)

where x_i^(k) is the k-th dimension of the pre-training word vector of the i-th word, x̂_i^(k) is the k-th dimension of its combined word vector, d is the dimensionality of the word vectors, and max(·) takes the maximum value.
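Equation (3) reduces to a single element-wise maximum, sketched below with placeholder vectors:

```python
# Sketch of equation (3): dimension-wise max pooling of the pre-training
# word vector and the combined word vector.
import numpy as np

rng = np.random.default_rng(0)
x_i = rng.normal(size=300)             # pre-training word vector (placeholder)
x_hat_i = rng.normal(size=300)         # combined word vector (placeholder)

x_final_i = np.maximum(x_i, x_hat_i)   # final word vector, one max per dimension
```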
In step 500, integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence specifically includes:
integrating the final word vectors into the representation vector of the sentence to be processed through a sentence combination function.
The sentence combination function includes at least one of an Average model function, a Matrix model function, a Dan (deep averaging network) model function, an RNN (recurrent neural network) model function, and an LSTM (long short-term memory) model function.
The Average model function averages the vector representations of all n words in a sentence to obtain the final sentence representation R_sentence, as shown in equation (4):

R_sentence = (1/n) Σ_{i=1}^{n} x̃_i    (4)

The Matrix model function first obtains the averaged sentence vector R_avg with the Average model function, then multiplies it by a matrix and applies a nonlinear transformation to obtain the final sentence representation, as shown in equation (5):

R_sentence = f(W_m · R_avg)    (5)

The Dan model function first obtains the averaged sentence vector R_avg with the Average model function, then transforms it through a multi-layer feedforward neural network to obtain the final sentence representation, as shown in equation (6):

R_sentence = f(W_2 · f(W_1 · R_avg + b_1) + b_2)    (6)
The RNN model function combines the word representations in a sentence into the final sentence representation, as shown in equation (7):

h_i = f(W_x · x̃_i + W_h · h_{i-1} + b),  R_sentence = h_n    (7)

The LSTM model function combines the word representations in a sentence into the final sentence representation through the standard LSTM recurrence, as shown in equation (8):

i_t = σ(W_i · [x̃_t; h_{t-1}] + b_i)
f_t = σ(W_f · [x̃_t; h_{t-1}] + b_f)
o_t = σ(W_o · [x̃_t; h_{t-1}] + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · [x̃_t; h_{t-1}] + b_c)
h_t = o_t ⊙ tanh(c_t),  R_sentence = h_n    (8)

where σ(·) is the sigmoid function.
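A numpy sketch of two of these combination functions, the Average model of equation (4) and the plain RNN of equation (7); the weights are random placeholders for parameters that the patent learns from paraphrase pairs:

```python
# Sketch of equations (4) and (7): Average and RNN sentence combination.
import numpy as np

rng = np.random.default_rng(0)
d = 300
words = rng.normal(size=(5, d))      # final word vectors of a 5-word sentence

# Average model: R_sentence is the mean of the final word vectors.
r_avg = words.mean(axis=0)

# RNN model: h_i = f(W_x·x_i + W_h·h_{i-1} + b); the last hidden state
# serves as the sentence representation.
W_x = rng.normal(scale=0.01, size=(d, d))
W_h = rng.normal(scale=0.01, size=(d, d))
b = np.zeros(d)
h = np.zeros(d)
for x in words:
    h = np.tanh(W_x @ x + W_h @ h + b)
r_rnn = h
```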
After the representation vector of each sentence in a pair is obtained, the model parameters are learned with a max-margin objective that maximizes the separation between positive and negative examples, as shown in equation (9):

L = Σ max(0, δ − sim(R_x1, R_x2) + sim(R_t1, R_t2))    (9)

where (x1, x2) is a positive example, i.e., a sentence pair with similar meaning; (t1, t2) is a negative example formed by randomly combining sentences; R_x denotes the representation vector of sentence x; sim(·, ·) is a similarity function; and δ is the margin.
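A sketch of one plausible instantiation of equation (9), assuming cosine similarity and a fixed margin; the exact similarity function and margin value are assumptions, not stated in the patent:

```python
# Sketch of the max-margin objective of equation (9), assuming cosine
# similarity and margin delta = 1.0 (both assumptions).
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def max_margin_loss(r_x1, r_x2, r_t1, r_t2, delta=1.0):
    # Push the similarity of a paraphrase pair (x1, x2) above that of a
    # randomly combined negative pair (t1, t2) by at least the margin.
    return max(0.0, delta - cos(r_x1, r_x2) + cos(r_t1, r_t2))

rng = np.random.default_rng(0)
r_x1, r_x2, r_t1, r_t2 = rng.normal(size=(4, 300))
print(max_margin_loss(r_x1, r_x2, r_t1, r_t2))
```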
Table 1 compares the present invention with a word-based model, a character-based model and a character-averaging model on three sentence similarity test sets (the large-scale set, the Baidu set, and Total, the union of the two). The training data comprises 30,846 sentence pairs. From Table 1 it can be seen that, measured by the Pearson correlation between model predictions and gold-standard scores, the invention improves on the word-based model by 2.00% on average and on the character-averaging model by 1.52% on average, which fully demonstrates the effectiveness and superiority of the proposed construction method.
TABLE 1: Pearson correlation on different sentence similarity test sets
(The table data appears as an image in the original document.)
In addition, Table 2 compares the present invention with the word-based, character-based and character-averaging models on a word similarity test set. It can be seen directly that the invention effectively improves the quality of word representations.
TABLE 2: Pearson correlation on the word similarity test set
(The table data appears as an image in the original document.)
The construction method for sentence representations fusing the internal structure information of Chinese words has the following positive effects. Chinese words are formed from characters, and for most words the meanings of the characters strongly influence the meaning of the word they compose; a small portion of Chinese words, however, are non-compositional, their meaning independent of the meanings of their constituent characters. By modeling the internal structure of Chinese words, the invention effectively improves word representations and can, to a certain extent, automatically identify non-compositional words. The invention uses a mask gate mechanism to control how much each character contributes to the word's semantics, uses max pooling to select, per dimension, whether the word's meaning is taken as a whole or composed from the meanings of its characters, and learns the weights of the two automatically.
Experiments on a Chinese sentence similarity task show that, compared with a word-based sentence representation model, the average Pearson correlation improves by 2.00%, and compared with a sentence representation model based on character averaging, it improves by 1.52% on average. This fully demonstrates the effectiveness and superiority of fusing the internal structure of words.
In addition, the invention also provides a system for constructing sentence representations fusing the internal structure information of Chinese words. As shown in FIG. 2, the system includes a word segmentation unit 1, a pre-training unit 2, a first integration unit 3, a determining unit 4 and a second integration unit 5.
The word segmentation unit 1 performs word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a corpus of words; the pre-training unit 2 pre-trains to obtain a pre-training character vector for each character and a pre-training word vector for each word; the first integration unit 3 integrates all the pre-training character vectors of each word with the word's pre-training word vector to obtain a combined word vector for that word; the determining unit 4 determines a final word vector for each word according to its pre-training word vector and its combined word vector, the final word vector representing the word's internal structure information; and the second integration unit 5 integrates the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed.
Compared with the prior art, the system for constructing sentence representations fusing the internal structure information of Chinese words has the same beneficial effects as the construction method described above, which are not repeated here.
The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings, but those skilled in the art will readily understand that the scope of the present invention is not limited to these specific embodiments. Equivalent changes or substitutions of the related technical features may be made without departing from the principle of the present invention, and the technical solutions after such changes or substitutions fall within the protection scope of the present invention.

Claims (8)

1. A method for constructing sentence representations fusing the internal structure information of Chinese words, characterized by comprising the following steps:
performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a corpus of words;
pre-training to obtain a pre-training character vector for each character and a pre-training word vector for each word;
integrating all the pre-training character vectors of each word with the word's pre-training word vector to obtain a combined word vector for that word;
determining a final word vector for each word according to its pre-training word vector and its combined word vector, the final word vector representing the word's internal structure information;
integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed;
wherein integrating all the pre-training character vectors of each word with the word's pre-training word vector specifically comprises:
concatenating each pre-training character vector of the word with the word's pre-training word vector to obtain a concatenated vector corresponding to that character vector;
inputting the concatenated vector into a feedforward neural network and applying a nonlinear transformation to obtain a mask vector corresponding to that character vector;
and determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors.
2. The method for constructing sentence representations fusing the internal structure information of Chinese words according to claim 1, wherein the pre-training specifically comprises:
splitting each word into characters to obtain a character corpus;
concatenating the character corpus and the word corpus;
and pre-training the character vectors and word vectors with an open-source model to obtain the corresponding pre-training character vectors and pre-training word vectors.
3. The method for constructing sentence representations fusing the internal structure information of Chinese words according to claim 1, wherein inputting the concatenated vector into a feedforward neural network and applying a nonlinear transformation specifically comprises:
determining the mask vector v_ij according to the following formula:

v_ij = tanh(W · [c_ij; x_i])

where tanh(·) is the hyperbolic tangent function, W is a parameter matrix of the feedforward neural network, and c_ij is the j-th pre-training character vector of the i-th word x_i.
4. The method according to claim 1, wherein determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors specifically comprises:
summing the element-wise products of all the pre-training character vectors in the word with their corresponding mask vectors to obtain the combined word vector x̂_i of the word, according to the following formula:

x̂_i = Σ_{j=1}^{m} v_ij ⊙ c_ij

where c_ij is the j-th pre-training character vector of the i-th word x_i, v_ij is the mask vector corresponding to c_ij, ⊙ denotes the element-wise product, and m is the number of characters in the i-th word.
5. The method according to claim 1, wherein determining the final word vector of each word according to its pre-training word vector and its combined word vector specifically comprises:
based on a max-pooling method, taking the maximum of the pre-training word vector and the combined word vector in each dimension as the final word vector x̃_i, according to the following formula:

x̃_i^(k) = max(x_i^(k), x̂_i^(k)),  k = 1, …, d

where x_i^(k) is the k-th dimension of the pre-training word vector of the i-th word, x̂_i^(k) is the k-th dimension of its combined word vector, d is the dimensionality of the word vectors, and max(·) takes the maximum value.
6. The method according to claim 1, wherein integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed specifically comprises:
integrating the final word vectors into the representation vector of the sentence to be processed through a sentence combination function.
7. The method of claim 6, wherein the sentence combination function comprises at least one of an Average model function, a Matrix model function, a Dan model function, an RNN model function, and an LSTM model function.
8. The method for constructing sentence representations fusing the internal structure information of Chinese words according to any of claims 1-7, wherein the training corpus is a Chinese text corpus crawled from Baidu Encyclopedia.
CN201710449875.3A 2017-06-14 2017-06-14 Method and system for constructing sentence representation fusing internal structure information of Chinese words Active CN107423284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710449875.3A CN107423284B (en) 2017-06-14 2017-06-14 Method and system for constructing sentence representation fusing internal structure information of Chinese words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710449875.3A CN107423284B (en) 2017-06-14 2017-06-14 Method and system for constructing sentence representation fusing internal structure information of Chinese words

Publications (2)

Publication Number Publication Date
CN107423284A CN107423284A (en) 2017-12-01
CN107423284B true CN107423284B (en) 2020-03-06

Family

ID=60428673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710449875.3A Active CN107423284B (en) 2017-06-14 2017-06-14 Method and system for constructing sentence representation fusing internal structure information of Chinese words

Country Status (1)

Country Link
CN (1) CN107423284B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595416A (en) * 2018-03-27 2018-09-28 义语智能科技(上海)有限公司 Character string processing method and equipment
CN108717406B (en) * 2018-05-10 2021-08-24 平安科技(深圳)有限公司 Text emotion analysis method and device and storage medium
CN108986797B (en) * 2018-08-06 2021-07-06 中国科学技术大学 Voice theme recognition method and system
CN111382249B (en) * 2018-12-29 2023-10-10 深圳市优必选科技有限公司 Chat corpus cleaning method and device, computer equipment and storage medium
CN111538817B (en) * 2019-01-18 2024-06-18 北京京东尚科信息技术有限公司 Man-machine interaction method and device
CN109992788B (en) * 2019-04-10 2023-08-29 鼎富智能科技有限公司 Deep text matching method and device based on unregistered word processing
CN110263323B (en) * 2019-05-08 2020-08-28 清华大学 Keyword extraction method and system based on barrier type long-time memory neural network
CN110245353B (en) * 2019-06-20 2022-10-28 腾讯科技(深圳)有限公司 Natural language expression method, device, equipment and storage medium
CN112825109B (en) * 2019-11-20 2024-02-23 南京贝湾信息科技有限公司 Sentence alignment method and computing device
CN112906370B (en) * 2019-12-04 2022-12-20 马上消费金融股份有限公司 Intention recognition model training method, intention recognition method and related device
CN111259148B (en) * 2020-01-19 2024-03-26 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111581335B (en) * 2020-05-14 2023-11-24 腾讯科技(深圳)有限公司 Text representation method and device
CN111507099A (en) * 2020-06-19 2020-08-07 平安科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN111832301A (en) * 2020-07-28 2020-10-27 电子科技大学 Chinese word vector generation method based on adaptive component n-tuple
CN112733520B (en) * 2020-12-30 2023-07-18 望海康信(北京)科技股份公司 Text similarity calculation method, system, corresponding equipment and storage medium
CN112765325A (en) * 2021-01-27 2021-05-07 语联网(武汉)信息技术有限公司 Vertical field corpus data screening method and system
CN113158624B (en) * 2021-04-09 2023-12-08 中国人民解放军国防科技大学 Method and system for fine tuning pre-training language model by fusing language information in event extraction
CN113379032A (en) * 2021-06-08 2021-09-10 全球能源互联网研究院有限公司 Layered bidirectional LSTM sequence model training method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653671A (en) * 2015-12-29 2016-06-08 畅捷通信息技术股份有限公司 Similar information recommendation method and system
CN106227721A (en) * 2016-08-08 2016-12-14 中国科学院自动化研究所 Chinese Prosodic Hierarchy prognoses system
CN106383816A (en) * 2016-09-26 2017-02-08 大连民族大学 Chinese minority region name identification method based on deep learning
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning


Also Published As

Publication number Publication date
CN107423284A (en) 2017-12-01

Similar Documents

Publication Publication Date Title
CN107423284B (en) Method and system for constructing sentence representation fusing internal structure information of Chinese words
CN110287494A (en) A method of the short text Similarity matching based on deep learning BERT algorithm
Palangi et al. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval
Chen et al. Research on text sentiment analysis based on CNNs and SVM
CN110134954B (en) Named entity recognition method based on Attention mechanism
Li et al. A method of emotional analysis of movie based on convolution neural network and bi-directional LSTM RNN
CN109086269B (en) Semantic bilingual recognition method based on semantic resource word representation and collocation relationship
Cai et al. Intelligent question answering in restricted domains using deep learning and question pair matching
CN110489554B (en) Attribute-level emotion classification method based on location-aware mutual attention network model
Diao et al. A multi-dimension question answering network for sarcasm detection
Hu et al. Considering optimization of English grammar error correction based on neural network
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN110297986A (en) A kind of Sentiment orientation analysis method of hot microblog topic
CN111159405B (en) Irony detection method based on background knowledge
Wu et al. An effective approach of named entity recognition for cyber threat intelligence
CN115357719A (en) Power audit text classification method and device based on improved BERT model
El Desouki et al. Exploring the recent trends of paraphrase detection
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
Wu et al. Machine translation of English speech: Comparison of multiple algorithms
Zheng et al. A novel hierarchical convolutional neural network for question answering over paragraphs
CN114970557A (en) Knowledge enhancement-based cross-language structured emotion analysis method
Ji et al. Research on semantic similarity calculation methods in Chinese financial intelligent customer service
CN115878752A (en) Text emotion analysis method, device, equipment, medium and program product
Xu et al. Research on multi-feature fusion entity relation extraction based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant