CN109902310A - Vocabulary detection method, vocabulary detection system and computer readable storage medium - Google Patents
Vocabulary detection method, vocabulary detection system and computer readable storage medium
- Publication number
- CN109902310A (application number CN201910035746.9A)
- Authority
- CN
- China
- Prior art keywords
- training data
- degree
- vocabulary
- correlation
- vector
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention proposes a vocabulary detection method, a vocabulary detection system and a computer-readable storage medium. The vocabulary detection method includes: obtaining training data; inputting the training data into a composite network model to obtain a context vector and degree-of-correlation information of the training data; and determining a target vocabulary in the training data according to the context vector and the degree-of-correlation information; wherein the composite network model consists of a bidirectional long short-term memory (Bi-LSTM) network and a bidirectional attention (Bi-attention) network. The proposed method uses the Bi-LSTM network as the network for extracting features and the Bi-attention network as the core network for generating the degree of correlation, so that new words with a very high "degree of correlation" are obtained, which ensures that the desired new words are discovered accurately.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a vocabulary detection method, a vocabulary detection system and a computer-readable storage medium.
Background art
In the various fields of text processing, new word discovery means finding new words. In the related art, the features of the characters in a text are used, and new words are found through the similarity of character vectors (feature vectors). This kind of new word discovery has a pitfall: it is difficult to find completely new words. What it finds is mostly either a word with extra characters (for example, "headache" is already in the dictionary and the discovered "new word" is "bad headache") or a misspelled word (Amoxicillin and Amdinocillin). Because the above technique uses a character-vector model, and character vectors only consider the correlation between characters in different words, the discovered words are basically similar to the original words. The "new words" that are really wanted cannot be found, and the adaptability is low.
Summary of the invention
The present invention aims to solve at least one of the technical problems existing in the prior art.
To this end, a first aspect of the present invention proposes a vocabulary detection method.
A second aspect of the present invention proposes a vocabulary detection system.
A third aspect of the present invention proposes a computer-readable storage medium.
The first aspect of the present invention proposes a vocabulary detection method, comprising: obtaining training data; inputting the training data into a composite network model to obtain a context vector and degree-of-correlation information of the training data; and determining a target vocabulary in the training data according to the context vector and the degree-of-correlation information; wherein the composite network model consists of a bidirectional long short-term memory (Bi-LSTM) network and a bidirectional attention (Bi-attention) network.
The vocabulary detection method proposed by the first aspect of the present invention chooses the bidirectional long short-term memory (Bi-LSTM) network as the network for extracting features and the bidirectional attention (Bi-attention) network as the core network for generating the degree of correlation. After the training data is obtained, it is input into the composite network model consisting of these two networks. The composite network model calculates and outputs the context vector and the degree-of-correlation information of the training data, and the target vocabulary in the training data is determined from them. Specifically, new words are discovered through the context vector and the degree-of-correlation information of the training data.
The vocabulary detection method proposed by the first aspect of the present invention chooses the bidirectional long short-term memory (Bi-LSTM) network as the network for extracting features and the bidirectional attention (Bi-attention) network as the core network for generating the degree of correlation, so that new words with a very high "degree of correlation" are obtained. Specifically, the "degree of correlation" refers to whether two words can replace each other in different contexts: if a new word and an existing word can replace each other, their degree of correlation is very high. For example, in the two sentences "Today I feel hand pain" and "Today I feel shoulder pain", the similarity between the word vectors of "hand pain" and "shoulder pain" is very low, but the two words express a similar meaning within the two sentences, so the degree of correlation between "hand pain" and "shoulder pain" is very high. The present invention finds the new words in a sentence on the basis of the degree of correlation, thereby ensuring that the desired new words are discovered accurately.
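The idea that two words are highly correlated when they can replace each other in the same context can be illustrated with a small sketch. It is not the composite network model of the invention: the function names `context_sets` and `substitutability` are hypothetical, and the score used here (the Jaccard overlap of the contexts in which each word occurs) is only an assumed stand-in for the learned degree of correlation.

```python
from collections import defaultdict


def context_sets(sentences, candidates):
    """Map each candidate word to the set of (left, right) contexts it occurs in."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in candidates:
                contexts[tok].add((tuple(tokens[:i]), tuple(tokens[i + 1:])))
    return contexts


def substitutability(word_a, word_b, contexts):
    """Jaccard overlap of two words' context sets (an assumed toy score)."""
    a = contexts.get(word_a, set())
    b = contexts.get(word_b, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy, pre-tokenized sentences based on the example in the text.
sentences = [
    ["today", "I", "feel", "hand pain"],
    ["today", "I", "feel", "shoulder pain"],
    ["I", "bought", "an", "umbrella"],
]
ctx = context_sets(sentences, {"hand pain", "shoulder pain", "umbrella"})
pain_score = substitutability("hand pain", "shoulder pain", ctx)
other_score = substitutability("hand pain", "umbrella", ctx)
```

In this toy data "hand pain" and "shoulder pain" share their only context, so they score as fully substitutable, while "umbrella" shares no context with either.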
The above vocabulary detection method according to the present invention may also have the following additional technical features:
In the above technical solution, preferably, the step of inputting the training data into the composite network model to obtain the context vector and the degree-of-correlation information of the training data specifically includes: extracting first context information from the training data in the process of translating the training data; and determining a first context vector and first degree-of-correlation information of the training data according to the first context information.
In this technical solution, after the training data is input into the composite network model, Chinese-English translation is performed on it. Specifically, whether the input is an English sentence or a Chinese sentence, it is translated so as to obtain the corresponding English and Chinese sentences. During translation, the context information of the training data is extracted to obtain the first context information; the first context vector and the first degree-of-correlation information of the training data are then determined according to the first context information.
In any of the above technical solutions, preferably, the step of inputting the training data into the composite network model to obtain the context vector and the degree-of-correlation information of the training data specifically includes: extracting second context information of the training data in the process of matching the training data with labeled data; determining a target vector of the training data according to the second context information; comparing the target vector with a label vector of the labeled data and recording the comparison result; and determining a second context vector and second degree-of-correlation information according to the comparison result.
In this technical solution, after the context vectors and the degree-of-correlation information of the training data have been obtained, the first context vector and the first degree-of-correlation information obtained during machine translation and the second context vector and the second degree-of-correlation information obtained during sentence matching are considered together, and the target vocabulary, i.e. the new word in the training data, is determined according to these context vectors and this degree-of-correlation information.
Specifically, before the target vocabulary in the training data is determined according to the first calculation result and the second calculation result, the two calculation results may be analyzed and the words with a higher degree of correlation may be marked, i.e. "highlighted", by increasing their weights in the attention matrix, so that related words in the two sentences receive more attention and the training effect is good. Finally, the bidirectional attention network outputs two related words: one is a word already in the dictionary, and the other is the new word to be discovered.
In any of the above technical solutions, preferably, the step of obtaining the training data specifically includes: obtaining corpus data; and preprocessing the corpus data to obtain the training data.
In this technical solution, machine translation semantic-similarity data is selected first. The corpus data of the different languages is then matched item by item, garbled and disorderly items are removed, unwanted corpus data is cleaned out, and the correctness of the labels is checked. Sentence (paragraph) matching is performed on the same-language degree-of-correlation training data to obtain the required training data.
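The cleaning step described above might look roughly like the following sketch. The text does not specify how garbled items are detected, so the heuristic in `is_garbled` (Unicode replacement character or stray control characters) and the function name `clean_parallel_corpus` are assumptions made purely for illustration.

```python
import unicodedata


def is_garbled(text):
    """Assumed heuristic: replacement characters or control characters mean garbling."""
    return any(
        ch == "\ufffd" or (unicodedata.category(ch) == "Cc" and ch not in "\t\n")
        for ch in text
    )


def clean_parallel_corpus(pairs):
    """Keep aligned (source, target) pairs that are non-empty, readable and unique."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is missing, so the pair cannot be matched
        if is_garbled(src) or is_garbled(tgt):
            continue  # drop garbled items
        if (src, tgt) in seen:
            continue  # drop verbatim duplicates
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


# Illustrative data only.
raw_pairs = [
    ("hello", "你好"),
    ("bro\ufffdken", "你好"),
    ("", "你好"),
    ("hello", "你好"),
]
training_pairs = clean_parallel_corpus(raw_pairs)
```

Checking label correctness, which the text also mentions, depends on the annotation scheme and is left out of this sketch.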
The second aspect of the present invention proposes a vocabulary detection system, comprising: a memory for storing a computer program; and a processor for executing the computer program to: obtain training data; input the training data into a composite network model to obtain a context vector and degree-of-correlation information of the training data; and determine a target vocabulary in the training data according to the context vector and the degree-of-correlation information; wherein the composite network model consists of a bidirectional long short-term memory (Bi-LSTM) network and a bidirectional attention (Bi-attention) network.
The vocabulary detection system proposed by the second aspect of the present invention includes a memory and a processor that cooperate with each other. The memory stores a computer program, and the processor executes the computer program so as to choose the bidirectional long short-term memory (Bi-LSTM) network as the network for extracting features and the bidirectional attention (Bi-attention) network as the core network for generating the degree of correlation. After the training data is obtained, it is input into the composite network model consisting of these two networks. The composite network model calculates and outputs the context vector and the degree-of-correlation information of the training data, and the target vocabulary in the training data is determined from them. Specifically, new words are discovered through the context vector and the degree-of-correlation information of the training data.
The vocabulary detection system proposed by the second aspect of the present invention chooses the bidirectional long short-term memory (Bi-LSTM) network as the network for extracting features and the bidirectional attention (Bi-attention) network as the core network for generating the degree of correlation, so that new words with a very high "degree of correlation" are obtained. Specifically, the "degree of correlation" refers to whether two words can replace each other in different contexts: if a new word and an existing word can replace each other, their degree of correlation is very high. For example, in the two sentences "Today I feel hand pain" and "Today I feel shoulder pain", the similarity between the word vectors of "hand pain" and "shoulder pain" is very low, but the two words express a similar meaning within the two sentences, so the degree of correlation between "hand pain" and "shoulder pain" is very high. The present invention finds the new words in a sentence on the basis of the degree of correlation, thereby ensuring that the desired new words are discovered accurately.
The above vocabulary detection system according to the present invention may also have the following additional technical features:
In the above technical solution, preferably, the processor is specifically configured to: extract first context information from the training data in the process of translating the training data; determine a first context vector and first degree-of-correlation information of the training data according to the first context information; extract second context information of the training data in the process of matching the training data with labeled data; determine a target vector of the training data according to the second context information; compare the target vector with a label vector of the labeled data and record the comparison result; and determine a second context vector and second degree-of-correlation information according to the comparison result.
In this technical solution, after the processor inputs the training data into the composite network model, Chinese-English translation is performed on it. Specifically, whether the input is an English sentence or a Chinese sentence, it is translated so as to obtain the corresponding English and Chinese sentences. During translation, the context information of the training data is extracted to obtain the first context information; the first context vector and the first degree-of-correlation information of the training data are then determined according to the first context information. Meanwhile, after the training data is input into the composite network model, the training data is matched. In the process of matching the training data with the labeled data, the second context information of the training data is extracted, specifically the context information of the Chinese sentences, and the target vector of the training data is determined according to the second context information. The target vector is then compared with the label vector of the labeled data and the result is recorded. Finally, the second context vector and the second degree-of-correlation information are determined according to the comparison result.
In any of the above technical solutions, preferably, the processor is specifically configured to: input the first context vector and the first degree-of-correlation information into the bidirectional attention network to obtain a first calculation result; input the second context vector and the second degree-of-correlation information into the bidirectional attention network to obtain a second calculation result; and obtain the target vocabulary in the training data according to the first calculation result and the second calculation result.
In this technical solution, after the context vectors and the degree-of-correlation information of the training data have been obtained, the first context vector and the first degree-of-correlation information obtained during machine translation and the second context vector and the second degree-of-correlation information obtained during sentence matching are considered together, and the target vocabulary, i.e. the new word in the training data, is determined according to these context vectors and this degree-of-correlation information.
Specifically, before the target vocabulary in the training data is determined according to the first calculation result and the second calculation result, the two calculation results may be analyzed and the words with a higher degree of correlation may be marked, i.e. "highlighted", by increasing their weights in the attention matrix, so that related words in the two sentences receive more attention and the training effect is good. Finally, the bidirectional attention network outputs two related words: one is a word already in the dictionary, and the other is the new word to be discovered.
In any of the above technical solutions, preferably, the processor is specifically configured to: obtain corpus data; and preprocess the corpus data to obtain the training data.
In this technical solution, machine translation semantic-similarity data is selected first. The corpus data of the different languages is then matched item by item, garbled and disorderly items are removed, unwanted corpus data is cleaned out, and the correctness of the labels is checked. Sentence (paragraph) matching is performed on the same-language degree-of-correlation training data to obtain the required training data.
The third aspect of the present invention proposes a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the vocabulary detection method of any one of the technical solutions of the first aspect of the present invention is implemented.
Because the computer program stored on the computer-readable storage medium proposed by the third aspect of the present invention implements, when executed by a processor, the vocabulary detection method of any one of the technical solutions of the first aspect, the storage medium has all the beneficial effects of the above vocabulary detection method, which are not repeated here.
Additional aspects and advantages of the present invention will become apparent from the following description, or will be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a vocabulary detection method according to one embodiment of the present invention;
Fig. 2 is a flowchart of a vocabulary detection method according to a specific embodiment of the present invention;
Fig. 3 is a structural block diagram of a vocabulary detection system according to one embodiment of the present invention.
Detailed description of the embodiments
In order that the above objects, features and advantages of the present invention may be understood more clearly, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a full understanding of the present invention. However, the present invention may also be implemented in ways other than those described here, and the scope of protection of the present invention is therefore not limited by the specific embodiments disclosed below.
The vocabulary detection method, vocabulary detection system and computer-readable storage medium proposed according to some embodiments of the present invention are described below with reference to Fig. 1 to Fig. 3.
Fig. 1 is a flowchart of a vocabulary detection method according to one embodiment of the present invention.
As shown in Fig. 1, the vocabulary detection method includes:
S102, obtaining training data;
S104, inputting the training data into a composite network model to obtain a context vector and degree-of-correlation information of the training data;
S106, determining a target vocabulary in the training data according to the context vector and the degree-of-correlation information.
The vocabulary detection method proposed by the first aspect of the present invention chooses the bidirectional long short-term memory (Bi-LSTM) network as the network for extracting features and the bidirectional attention (Bi-attention) network as the core network for generating the degree of correlation. After the training data is obtained, it is input into the composite network model consisting of these two networks. The composite network model calculates and outputs the context vector and the degree-of-correlation information of the training data, and the target vocabulary in the training data is determined from them. Specifically, new words are discovered through the context vector and the degree-of-correlation information of the training data.
In this way, new words with a very high "degree of correlation" are obtained. Specifically, the "degree of correlation" refers to whether two words can replace each other in different contexts: if a new word and an existing word can replace each other, their degree of correlation is very high. For example, in the two sentences "Today I feel hand pain" and "Today I feel shoulder pain", the similarity between the word vectors of "hand pain" and "shoulder pain" is very low, but the two words express a similar meaning within the two sentences, so the degree of correlation between "hand pain" and "shoulder pain" is very high. The present invention finds the new words in a sentence on the basis of the degree of correlation, thereby ensuring that the desired new words are discovered accurately.
In one embodiment of the present invention, preferably, the step of inputting the training data into the composite network model to obtain the context vector and the degree-of-correlation information of the training data specifically includes: extracting first context information from the training data in the process of translating the training data; and determining a first context vector and first degree-of-correlation information of the training data according to the first context information.
In this embodiment, after the training data is input into the composite network model, Chinese-English translation is performed on it. Specifically, whether the input is an English sentence or a Chinese sentence, it is translated so as to obtain the corresponding English and Chinese sentences. During translation, the context information of the training data is extracted to obtain the first context information; the first context vector and the first degree-of-correlation information of the training data are then determined according to the first context information.
In one embodiment of the present invention, preferably, the step of inputting the training data into the composite network model to obtain the context vector and the degree-of-correlation information of the training data specifically includes: extracting second context information of the training data in the process of matching the training data with labeled data; determining a target vector of the training data according to the second context information; comparing the target vector with a label vector of the labeled data and recording the comparison result; and determining a second context vector and second degree-of-correlation information according to the comparison result.
In this embodiment, after the context vectors and the degree-of-correlation information of the training data have been obtained, the first context vector and the first degree-of-correlation information obtained during machine translation and the second context vector and the second degree-of-correlation information obtained during sentence matching are considered together, and the target vocabulary, i.e. the new word in the training data, is determined according to these context vectors and this degree-of-correlation information.
Specifically, before the target vocabulary in the training data is determined according to the first calculation result and the second calculation result, the two calculation results may be analyzed and the words with a higher degree of correlation may be marked, i.e. "highlighted", by increasing their weights in the attention matrix, so that related words in the two sentences receive more attention and the training effect is good. Finally, the bidirectional attention network outputs two related words: one is a word already in the dictionary, and the other is the new word to be discovered.
In one embodiment of the present invention, preferably, the step of obtaining the training data specifically includes: obtaining corpus data; and preprocessing the corpus data to obtain the training data.
In this embodiment, machine translation semantic-similarity data is selected first. The corpus data of the different languages is then matched item by item, garbled and disorderly items are removed, unwanted corpus data is cleaned out, and the correctness of the labels is checked. Sentence (paragraph) matching is performed on the same-language degree-of-correlation training data to obtain the required training data.
Fig. 2 is a flowchart of a vocabulary detection method according to a specific embodiment of the present invention.
As shown in Fig. 2, the vocabulary detection method includes:
S202, obtaining corpus data;
S204, preprocessing the corpus data to obtain training data;
S206, extracting first context information from the training data in the process of translating the training data;
S208, determining a first context vector and first degree-of-correlation information of the training data according to the first context information;
S210, extracting second context information of the training data in the process of matching the training data with labeled data;
S212, determining a target vector of the training data according to the second context information;
S214, comparing the target vector with a label vector of the labeled data and recording the comparison result;
S216, determining a second context vector and second degree-of-correlation information of the training data according to the comparison result;
S218, inputting the first context vector and the first degree-of-correlation information into the bidirectional attention network to obtain a first calculation result;
S220, inputting the second context vector and the second degree-of-correlation information into the bidirectional attention network to obtain a second calculation result;
S222, determining the target vocabulary in the training data according to the first calculation result and the second calculation result.
The vocabulary detection method provided by this specific embodiment is broadly divided into the following processes:
Bi-LSTM (bidirectional long short-term memory network) is chosen as the network for extracting features, and Bi-attention (bidirectional attention network) is chosen as the core network for generating the degree of correlation.
Machine translation semantic-similarity data is selected, the data of the different languages is matched item by item, garbled and disorderly items are removed, unwanted language data is cleaned out, and the correctness of the labels is checked. Sentence (paragraph) matching is performed on the same-language degree-of-correlation training data to obtain the required training data.
Machine translation is performed on the Chinese-English data with a seq2seq (encoder-decoder) model. Only the context vector is needed to describe the degree of correlation (degree of overlap). During machine translation, the context information in the corpus is extracted to obtain the context vector and the degree-of-correlation information, and a first round of information and feature extraction is carried out with the bidirectional attention model.
In sentence matching, the focus is mainly on the context information between Chinese sentences. Although the information extraction here does not go as deep as it does in the neural network model used for machine translation, feature vectors can still be extracted from the mutual context information. Sentence matching is performed according to labeled data that has been compared one to one: identical sentences are labeled 1 and different sentences are labeled -1, and the degree of correlation of the words inside the sentences is found. The results of machine translation and of sentence matching are input into the next step in parallel.
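The labeling convention for sentence matching (1 for identical sentences, -1 for different ones) can be sketched as follows. The function name `label_sentence_pairs` and the representation of the annotations as a collection of known matching pairs are assumptions; the text does not describe the data format.

```python
def label_sentence_pairs(candidate_pairs, matching_pairs):
    """Attach 1 to pairs annotated as matching and -1 to all other pairs."""
    # Order inside a pair does not matter, so compare unordered pairs.
    matching = {frozenset(pair) for pair in matching_pairs}
    return [
        (a, b, 1 if frozenset((a, b)) in matching else -1)
        for a, b in candidate_pairs
    ]


# Illustrative annotations only.
annotated_matches = [("today I feel hand pain", "my hand hurts today")]
candidates = [
    ("my hand hurts today", "today I feel hand pain"),
    ("today I feel hand pain", "I bought an umbrella"),
]
labeled = label_sentence_pairs(candidates, annotated_matches)
```

Labels of this form could then serve as the label vectors against which the target vectors of the training data are compared.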
The neural network is then described by the following formula:
CoVe(w) = MT-LSTM(GloVe(w))
where GloVe(w) denotes the vector to which the word w is mapped by the mapping layer of GloVe (a word vector model). This vector is used as the input of the Encoder in the machine translation model, and the output of the Encoder is the context vector CoVe. In other words, the context vector CoVe can be obtained directly from the machine translation model.
After the corpus has been compressed once and its features extracted, the data is passed to the bidirectional attention network for new-word-discovery training. This network model can process sentence pairs as well as single sentences; when a single sentence is processed, the sentence is duplicated and then processed as a sentence pair. The model "highlights" words with a high degree of correlation by increasing their weights in the attention matrix, so that related words in the two sentences receive more attention and the training effect is good. Finally, the bidirectional attention network outputs two related words: one is a word already in the dictionary, and the other is the new word to be discovered.
In addition, the machine translation model and the attention model may also be improved so that the new word discovery results are better, more new words can be found, and their degree of correlation is higher.
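The "highlighting" of related words in an attention matrix can be illustrated with a minimal sketch. The text does not give the attention formula, so the dot-product similarity with softmax normalization in both directions is an assumption, the toy vectors are made up, and the names `bi_attention` and `highlight_new_word` are hypothetical.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bi_attention(vecs_a, vecs_b):
    """Return attention weights in both directions (A over B, and B over A)."""
    sim = [[dot(u, v) for v in vecs_b] for u in vecs_a]
    a2b = [softmax(row) for row in sim]
    b2a = [softmax([sim[i][j] for i in range(len(vecs_a))])
           for j in range(len(vecs_b))]
    return a2b, b2a


def highlight_new_word(tokens_a, vecs_a, tokens_b, vecs_b, dictionary):
    """Pick the most strongly attended pair made of one known and one unknown word."""
    a2b, b2a = bi_attention(vecs_a, vecs_b)
    best = None
    for i, wa in enumerate(tokens_a):
        for j, wb in enumerate(tokens_b):
            if (wa in dictionary) == (wb in dictionary):
                continue  # exactly one of the two words must be in the dictionary
            score = a2b[i][j] * b2a[j][i]
            if best is None or score > best[0]:
                known, new = (wa, wb) if wa in dictionary else (wb, wa)
                best = (score, known, new)
    return best


# Toy context vectors (assumed): the two "pain" words share a context vector.
tokens_a = ["today", "I", "feel", "hand pain"]
tokens_b = ["today", "I", "feel", "shoulder pain"]
vecs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
known_words = {"today", "I", "feel", "hand pain"}
result = highlight_new_word(tokens_a, vecs, tokens_b, vecs, known_words)
```

Multiplying the two directional weights is one simple way to favor pairs that attend to each other mutually; a trained network would learn these weights rather than compute them from fixed vectors.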
Meanwhile also there is other methods pre-training context-sensitive vector.Pre-training method is also with contextual information pair
Text vector is described, and then carries out new word discovery with LSTM (length memory network) method.It is identical as this programme, it is all benefit
Corpus of text is modeled with contextual information, although the model method wherein used is entirely different, principle is essentially identical,
Therefore also in protection of the invention.
The second aspect of the present invention proposes a vocabulary detection system 300 which, as shown in Fig. 3, comprises: a memory 302 for storing a computer program; and a processor 304 for executing the computer program to: obtain training data; input the training data into a composite network model to obtain a context vector and degree-of-correlation information of the training data; and determine a target vocabulary in the training data according to the context vector and the degree-of-correlation information; wherein the composite network model consists of a bidirectional long short-term memory (Bi-LSTM) network and a bidirectional attention (Bi-attention) network.
The vocabulary detection system 300 proposed by the second aspect of the present invention includes a memory 302 and a processor 304 that cooperate with each other. The memory 302 stores a computer program, and the processor 304 executes the computer program so as to choose the bidirectional long short-term memory (Bi-LSTM) network as the network for extracting features and the bidirectional attention (Bi-attention) network as the core network for generating the degree of correlation. After the training data is obtained, it is input into the composite network model consisting of these two networks. The composite network model calculates and outputs the context vector and the degree-of-correlation information of the training data, and the target vocabulary in the training data is determined from them. Specifically, new words are discovered through the context vector and the degree-of-correlation information of the training data.
The vocabulary detection system 300 proposed in the second aspect of the present invention selects a two-way
long short-term memory network as the network for extracting features and a two-way attention network as the
core network for generating the degree of correlation, and thereby obtains new words with a very high "degree of
correlation". Specifically, the "degree of correlation" refers to whether two words can be substituted for each
other in different contexts; if they can, the two words have a very high degree of correlation. For example,
consider the two sentences "today I feel hand pain" and "today I feel shoulder pain". The similarity between the
word vectors of "hand pain" and "shoulder pain" is very low, but the meanings expressed in the two sentences are
similar, so the degree of correlation between "hand pain" and "shoulder pain" is very high. The present invention
finds the new words in a sentence based on the degree of correlation, which ensures accurate discovery of the
new words that are needed.
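The idea that two words are highly correlated when they can replace each other in the same context can be illustrated with a small counting sketch. This is a stand-in for intuition only: the patent learns the degree of correlation with neural networks, whereas the hypothetical functions `context_frames` and `substitution_score` below simply compare the sentence frames in which each word occurs.

```python
from collections import defaultdict


def context_frames(sentences, candidates):
    """Map each candidate word to the set of sentence frames it appears in.

    A frame is the tokenized sentence with the candidate replaced by a slot.
    """
    frames = defaultdict(set)
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token in candidates:
                frame = tuple(tokens[:i]) + ("_",) + tuple(tokens[i + 1:])
                frames[token].add(frame)
    return frames


def substitution_score(frames, word_a, word_b):
    """Jaccard overlap of the frames of two words (0.0 when either is unseen)."""
    frames_a = frames.get(word_a, set())
    frames_b = frames.get(word_b, set())
    if not frames_a or not frames_b:
        return 0.0
    return len(frames_a & frames_b) / len(frames_a | frames_b)


# Toy corpus built from the example sentences in the text above.
sentences = [
    ["today", "I", "feel", "hand_pain"],
    ["today", "I", "feel", "shoulder_pain"],
    ["the", "weather", "is", "sunny"],
]
frames = context_frames(sentences, {"hand_pain", "shoulder_pain", "sunny"})
score_pain = substitution_score(frames, "hand_pain", "shoulder_pain")
score_other = substitution_score(frames, "hand_pain", "sunny")
```

In this toy corpus the two pain words share their only frame, while "hand_pain" and "sunny" share none, so the first pair scores higher than the second.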
In one embodiment of the present invention, preferably, the processor 304 is specifically configured to: extract
first contextual information from the training data while translating the training data; determine a first
context vector and first degree of correlation information of the training data according to the first
contextual information; extract second contextual information of the training data while matching the training
data against labeled data; determine an object vector of the training data according to the second contextual
information; compare the object vector with the label vector of the labeled data and record the comparison
result; and determine a second context vector and second degree of correlation information according to the
comparison result.
In this embodiment, after the processor inputs the training data into the composite network model, Chinese-English
translation is performed on it. Specifically, whether the input is an English sentence or a Chinese sentence, it
is translated to obtain the corresponding Chinese or English sentence. During translation, the contextual
information of the training data is extracted to obtain the first contextual information, and the first context
vector and first degree of correlation information of the training data are then determined according to the
first contextual information. Meanwhile, after the training data is input into the composite network model, the
training data is also matched against labeled data. During matching, the second contextual information of the
training data is extracted (specifically, the contextual information of the Chinese sentence), and the object
vector of the training data is determined according to the second contextual information. The object vector is
then compared with the label vector of the labeled data, and the comparison is recorded. Finally, the second
context vector and second degree of correlation information are determined according to the comparison result.
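The comparison between object vectors and label vectors can be sketched as follows. The text does not say which comparison measure is used, so cosine similarity and the `threshold` parameter are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def compare_with_labels(object_vectors, label_vectors, threshold=0.8):
    """Compare each object vector with its label vector and record the result.

    The threshold is an assumed cut-off, not a value taken from the patent.
    """
    records = []
    for index, (obj, label) in enumerate(zip(object_vectors, label_vectors)):
        score = cosine_similarity(obj, label)
        records.append({"index": index, "score": score, "match": score >= threshold})
    return records
```

The recorded comparison results could then feed whatever step derives the second context vector and second degree of correlation information; that step is not specified in enough detail to sketch here.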
In one embodiment of the present invention, preferably, the processor 304 is specifically configured to: input
the first context vector and the first degree of correlation information into the two-way attention network to
obtain a first calculation result; input the second context vector and the second degree of correlation
information into the two-way attention network to obtain a second calculation result; and obtain the target
vocabulary in the training data according to the first calculation result and the second calculation result.
In this embodiment, after the context vectors and degree of correlation information of the training data are
obtained, the first context vector and first degree of correlation information obtained during machine
translation and the second context vector and second degree of correlation information obtained during sentence
matching are considered together, and the target vocabulary, that is, the new words in the training data, is
determined from these context vectors and this degree of correlation information.
Specifically, before the target vocabulary in the training data is determined according to the first calculation
result and the second calculation result, the two results may be analyzed. Words with a higher degree of
correlation are marked and "highlighted", that is, their weights in the attention matrix are increased, so that
related words in the two sentences receive more attention, which benefits training. Finally, the two-way
attention network outputs two related words: one is a word that already exists in the dictionary, and the other
is the new word to be discovered.
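The final step, in which attention between two sentences yields one dictionary word and one new word, can be sketched as below. This is a simplified illustration, not the patented network: dot-product similarity, softmax attention in both directions, and the product of the two attention weights as the pair score are all assumptions, and `two_way_attention` and `find_new_word` are hypothetical names. The input vectors stand in for context vectors, so words used in similar contexts are given similar vectors.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def two_way_attention(vecs_a, vecs_b):
    """Attention from sentence A to B (row-wise) and from B to A (column-wise)."""
    sim = [[sum(x * y for x, y in zip(a, b)) for b in vecs_b] for a in vecs_a]
    a_to_b = [softmax(row) for row in sim]
    b_to_a = [
        softmax([sim[i][j] for i in range(len(vecs_a))])
        for j in range(len(vecs_b))
    ]
    return a_to_b, b_to_a


def find_new_word(tokens_a, tokens_b, vecs_a, vecs_b, dictionary):
    """Return (known_word, new_word, score) for the most related mixed pair.

    Only pairs in which exactly one word is already in the dictionary are
    considered, so the result pairs an existing word with a candidate new word.
    """
    a_to_b, b_to_a = two_way_attention(vecs_a, vecs_b)
    best = None
    for i, word_a in enumerate(tokens_a):
        for j, word_b in enumerate(tokens_b):
            if (word_a in dictionary) == (word_b in dictionary):
                continue
            score = a_to_b[i][j] * b_to_a[j][i]
            if best is None or score > best[2]:
                if word_a in dictionary:
                    best = (word_a, word_b, score)
                else:
                    best = (word_b, word_a, score)
    return best


# Toy context vectors (illustrative values, not learned).
tokens_a = ["I", "feel", "hand_pain"]
tokens_b = ["I", "feel", "shoulder_pain"]
vecs_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
vecs_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]]
result = find_new_word(tokens_a, tokens_b, vecs_a, vecs_b, {"I", "feel", "hand_pain"})
```

The "highlighting" described in the text would correspond to raising selected entries of the attention matrices during training; it is left out of this sketch because the text does not state how the weights are increased.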
In one embodiment of the present invention, preferably, the processor 304 is specifically configured to: obtain
corpus data; and preprocess the corpus data to obtain the training data.
In this embodiment, machine translation semantic similarity data is selected first. The corpus data in different
languages is then matched item by item: garbled text and disorderly entries are removed, unneeded corpus data is
filtered out, and the correctness of the labels is checked. For same-language degree of correlation training
data, sentence (paragraph) matching is performed to obtain the required training data.
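A minimal preprocessing pass of the kind described above might look like the following sketch. The specific rules (dropping empty entries, entries containing the Unicode replacement character or control, private-use or unassigned characters, and exact duplicates) are assumptions chosen for illustration, and `is_garbled` and `preprocess_pairs` are hypothetical names.

```python
import unicodedata


def is_garbled(text):
    """Heuristic check for mojibake or stray control characters."""
    if "\ufffd" in text:  # Unicode replacement character left by bad decoding
        return True
    return any(unicodedata.category(ch) in ("Cc", "Co", "Cn") for ch in text)


def preprocess_pairs(pairs):
    """Clean a list of (source, target) sentence pairs.

    Whitespace is normalized first, then empty, garbled and duplicate pairs
    are dropped while the original order is preserved.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            continue
        if is_garbled(source) or is_garbled(target):
            continue
        if (source, target) in seen:
            continue
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned
```

Checking label correctness and matching sentences or paragraphs depends on how the corpus is annotated, so those steps are not covered by this sketch.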
A third aspect of the present invention provides a computer readable storage medium on which a computer program
is stored. When the computer program is executed by a processor, the vocabulary detection method of any
embodiment of the first aspect of the present invention is implemented.
The computer readable storage medium proposed in the third aspect of the present invention stores a computer
program that, when executed by a processor, implements the vocabulary detection method of any embodiment of the
first aspect of the present invention. It therefore has all the beneficial effects of the above vocabulary
detection method, which are not discussed again here.
In the description of the present invention, the term "a plurality of" means two or more unless expressly limited
otherwise. Orientation or positional relationships indicated by terms such as "upper" and "lower" are based on
the orientation or positional relationships shown in the drawings. They are used only to facilitate and simplify
the description of the present invention, and do not indicate or imply that the device or element referred to
must have a particular orientation or be constructed and operated in a particular orientation; they therefore
should not be construed as limiting the invention. Terms such as "connect", "install" and "fix" should be
understood broadly. For example, a "connection" may be a fixed connection, a detachable connection, or an
integral connection, and it may be direct or indirect through an intermediary. Those of ordinary skill in the
art can understand the specific meaning of these terms in the present invention according to the specific
circumstances.
In the description of this specification, terms such as "one embodiment", "some embodiments" and "specific
embodiment" mean that the particular features, structures, materials, or characteristics described in connection
with that embodiment or example are included in at least one embodiment or example of the present invention.
In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover,
the particular features, structures, materials, or characteristics described may be combined in a suitable
manner in any one or more embodiments or examples.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those
skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent
replacement, or improvement made within the spirit and principles of the present invention shall fall within
its scope of protection.
Claims (10)
1. A vocabulary detection method, characterized by comprising:
obtaining training data;
inputting the training data into a composite network model to obtain a context vector and degree of correlation
information of the training data;
determining a target vocabulary in the training data according to the context vector and the degree of
correlation information;
wherein the composite network model is composed of a two-way long short-term memory network and a two-way
attention network.
2. The vocabulary detection method according to claim 1, characterized in that the step of inputting the
training data into the composite network model to obtain the context vector and degree of correlation
information of the training data specifically comprises:
extracting first contextual information from the training data while translating the training data;
determining a first context vector and first degree of correlation information of the training data according
to the first contextual information.
3. The vocabulary detection method according to claim 2, characterized in that the step of inputting the
training data into the composite network model to obtain the context vector and degree of correlation
information of the training data specifically comprises:
extracting second contextual information of the training data while matching the training data against labeled
data;
determining an object vector of the training data according to the second contextual information;
comparing the object vector with the label vector of the labeled data, and recording the comparison result;
determining a second context vector of the training data and second degree of correlation information of the
training data according to the comparison result.
4. The vocabulary detection method according to claim 3, characterized in that the step of determining the
target vocabulary in the training data according to the context vector and the degree of correlation
information specifically comprises:
inputting the first context vector and the first degree of correlation information into the two-way attention
network to obtain a first calculation result;
inputting the second context vector and the second degree of correlation information into the two-way attention
network to obtain a second calculation result;
determining the target vocabulary in the training data according to the first calculation result and the second
calculation result.
5. The vocabulary detection method according to any one of claims 1 to 4, characterized in that the step of
obtaining training data specifically comprises:
obtaining corpus data;
preprocessing the corpus data to obtain the training data.
6. A vocabulary detection system, characterized by comprising:
a memory for storing a computer program;
a processor for executing the computer program to:
obtain training data;
input the training data into a composite network model to obtain a context vector and degree of correlation
information of the training data;
determine a target vocabulary in the training data according to the context vector and the degree of
correlation information;
wherein the composite network model is composed of a two-way long short-term memory network and a two-way
attention network.
7. The vocabulary detection system according to claim 6, characterized in that the processor is specifically
configured to:
extract first contextual information from the training data while translating the training data;
determine a first context vector and first degree of correlation information of the training data according to
the first contextual information;
extract second contextual information of the training data while matching the training data against labeled
data;
determine an object vector of the training data according to the second contextual information;
compare the object vector with the label vector of the labeled data, and record the comparison result;
determine a second context vector of the training data and second degree of correlation information of the
training data according to the comparison result.
8. The vocabulary detection system according to claim 7, characterized in that the processor is specifically
configured to:
input the first context vector and the first degree of correlation information into the two-way attention
network to obtain a first calculation result;
input the second context vector and the second degree of correlation information into the two-way attention
network to obtain a second calculation result;
obtain the target vocabulary in the training data according to the first calculation result and the second
calculation result.
9. The vocabulary detection system according to any one of claims 6 to 8, characterized in that the processor is
specifically configured to:
obtain corpus data;
preprocess the corpus data to obtain the training data.
10. A computer readable storage medium on which a computer program is stored, characterized in that the computer
program, when executed by a processor, implements the vocabulary detection method according to any one of
claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910035746.9A CN109902310A (en) | 2019-01-15 | 2019-01-15 | Vocabulary detection method, vocabulary detection system and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109902310A (en) | 2019-06-18 |
Family
ID=66943638
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910035746.9A Pending CN109902310A (en) | 2019-01-15 | 2019-01-15 | Vocabulary detection method, vocabulary detection system and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109902310A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107358948A (en) * | 2017-06-27 | 2017-11-17 | 上海交通大学 | Language input relevance detection method based on attention model |
CN107368475A (en) * | 2017-07-18 | 2017-11-21 | 中译语通科技(北京)有限公司 | A kind of machine translation method and system based on generative adversarial neural network |
CN108845990A (en) * | 2018-06-12 | 2018-11-20 | 北京慧闻科技发展有限公司 | Answer selection method, device and electronic equipment based on two-way attention mechanism |
Non-Patent Citations (1)
Title |
---|
PENG ZHOU ET AL.: "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification", 《PROCEEDINGS OF THE 54TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287300A (en) * | 2019-06-27 | 2019-09-27 | 谷晓佳 | Chinese and English relative words acquisition methods and device |
CN112836523A (en) * | 2019-11-22 | 2021-05-25 | 上海流利说信息技术有限公司 | Word translation method, device and equipment and readable storage medium |
CN112836523B (en) * | 2019-11-22 | 2022-12-30 | 上海流利说信息技术有限公司 | Word translation method, device and equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190618 |