CN109902310A - Vocabulary detection method, vocabulary detection system and computer readable storage medium

Info

Publication number: CN109902310A
Application number: CN201910035746.9A
Authority: CN
Inventors: 欧阳一村; 程源泉; 曾志辉; 贺涛
Original assignee: ZTE ICT Technologies Co Ltd
Current assignee: ZTE ICT Technologies Co Ltd
Priority date: 2019-01-15
Filing date: 2019-01-15
Publication date: 2019-06-18

Abstract

The invention provides a vocabulary detection method, a vocabulary detection system and a computer-readable storage medium. The vocabulary detection method includes: obtaining training data; inputting the training data into a composite network model to obtain a context vector and degree of correlation information of the training data; and determining the target vocabulary in the training data according to the context vector and the degree of correlation information; wherein the composite network model is composed of a bidirectional long short-term memory (Bi-LSTM) network and a bidirectional attention network. The method uses the Bi-LSTM network as the network for extracting features and the bidirectional attention network as the core network for generating the degree of correlation, so that new words with a high "degree of correlation" are obtained. This ensures that new words are discovered accurately and that the new words actually needed are obtained.

Description

Vocabulary detection method, vocabulary detection system and computer readable storage medium

Technical field

The present invention relates to the field of computer technology, and in particular to a vocabulary detection method, a vocabulary detection system and a computer-readable storage medium.

Background technique

In the various fields of text processing, new word discovery means finding new words. In the related art, new words are found by using the features of words in a text and the similarity between their word vectors (feature vectors). This kind of new word discovery has a pitfall: it has difficulty finding genuinely new words. What it finds is mostly longer phrases built on existing words (for example, the dictionary contains "headache" and the discovered new word is "bad headache") or wrongly written words (for example, Amoxicillin and Amdinocillin). Because this technique relies on a word vector model, and word vectors only consider the correlation between words across the vocabulary, the words it finds are largely similar to the original words. It cannot find the "new words" that are actually wanted, and its adaptability is low.

Summary of the invention

The present invention aims to solve at least one of the technical problems existing in the prior art.

To this end, a first aspect of the present invention provides a vocabulary detection method.

A second aspect of the present invention provides a vocabulary detection system.

A third aspect of the present invention provides a computer-readable storage medium.

The first aspect of the present invention provides a vocabulary detection method, comprising: obtaining training data; inputting the training data into a composite network model to obtain a context vector and degree of correlation information of the training data; and determining the target vocabulary in the training data according to the context vector and the degree of correlation information; wherein the composite network model is composed of a bidirectional long short-term memory (Bi-LSTM) network and a bidirectional attention network.

In the vocabulary detection method provided by the first aspect of the present invention, a bidirectional long short-term memory (Bi-LSTM) network is chosen as the network for extracting features, and a bidirectional attention network is chosen as the core network for generating the degree of correlation. After the training data is obtained, it is input into the composite network model composed of the Bi-LSTM network and the bidirectional attention network. The composite network model computes and outputs the context vector and degree of correlation information of the training data, and the target vocabulary in the training data is determined from them. Specifically, new words are discovered through the context vector and degree of correlation information of the training data.

The vocabulary detection method provided by the first aspect of the present invention uses the Bi-LSTM network as the network for extracting features and the bidirectional attention network as the core network for generating the degree of correlation, and obtains new words with a high "degree of correlation". Specifically, "degree of correlation" refers to whether words can substitute for each other in different contexts: if two words can be substituted for each other, the substituting word is treated as a new word and the two are considered highly correlated. For example, take the two sentences "Today I feel hand pain" and "Today I feel shoulder pain". The similarity between the word vectors of "hand pain" and "shoulder pain" is low, but the meanings expressed in the two sentences are similar, so the degree of correlation between "hand pain" and "shoulder pain" is high. The present invention finds the new words in a sentence based on the degree of correlation, which ensures accurate discovery of new words and yields the new words that are needed.
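The substitution idea in the example above can be shown with a toy sketch. It is only an illustration of "same context, different word" and is not the Bi-LSTM and attention model of the invention. The function name `substitution_candidates`, the token lists and the one-entry dictionary are all hypothetical.

```python
from difflib import SequenceMatcher


def substitution_candidates(tokens_a, tokens_b, dictionary):
    """Find spans that replace each other inside an otherwise shared context.

    Returns (known_word, candidate_new_word) pairs where exactly one side of
    the substitution is already in the dictionary. Toy heuristic only.
    """
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    candidates = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # only spans that substitute for each other are of interest
        span_a = " ".join(tokens_a[i1:i2])
        span_b = " ".join(tokens_b[j1:j2])
        if span_a in dictionary and span_b not in dictionary:
            candidates.append((span_a, span_b))
        elif span_b in dictionary and span_a not in dictionary:
            candidates.append((span_b, span_a))
    return candidates


# Hypothetical pre-tokenized sentences and dictionary.
sentence_a = ["today", "I", "feel", "hand pain"]
sentence_b = ["today", "I", "feel", "shoulder pain"]
known_words = {"hand pain"}
pairs = substitution_candidates(sentence_a, sentence_b, known_words)
```

Here "shoulder pain" is reported as a candidate because it fills the same slot as the known word "hand pain". The method described in this document scores such pairs with learned context vectors rather than exact token alignment.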

The above vocabulary detection method according to the present invention may further have the following additional technical features:

In the above technical solution, preferably, the step of inputting the training data into the composite network model to obtain the context vector and degree of correlation information of the training data specifically includes: in the process of translating the training data, extracting first contextual information in the training data; and determining a first context vector and first degree of correlation information of the training data according to the first contextual information.

In this technical solution, after the training data is input into the composite network model, Chinese-English translation is performed on it. Specifically, whether the input is an English sentence or a Chinese sentence, it is translated to obtain the corresponding English and Chinese sentences. During translation, the contextual information of the training data is extracted to obtain the first contextual information; the first context vector and first degree of correlation information of the training data are then determined according to the first contextual information.

In any of the above technical solutions, preferably, the step of inputting the training data into the composite network model to obtain the context vector and degree of correlation information of the training data specifically includes: in the process of matching the training data against labeled data, extracting second contextual information of the training data; determining a target vector of the training data according to the second contextual information; comparing the target vector with the label vector of the labeled data and recording the comparison result; and determining a second context vector and second degree of correlation information according to the comparison result.

In this technical solution, once the context vectors and degree of correlation information of the training data have been obtained, the first context vector and first degree of correlation information obtained during machine translation are considered together with the second context vector and second degree of correlation information obtained during sentence matching, and the target vocabulary, that is, the new words in the training data, is determined from them.

Specifically, before the target vocabulary in the training data is determined according to the first calculated result and the second calculated result, the two results may be analyzed, and vocabulary with a higher degree of correlation may be marked and "highlighted" by increasing its weight in the attention matrix, so that related vocabulary in the two sentences receives more attention and training is more effective. Finally, the bidirectional attention network outputs two related words: one is a word already in the dictionary, and the other is the new word to be discovered.

In any of the above technical solutions, preferably, the step of obtaining training data specifically includes: obtaining corpus data; and preprocessing the corpus data to obtain the training data.

In this technical solution, machine translation semantic similarity data is chosen first. The corpus data of different languages is then matched item by item; garbled text and disorganized items are removed, unneeded corpus data is cleaned out, and the labels are checked for correctness. Sentence (paragraph) matching is performed on the same-language degree-of-correlation training data to obtain the required training data.

The second aspect of the present invention provides a vocabulary detection system, comprising: a memory for storing a computer program; and a processor for executing the computer program to: obtain training data; input the training data into a composite network model to obtain a context vector and degree of correlation information of the training data; and determine the target vocabulary in the training data according to the context vector and the degree of correlation information; wherein the composite network model is composed of a bidirectional long short-term memory (Bi-LSTM) network and a bidirectional attention network.

The vocabulary detection system provided by the second aspect of the present invention includes a memory and a processor that cooperate with each other. The memory stores a computer program, and the processor executes the computer program so as to use the Bi-LSTM network as the network for extracting features and the bidirectional attention network as the core network for generating the degree of correlation. After the training data is obtained, it is input into the composite network model composed of the Bi-LSTM network and the bidirectional attention network. The composite network model computes and outputs the context vector and degree of correlation information of the training data, and the target vocabulary in the training data is determined from them. Specifically, new words are discovered through the context vector and degree of correlation information of the training data.

The vocabulary detection system provided by the second aspect of the present invention uses the Bi-LSTM network as the network for extracting features and the bidirectional attention network as the core network for generating the degree of correlation, and obtains new words with a high "degree of correlation". Specifically, "degree of correlation" refers to whether words can substitute for each other in different contexts: if two words can be substituted for each other, the substituting word is treated as a new word and the two are considered highly correlated. For example, take the two sentences "Today I feel hand pain" and "Today I feel shoulder pain". The similarity between the word vectors of "hand pain" and "shoulder pain" is low, but the meanings expressed in the two sentences are similar, so the degree of correlation between "hand pain" and "shoulder pain" is high. The present invention finds the new words in a sentence based on the degree of correlation, which ensures accurate discovery of new words and yields the new words that are needed.

The above vocabulary detection system according to the present invention may further have the following additional technical features:

In the above technical solution, preferably, the processor is specifically configured to: in the process of translating the training data, extract first contextual information in the training data; determine a first context vector and first degree of correlation information of the training data according to the first contextual information; in the process of matching the training data against labeled data, extract second contextual information of the training data; determine a target vector of the training data according to the second contextual information; compare the target vector with the label vector of the labeled data and record the comparison result; and determine a second context vector and second degree of correlation information according to the comparison result.

In this technical solution, after the processor inputs the training data into the composite network model, Chinese-English translation is performed on it. Specifically, whether the input is an English sentence or a Chinese sentence, it is translated to obtain the corresponding English and Chinese sentences. During translation, the contextual information of the training data is extracted to obtain the first contextual information, and the first context vector and first degree of correlation information of the training data are determined according to it. Meanwhile, after the training data is input into the composite network model, the training data is also matched. In the process of matching the training data against the labeled data, the second contextual information of the training data is extracted, specifically the contextual information of the Chinese sentences, and the target vector of the training data is determined according to the second contextual information. The target vector is then compared with the label vector of the labeled data and the result is recorded. Finally, the second context vector and second degree of correlation information are determined according to the comparison result.

In any of the above technical solutions, preferably, the processor is specifically configured to: input the first context vector and the first degree of correlation information into the bidirectional attention network to obtain a first calculated result; input the second context vector and the second degree of correlation information into the bidirectional attention network to obtain a second calculated result; and obtain the target vocabulary in the training data according to the first calculated result and the second calculated result.

In any of the above technical solutions, preferably, the processor is specifically configured to: obtain corpus data; and preprocess the corpus data to obtain the training data.

The third aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the vocabulary detection method of any one of the technical solutions of the first aspect of the present invention.

The computer-readable storage medium provided by the third aspect of the present invention stores a computer program which, when executed by a processor, implements the vocabulary detection method of any one of the technical solutions of the first aspect of the present invention. It therefore has all the beneficial effects of the above vocabulary detection method, which are not repeated here.

Additional aspects and advantages of the present invention will become apparent in the following description, or will be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a flow chart of the vocabulary detection method of one embodiment of the present invention;

Fig. 2 is a flow chart of the vocabulary detection method of a specific embodiment of the present invention;

Fig. 3 is a structural block diagram of the vocabulary detection system of one embodiment of the present invention.

Specific embodiment

To make the above objects, features and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.

Numerous specific details are set forth in the following description to facilitate a full understanding of the present invention. However, the present invention may also be implemented in ways other than those described here; therefore, the protection scope of the present invention is not limited by the specific embodiments disclosed below.

The vocabulary detection method, vocabulary detection system and computer-readable storage medium provided according to some embodiments of the present invention are described below with reference to Fig. 1 to Fig. 3.

Fig. 1 is the flow chart of the vocabulary detection method of one embodiment of the invention.

As shown in Figure 1, the vocabulary detection method includes:

S102: obtain training data;

S104: input the training data into the composite network model to obtain the context vector and degree of correlation information of the training data;

S106: determine the target vocabulary in the training data according to the context vector and the degree of correlation information.

In one embodiment of the present invention, preferably, the step of inputting the training data into the composite network model to obtain the context vector and degree of correlation information of the training data specifically includes: in the process of translating the training data, extracting first contextual information in the training data; and determining a first context vector and first degree of correlation information of the training data according to the first contextual information.

In this embodiment, after the training data is input into the composite network model, Chinese-English translation is performed on it. Specifically, whether the input is an English sentence or a Chinese sentence, it is translated to obtain the corresponding English and Chinese sentences. During translation, the contextual information of the training data is extracted to obtain the first contextual information; the first context vector and first degree of correlation information of the training data are then determined according to the first contextual information.

In one embodiment of the present invention, preferably, the step of inputting the training data into the composite network model to obtain the context vector and degree of correlation information of the training data specifically includes: in the process of matching the training data against labeled data, extracting second contextual information of the training data; determining a target vector of the training data according to the second contextual information; comparing the target vector with the label vector of the labeled data and recording the comparison result; and determining a second context vector and second degree of correlation information according to the comparison result.

In this embodiment, once the context vectors and degree of correlation information of the training data have been obtained, the first context vector and first degree of correlation information obtained during machine translation are considered together with the second context vector and second degree of correlation information obtained during sentence matching, and the target vocabulary, that is, the new words in the training data, is determined from them.

In one embodiment of the present invention, preferably, the step of obtaining training data specifically includes: obtaining corpus data; and preprocessing the corpus data to obtain the training data.

In this embodiment, machine translation semantic similarity data is chosen first. The corpus data of different languages is then matched item by item; garbled text and disorganized items are removed, unneeded corpus data is cleaned out, and the labels are checked for correctness. Sentence (paragraph) matching is performed on the same-language degree-of-correlation training data to obtain the required training data.

Fig. 2 is the flow chart of the vocabulary detection method of a specific embodiment of the invention.

As shown in Fig. 2, the vocabulary detection method includes:

S202: obtain corpus data;

S204: preprocess the corpus data to obtain training data;

S206: in the process of translating the training data, extract the first contextual information in the training data;

S208: determine the first context vector and the first degree of correlation information of the training data according to the first contextual information;

S210: in the process of matching the training data against labeled data, extract the second contextual information of the training data;

S212: determine the target vector of the training data according to the second contextual information;

S214: compare the target vector with the label vector of the labeled data, and record the comparison result;

S216: determine the second context vector and the second degree of correlation information of the training data according to the comparison result;

S218: input the first context vector and the first degree of correlation information into the bidirectional attention network to obtain a first calculated result;

S220: input the second context vector and the second degree of correlation information into the bidirectional attention network to obtain a second calculated result;

S222: determine the target vocabulary in the training data according to the first calculated result and the second calculated result.

The vocabulary detection method provided by this specific embodiment is broadly divided into the following stages:

Bi-LSTM (bidirectional long short-term memory network) is chosen as the network for extracting features, and Bi-attention (bidirectional attention network) is chosen as the core network for generating the degree of correlation.

Machine translation semantic similarity data is chosen. The data of different languages is matched item by item; garbled text and disorganized items are removed, unneeded language data is cleaned out, and the labels are checked for correctness. Sentence (paragraph) matching is performed on the same-language degree-of-correlation training data to obtain the required training data.
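The cleaning stage above might look roughly like the following sketch. The source does not say how garbled items are detected, so the heuristic in `looks_garbled` (Unicode replacement characters or a high share of control, unassigned or private-use characters) and the 0.1 threshold are assumptions, and both function names are hypothetical.

```python
import unicodedata


def looks_garbled(text, max_bad_ratio=0.1):
    """Assumed heuristic: empty text, replacement characters, or too many
    control/unassigned/private-use characters count as garbled."""
    if not text.strip():
        return True
    if "\ufffd" in text:
        return True
    bad = sum(1 for ch in text if unicodedata.category(ch) in ("Cc", "Cn", "Co"))
    return bad / len(text) > max_bad_ratio


def clean_parallel_corpus(pairs):
    """Keep aligned (source, target) pairs whose two sides are both usable,
    dropping exact duplicates while preserving order."""
    cleaned = []
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if looks_garbled(src) or looks_garbled(tgt):
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


# Hypothetical Chinese-English sentence pairs.
raw_pairs = [
    ("今天我觉得手疼", "Today I feel hand pain"),
    ("\ufffd\ufffd\ufffd", "broken encoding"),
    ("", "missing source side"),
    ("今天我觉得手疼", "Today I feel hand pain"),
]
training_pairs = clean_parallel_corpus(raw_pairs)
```

Dropping a pair when either side fails keeps the two languages aligned one to one, which the later translation stage relies on.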

Machine translation is performed on the Chinese-English data with a seq2seq (encoder-decoder) model. Only the context vector is needed to describe the degree of correlation (overlap). During machine translation, the contextual information in the corpus is extracted to obtain the context vector and the degree of correlation information, and a first round of information and feature extraction is performed using the bidirectional attention model.

Sentence matching focuses on the contextual information between Chinese sentences. Although information is not extracted as deeply as in the machine translation neural network model, feature vectors can still be extracted from the mutual information of the context. Sentence matching is performed against labeled data compared pair by pair: identical sentences are labeled 1 and different sentences are labeled -1, and the degree of correlation of the words within them is found. The results of machine translation and of sentence matching are then fed in parallel into the next step.
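A minimal sketch of the comparison step, assuming sentence vectors are already available. The source states only that the target vector is compared with the label vector and that pairs are labeled 1 or -1; cosine similarity, the 0.5 threshold and the function names are assumptions made for illustration.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def compare_with_labels(target_vecs, label_vecs, gold_labels, threshold=0.5):
    """Compare each target vector with its label vector and record the result.

    gold_labels uses the convention from the text: 1 for identical sentences,
    -1 for different sentences.
    """
    records = []
    for target, label_vec, gold in zip(target_vecs, label_vecs, gold_labels):
        similarity = cosine(target, label_vec)
        predicted = 1 if similarity >= threshold else -1
        records.append({
            "similarity": similarity,
            "predicted": predicted,
            "gold": gold,
            "agrees": predicted == gold,
        })
    return records


# Hypothetical two-dimensional sentence vectors.
targets = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
labels = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
comparison = compare_with_labels(targets, labels, gold_labels=[1, -1])
```

The recorded comparison results play the role of the input from which the second context vector and second degree of correlation information are later derived.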

The neural network formula involved is as follows:

CoVe(w) = MT-LSTM(GloVe(w))

Here, GloVe(w) denotes mapping the word w to its corresponding vector through the embedding layer of GloVe (a word vector model). This vector is then used as the input of the encoder in the machine translation model, and the output of the encoder is the context vector CoVe. In other words, the context vector CoVe can be obtained directly from the machine translation model.
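The formula can be read as "embed each word, then run a bidirectional LSTM encoder over the sentence". The NumPy sketch below shows only that data flow. The weights are random rather than taken from a trained translation encoder, a random table stands in for pretrained GloVe vectors, and the single-layer encoder and all names are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_pass(inputs, W, U, b, hidden):
    """Run one LSTM direction over a (T, d) sequence; return (T, hidden) states."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x in inputs:
        z = W @ x + U @ h + b            # stacked gate pre-activations
        i, f, o, g = np.split(z, 4)      # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)


def bilstm_context_vectors(embeddings, params_fwd, params_bwd, hidden):
    """Concatenate forward and backward states for every token position."""
    fwd = lstm_pass(embeddings, *params_fwd, hidden)
    bwd = lstm_pass(embeddings[::-1], *params_bwd, hidden)[::-1]
    return np.concatenate([fwd, bwd], axis=1)


def init_params(rng, d, hidden):
    """Random, untrained weights (illustration only)."""
    return (rng.normal(scale=0.1, size=(4 * hidden, d)),
            rng.normal(scale=0.1, size=(4 * hidden, hidden)),
            np.zeros(4 * hidden))


rng = np.random.default_rng(0)
d, hidden = 8, 4
vocab = {"today": 0, "I": 1, "feel": 2, "hand": 3, "pain": 4}
embedding_table = rng.normal(size=(len(vocab), d))   # stand-in for GloVe(w)
sentence = ["today", "I", "feel", "hand", "pain"]
embeddings = embedding_table[[vocab[w] for w in sentence]]
context = bilstm_context_vectors(
    embeddings, init_params(rng, d, hidden), init_params(rng, d, hidden), hidden
)
```

Each row of `context` depends on the whole sentence, so the same word receives a different vector in a different context. That context dependence is the property the formula is used for here.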

After the corpus has been compressed once and features have been extracted, the data is passed to the bidirectional attention network for new word discovery training. This network model can handle paired sentences as well as single sentences; when handling a single sentence, the sentence is duplicated and then processed as a sentence pair. The model "highlights" words with a high degree of correlation by increasing their weight in the attention matrix, so that related words in the two sentences receive more attention, which improves training. Finally, the bidirectional attention network outputs two related words: one is a word already in the dictionary, and the other is the new word to be discovered.

In addition, the machine translation model and the attention model may be improved so that the new word discovery results are better: more new words are found, with a higher degree of correlation.

Other methods of pre-training context-sensitive vectors also exist. Such pre-training methods likewise describe text vectors using contextual information and then perform new word discovery with an LSTM (long short-term memory network) method. Like this solution, they model the text corpus using contextual information. Although the specific models used differ, the principle is essentially the same, so they also fall within the protection scope of the present invention.
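The duplication of a single sentence and the "highlighting" step can be sketched as follows. The source does not specify how the weight is increased, so the additive boost applied to the highest-scoring entries before the softmax is an assumption, as are the dot-product scoring and the names `biattention`, `boost` and `top_k`.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def biattention(A, B=None, boost=2.0, top_k=1):
    """Attention in both directions between token vectors A (m, d) and B (n, d).

    If B is None the single sentence is duplicated and treated as a pair.
    The top_k highest-scoring token pairs are "highlighted" by adding `boost`
    to their scores before normalization (assumed scheme).
    """
    if B is None:
        B = A.copy()
    scores = A @ B.T                                    # (m, n) affinity matrix
    top = np.argsort(scores, axis=None)[-top_k:]        # flat indices of top pairs
    rows, cols = np.unravel_index(top, scores.shape)
    boosted = scores.copy()
    boosted[rows, cols] += boost
    a_to_b = softmax(boosted, axis=1)   # each token of A attends over B
    b_to_a = softmax(boosted, axis=0)   # each token of B attends over A
    return a_to_b, b_to_a, list(zip(rows.tolist(), cols.tolist()))


# Hypothetical token vectors for two short sentences.
A = np.array([[1.0, 0.0], [0.0, 2.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
a_to_b, b_to_a, highlighted = biattention(A, B)
single_a_to_b, single_b_to_a, _ = biattention(A)   # single-sentence case
```

The highlighted index pair identifies the two mutually related tokens; in the described method one of them would be the dictionary word and the other the candidate new word.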

The second aspect of the present invention provides a vocabulary detection system 300, as shown in Fig. 3, comprising: a memory 302 for storing a computer program; and a processor 304 for executing the computer program to: obtain training data; input the training data into a composite network model to obtain a context vector and degree of correlation information of the training data; and determine the target vocabulary in the training data according to the context vector and the degree of correlation information; wherein the composite network model is composed of a bidirectional long short-term memory (Bi-LSTM) network and a bidirectional attention network.

The vocabulary detection system 300 provided by the second aspect of the present invention includes a memory 302 and a processor 304 that cooperate with each other. The memory 302 stores a computer program, and the processor 304 executes the computer program so as to use the Bi-LSTM network as the network for extracting features and the bidirectional attention network as the core network for generating the degree of correlation. After the training data is obtained, it is input into the composite network model composed of the Bi-LSTM network and the bidirectional attention network. The composite network model computes and outputs the context vector and degree of correlation information of the training data, and the target vocabulary in the training data is determined from them. Specifically, new words are discovered through the context vector and degree of correlation information of the training data.

The vocabulary detection system 300 provided by the second aspect of the present invention uses the Bi-LSTM network as the network for extracting features and the bidirectional attention network as the core network for generating the degree of correlation, and obtains new words with a high "degree of correlation". Specifically, "degree of correlation" refers to whether words can substitute for each other in different contexts: if two words can be substituted for each other, the substituting word is treated as a new word and the two are considered highly correlated. For example, take the two sentences "Today I feel hand pain" and "Today I feel shoulder pain". The similarity between the word vectors of "hand pain" and "shoulder pain" is low, but the meanings expressed in the two sentences are similar, so the degree of correlation between "hand pain" and "shoulder pain" is high. The present invention finds the new words in a sentence based on the degree of correlation, which ensures accurate discovery of new words and yields the new words that are needed.

In one embodiment of the present invention, preferably, the processor 304 is specifically configured to: in the process of translating the training data, extract first contextual information in the training data; determine a first context vector and first degree of correlation information of the training data according to the first contextual information; in the process of matching the training data against labeled data, extract second contextual information of the training data; determine a target vector of the training data according to the second contextual information; compare the target vector with the label vector of the labeled data and record the comparison result; and determine a second context vector and second degree of correlation information according to the comparison result.

In this embodiment, after the processor inputs the training data into the composite network model, Chinese-English translation is performed on it. Specifically, whether the input is an English sentence or a Chinese sentence, it is translated to obtain the corresponding English and Chinese sentences. During translation, the contextual information of the training data is extracted to obtain the first contextual information, and the first context vector and first degree of correlation information of the training data are determined according to it. Meanwhile, after the training data is input into the composite network model, the training data is also matched. In the process of matching the training data against the labeled data, the second contextual information of the training data is extracted, specifically the contextual information of the Chinese sentences, and the target vector of the training data is determined according to the second contextual information. The target vector is then compared with the label vector of the labeled data and the result is recorded. Finally, the second context vector and second degree of correlation information are determined according to the comparison result.

In one embodiment of the present invention, preferably, the processor 304 is specifically configured to: input the first context vector and the first degree of correlation information into the bidirectional attention network to obtain a first calculated result; input the second context vector and the second degree of correlation information into the bidirectional attention network to obtain a second calculated result; and obtain the target vocabulary in the training data according to the first calculated result and the second calculated result.

In one embodiment of the present invention, preferably, the processor 304 is specifically configured to: obtain corpus data; and preprocess the corpus data to obtain the training data.

In the description of the present invention, term " multiple " then refers to two or more, unless otherwise restricted clearly, term The orientation or positional relationship of the instructions such as "upper", "lower" is to be based on the orientation or positional relationship shown in the drawings, and is merely for convenience of retouching It states the present invention and simplifies description, rather than the device or element of indication or suggestion meaning must have a particular orientation, with specific Orientation construction and operation, therefore be not considered as limiting the invention；Term " connection ", " installation ", " fixation " etc. should all It is interpreted broadly, for example, " connection " may be fixed connection or may be dismantle connection, or integral connection；It can be straight Connect it is connected, can also be indirectly connected through an intermediary.It for the ordinary skill in the art, can be according to specific feelings Condition understands the concrete meaning of above-mentioned term in the present invention.

In the description of this specification, the description of term " one embodiment ", " some embodiments ", " specific embodiment " etc. Mean that particular features, structures, materials, or characteristics described in conjunction with this embodiment or example are contained at least one reality of the invention It applies in example or example.In the present specification, schematic expression of the above terms are not necessarily referring to identical embodiment or reality Example.Moreover, description particular features, structures, materials, or characteristics can in any one or more of the embodiments or examples with Suitable mode combines.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may be modified and varied in various ways. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A vocabulary detection method, characterized by comprising:

obtaining training data;

inputting the training data into a composite network model to obtain a context vector and degree of correlation information of the training data; and

determining the target vocabulary in the training data according to the context vector and the degree of correlation information;

wherein the composite network model is composed of a two-way long short-term memory network and a two-way attention network.

2. The vocabulary detection method according to claim 1, characterized in that the step of inputting the training data into the composite network model to obtain the context vector and the degree of correlation information of the training data specifically comprises:

extracting first contextual information from the training data in the process of translating the training data; and

determining a first context vector and first degree of correlation information of the training data according to the first contextual information.

3. The vocabulary detection method according to claim 2, characterized in that the step of inputting the training data into the composite network model to obtain the context vector and the degree of correlation information of the training data specifically comprises:

extracting second contextual information of the training data in the process of matching the training data against labeled data;

determining an object vector of the training data according to the second contextual information;

comparing the object vector with a label vector of the labeled data, and recording the comparison result; and

determining a second context vector of the training data and second degree of correlation information of the training data according to the comparison result.

4. The vocabulary detection method according to claim 3, characterized in that the step of determining the target vocabulary in the training data according to the context vector and the degree of correlation information specifically comprises:

inputting the first context vector and the first degree of correlation information into the two-way attention network to obtain a first calculation result;

inputting the second context vector and the second degree of correlation information into the two-way attention network to obtain a second calculation result; and

determining the target vocabulary in the training data according to the first calculation result and the second calculation result.

5. The vocabulary detection method according to any one of claims 1 to 4, characterized in that the step of obtaining training data specifically comprises:

obtaining corpus data; and

preprocessing the corpus data to obtain the training data.

6. A vocabulary detection system, characterized by comprising:

a memory for storing a computer program; and

a processor for executing the computer program to:

obtain training data;

7. The vocabulary detection system according to claim 6, characterized in that the processor is specifically configured to:

determine a first context vector and first degree of correlation information of the training data according to the first contextual information;

8. The vocabulary detection system according to claim 7, characterized in that the processor is specifically configured to:

obtain the target vocabulary in the training data according to the first calculation result and the second calculation result.

9. The vocabulary detection system according to any one of claims 6 to 8, characterized in that the processor is specifically configured to:

obtain corpus data; and

preprocess the corpus data to obtain the training data.

10. A computer readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the vocabulary detection method according to any one of claims 1 to 5.