CN110147550A - Neural-network-based pronunciation feature fusion method - Google Patents

Neural-network-based pronunciation feature fusion method

Info

Publication number
CN110147550A
CN110147550A (application CN201910327655.2A)
Authority
CN
China
Prior art keywords
text
data
pronunciation
neural network
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910327655.2A
Other languages
Chinese (zh)
Inventor
Li Huakang (李华康)
Wang Lei (王磊)
Kong Lingjun (孔令军)
Sun Guozi (孙国梓)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910327655.2A
Publication of CN110147550A
Legal status: Pending


Classifications

    • G (Physics) › G06 (Computing; calculating or counting) › G06F (Electric digital data processing) › G06F 40/00 (Handling natural language data) › G06F 40/20 (Natural language analysis):
        • G06F 40/205: Parsing
        • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation
        • G06F 40/279: Recognition of textual entities
        • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G (Physics) › G06 (Computing; calculating or counting) › G06N (Computing arrangements based on specific computational models) › G06N 3/00 (Computing arrangements based on biological models) › G06N 3/02 (Neural networks) › G06N 3/04 (Architecture, e.g. interconnection topology):
        • G06N 3/044: Recurrent networks, e.g. Hopfield networks
        • G06N 3/045: Combinations of networks
        • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention discloses a neural-network-based pronunciation feature fusion method comprising the following steps: S1, a dataset acquisition step, in which text data are obtained; S2, a data preprocessing step, in which the text data are preprocessed, the noise contained in them is removed, and the text data are then converted into phonetic-symbol or pinyin text; S3, a text vectorization step, in which the text word vectors and pronunciation vectors of the text data are trained, and the text data and the corresponding phonetic-symbol or pinyin text are converted into vectorized data; and S4, a neural network training and fusion step, in which the vectorized data are fed into two neural networks for training, each network learns the sequence information of the text, and the learned textual word feature information and pronunciation feature information are finally fused. By using the pronunciation features of the text and fusing them with its textual word features, the present invention enriches the representation of the text and improves the results of natural language processing tasks.

Description

Neural-network-based pronunciation feature fusion method
Technical field
The present invention relates to a feature fusion method, and in particular to a neural-network-based pronunciation feature fusion method, belonging to the field of natural language processing.
Background art
With the rapid development of the Internet, and in particular the recent emergence of social media such as Weibo, WeChat, and e-commerce platforms, the Internet has leapt into the Web 2.0 era, transforming itself from a "read-only Internet" into an "interactive Internet". Internet users no longer regard the network merely as a channel for obtaining information, but as a platform for spreading information and sharing their own viewpoints and emotions. Every day, hundreds of millions of users publish and spread information at an exponential rate, and the overwhelming majority of this massive volume of information expresses the viewpoints and preferences of its publishers. In today's big-data era, these viewpoints are undoubtedly extremely valuable assets, containing the subjective views of different people on different social fields and phenomena. For this reason, merchants, social organizations, and individuals alike pay increasing attention to user comments on the network, and analyzing such text information has become a particularly important direction of development in the field of natural language processing.
In most current text analysis models, the input to the model is only the semantic representation of words. Such models ignore a large amount of information contained in the text and cannot represent the original text accurately, so natural language processing based on a single feature alone yields unsatisfactory results. In language, however, pronunciation often also carries the emotional information of the text. If pronunciation features can be collected and processed to form rich deep semantic features, the effectiveness of natural language processing will be improved.
In summary, how to propose a fusion method for pronunciation features on the basis of the prior art, so as to make full use of textual information, has become a common research goal for those skilled in the art.
Summary of the invention
In view of the above drawbacks of the prior art, the purpose of the present invention is to propose a neural-network-based pronunciation feature fusion method, comprising the following steps:
S1, a dataset acquisition step: obtain text data;
S2, a data preprocessing step: preprocess the collected text data, remove the noise contained in the text data, and then convert the resulting text data into phonetic-symbol text or pinyin text;
S3, a text vectorization step: train the text word vectors and pronunciation vectors of the text data, and convert the text data obtained in the previous step and the corresponding phonetic-symbol or pinyin text into vectorized data;
S4, a neural network training and fusion step: feed the vectorized data obtained in the previous step into two neural networks for training, so that each network learns the sequence information of the text, and finally fuse the learned textual word feature information with the pronunciation feature information.
Preferably, the dataset acquisition step in S1 specifically comprises: downloading public datasets or crawling network text with a web crawler, forming multiple datasets from the downloaded or crawled data, and merging all the datasets to form the text data.
Preferably, the data preprocessing step in S2 specifically comprises:
S21, selecting a dataset to be processed from the text data and executing the subsequent steps;
S22, judging whether the text data in the dataset to be processed are Chinese text or English text; if Chinese text, executing S23; if English text, jumping to S26;
S23, performing noise removal on the Chinese text in the dataset to be processed, the noise removal including format standardization and removal of special characters and punctuation marks;
S24, performing word segmentation and stop-word removal on the Chinese text after noise removal;
S25, annotating the segmented and stop-word-filtered Chinese text with pinyin, obtaining the pinyin text of the original text data;
S26, performing noise removal on the English text in the dataset to be processed, the noise removal including format standardization and removal of special characters and punctuation marks;
S27, performing lemmatization on the English text after noise removal;
S28, annotating the lemmatized English text with phonetic symbols, obtaining the phonetic-symbol text of the original text data.
Preferably, when performing the word segmentation in S24, the segmentation tool is jieba, SnowNLP, or THULAC; when performing the stop-word removal in S24, the stop-word list is the Harbin Institute of Technology stop-word list, the Baidu stop-word list, or the Sichuan University Machine Intelligence Laboratory stop-word dictionary.
Preferably, when performing the phonetic-symbol annotation in S28, the annotation is carried out by dictionary matching against an English dictionary, or by using a web crawler with an online English dictionary.
Preferably, the text vectorization step in S3 specifically comprises:
S31, selecting the preprocessed text data;
S32, training the text word vectors and pronunciation vectors;
S33, using the text word vectors and pronunciation vectors obtained in S32 to convert the preprocessed text data selected in S31 and the corresponding phonetic-symbol or pinyin text into data in matrix form, the matrix-form data comprising a text word-vector matrix and a pronunciation vector matrix.
Preferably, the text word vectors in S32 are publicly shared word-vector resources, or word vectors trained with Word2Vec or GloVe on a downloaded large-scale corpus combined with the experimental dataset; the pronunciation vectors in S32 are trained with Word2Vec or GloVe; the text word vectors and the pronunciation vectors have the same dimension.
Preferably, the neural network training and fusion step in S4 specifically comprises:
S41, reading the text word-vector matrix of the text data and feeding it into a neural network for training;
S42, reading the pronunciation vector matrix of the text data and feeding it into a neural network for training;
S43, fusing the textual word feature information and the pronunciation feature information obtained through the training of the two neural networks in S41 and S42.
Preferably, the neural networks used in S41 and S42 are RNN, CNN, LSTM, or Bi-LSTM; the two neural networks used in S41 and S42 have identical parameters.
Preferably, the fusion in S43 is performed by concatenation, element-wise addition, or averaging.
Compared with the prior art, the advantages of the present invention are mainly reflected in the following aspects:
The neural-network-based pronunciation feature fusion method of the present invention enriches the representation of the text by using the pronunciation features of the text and fusing them with its textual word features, improving the results of a variety of natural language processing tasks, including text classification and natural language analysis.
At the same time, the present invention also provides a reference for other related problems in the same field, can be extended on this basis, and can be applied to other technical solutions in the field of natural language processing, with very broad application prospects.
The embodiments of the present invention are described in further detail below in conjunction with the accompanying drawings, so that the technical solution of the invention is easier to understand and grasp.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall flow of the method of the present invention;
Fig. 2 is a schematic flow diagram of the data preprocessing step in the present invention;
Fig. 3 is a schematic flow diagram of the text vectorization step in the present invention;
Fig. 4 is a schematic flow diagram of the neural network training and fusion step in the present invention.
Specific embodiments
The present invention addresses the problem that most current text analysis models use only word features as input and ignore the other information contained in the text. It proposes a neural-network-based pronunciation feature fusion method that adds a vector representation of pronunciation features alongside the semantic word vectors, making full use of the information in the text and forming deep semantic features that fuse text word vectors with pronunciation vectors. The method of the invention is explained below with reference to the accompanying drawings.
As shown in Fig. 1, a neural-network-based pronunciation feature fusion method comprises the following steps:
S1, a dataset acquisition step: obtain text data.
S2, a data preprocessing step: preprocess the collected text data, remove the noise contained in the text data, and then convert the resulting text data into phonetic-symbol text or pinyin text.
S3, a text vectorization step: train the text word vectors and pronunciation vectors of the text data, and convert the text data obtained in the previous step and the corresponding phonetic-symbol or pinyin text into vectorized data.
S4, a neural network training and fusion step: feed the vectorized data obtained in the previous step into two neural networks for training, so that each network learns the sequence information of the text, and finally fuse the learned textual word feature information with the pronunciation feature information.
The dataset acquisition step in S1 specifically comprises: downloading public datasets or crawling network text with a web crawler, forming multiple datasets from the downloaded or crawled data, and merging all the datasets to form the text data.
As shown in Fig. 2, the data preprocessing step in S2 specifically comprises:
S21, selecting a dataset to be processed from the text data and executing the subsequent steps.
S22, judging whether the text data in the dataset to be processed are Chinese text or English text; if Chinese text, executing S23; if English text, jumping to S26.
S23, performing noise removal on the Chinese text in the dataset to be processed, the noise removal including format standardization and removal of special characters, punctuation marks, and the like.
S24, performing word segmentation and stop-word removal on the Chinese text after noise removal. For the word segmentation, the segmentation tool is jieba, SnowNLP, THULAC, or the like; for the stop-word removal, the stop-word list is the Harbin Institute of Technology stop-word list, the Baidu stop-word list, the Sichuan University Machine Intelligence Laboratory stop-word dictionary, or the like.
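As an illustrative sketch (not part of the claimed method), the stop-word removal half of S24 can be expressed in a few lines of Python. The tokens are assumed to have already been produced by a segmenter such as jieba, and the three-entry stop-word set is a tiny stand-in for a real list such as the Harbin Institute of Technology stop-word list:

```python
# Stop-word removal on already-segmented Chinese tokens (sketch of S24).
# The stop-word set is a tiny stand-in for a real stop-word list.
STOPWORDS = {"的", "了", "是"}

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in STOPWORDS]

# Tokens as a segmenter such as jieba might produce them:
print(remove_stopwords(["今天", "的", "天气", "很", "好"]))  # ['今天', '天气', '很', '好']
```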
S25, annotating the segmented and stop-word-filtered Chinese text with pinyin, obtaining the pinyin text of the original text data. This step can be implemented with Python libraries such as xpinyin and pypinyin.
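The pinyin annotation of S25 can be sketched as below. In practice the pypinyin or xpinyin libraries mentioned above would be used, since real Chinese contains polyphonic characters that a fixed table like this illustrative one cannot disambiguate:

```python
# Character-to-pinyin annotation (sketch of S25). The mapping is an
# illustrative stand-in for what pypinyin/xpinyin provide.
PINYIN = {"天": "tian", "气": "qi", "好": "hao"}

def to_pinyin(tokens):
    """Replace each character of each token with its pinyin syllable."""
    return [" ".join(PINYIN.get(ch, ch) for ch in tok) for tok in tokens]

print(to_pinyin(["天气", "好"]))  # ['tian qi', 'hao']
```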
S26, performing noise removal on the English text in the dataset to be processed, the noise removal including format standardization and removal of special characters, punctuation marks, and the like.
S27, performing lemmatization on the English text after noise removal, to facilitate the phonetic-symbol annotation of the English text in the subsequent step. The lemmatization can be implemented with the WordNet dictionary in Python's NLTK library.
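A toy approximation of the lemmatization in S27 is shown below. The real implementation would use NLTK's WordNet lemmatizer as stated above; the suffix rules here are purely illustrative and cannot handle irregular forms:

```python
# Naive suffix-stripping lemmatization (sketch of S27). NLTK's
# WordNetLemmatizer, named in the patent, handles irregular forms
# that these toy rules cannot.
def naive_lemmatize(word):
    """Strip a few common English suffixes as a rough lemmatization."""
    for suffix, repl in (("ies", "y"), ("ing", ""), ("ed", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

print([naive_lemmatize(w) for w in ["studies", "walked", "playing"]])
# ['study', 'walk', 'play']
```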
S28, annotating the lemmatized English text with phonetic symbols, obtaining the phonetic-symbol text of the original text data. The phonetic-symbol annotation can be carried out by dictionary matching against an English dictionary, or by using a web crawler with an online English dictionary.
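The dictionary-matching variant of the phonetic-symbol annotation in S28 reduces to a lookup. The two IPA entries below are illustrative stand-ins for a full English pronouncing dictionary (downloaded or scraped, as the patent suggests):

```python
# Dictionary-matching phonetic annotation (sketch of S28). The mapping
# is an illustrative stand-in for a full English pronouncing dictionary.
IPA = {"good": "ɡʊd", "morning": "ˈmɔːnɪŋ"}

def annotate_phonetics(tokens):
    """Look each word up in the pronouncing dictionary; keep unknowns as-is."""
    return [IPA.get(t, t) for t in tokens]

print(annotate_phonetics(["good", "morning"]))  # ['ɡʊd', 'ˈmɔːnɪŋ']
```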
As shown in Fig. 3, the text vectorization step in S3 specifically comprises:
S31, selecting the preprocessed text data.
S32, training the text word vectors and pronunciation vectors.
The text word vectors are publicly shared word-vector resources, such as Chinese word vectors trained on the Wiki corpus or the English GloVe word-vector series trained at Stanford; alternatively, they are trained with Word2Vec or GloVe on a downloaded large-scale corpus combined with the experimental dataset. The pronunciation vectors are trained with Word2Vec or GloVe. The text word vectors and the pronunciation vectors have the same dimension.
S33, using the text word vectors and pronunciation vectors obtained in S32 to convert the preprocessed text data selected in S31 and the corresponding phonetic-symbol or pinyin text into data in matrix form, the matrix-form data comprising a text word-vector matrix and a pronunciation vector matrix.
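The matrix conversion of S33 can be sketched as two embedding look-ups producing matrices of equal dimension, as S32 requires. The 3-dimensional vectors below are toy values standing in for vectors actually trained with Word2Vec or GloVe:

```python
# Conversion of text and pinyin sequences to matrix form (sketch of S33).
# The embedding tables are toy stand-ins for trained Word2Vec/GloVe vectors.
TEXT_EMB = {"天气": [0.1, 0.2, 0.3], "好": [0.4, 0.5, 0.6]}
PRON_EMB = {"tianqi": [0.7, 0.1, 0.0], "hao": [0.2, 0.2, 0.9]}
UNK = [0.0, 0.0, 0.0]  # fallback for out-of-vocabulary tokens

def to_matrix(tokens, table):
    """Stack one embedding row per token."""
    return [table.get(t, UNK) for t in tokens]

text_matrix = to_matrix(["天气", "好"], TEXT_EMB)
pron_matrix = to_matrix(["tianqi", "hao"], PRON_EMB)
# The text and pronunciation vectors share the same dimension (S32):
assert len(text_matrix[0]) == len(pron_matrix[0]) == 3
```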
As shown in Fig. 4, the neural network training and fusion step in S4 specifically comprises:
S41, reading the text word-vector matrix of the text data and feeding it into a neural network for training.
S42, reading the pronunciation vector matrix of the text data and feeding it into a neural network for training.
It should be noted that the neural networks used in S41 and S42 are RNN, CNN, LSTM, Bi-LSTM, or the like, and that the two neural networks used in S41 and S42 have identical parameters.
S43, fusing the textual word feature information and the pronunciation feature information obtained through the training of the two neural networks in S41 and S42. The fusion here is performed by concatenation, element-wise addition, averaging, or the like.
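The three fusion modes named in S43 can be sketched directly on the feature vectors the two networks would output; plain Python lists stand in for the network outputs here:

```python
# The three fusion modes of S43: concatenation, element-wise addition,
# and averaging, applied to text and pronunciation feature vectors.
def fuse(text_feat, pron_feat, mode="concat"):
    if mode == "concat":
        return text_feat + pron_feat
    if mode == "add":
        return [a + b for a, b in zip(text_feat, pron_feat)]
    if mode == "average":
        return [(a + b) / 2 for a, b in zip(text_feat, pron_feat)]
    raise ValueError("unknown fusion mode: " + mode)

t, p = [1.0, 2.0], [3.0, 4.0]  # stand-ins for the two networks' outputs
print(fuse(t, p, "concat"))   # [1.0, 2.0, 3.0, 4.0]
print(fuse(t, p, "add"))      # [4.0, 6.0]
print(fuse(t, p, "average"))  # [2.0, 3.0]
```

Concatenation preserves both feature sets at the cost of doubling the dimension, while addition and averaging keep the dimension fixed, which is why the patent requires the two vector sets to share the same dimension.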
In summary, the neural-network-based pronunciation feature fusion method of the present invention enriches the representation of the text by using the pronunciation features of the text and fusing them with its textual word features, improving the results of a variety of natural language processing tasks, including text classification and natural language analysis.
Meanwhile the present invention also provides reference for other relevant issues in same domain, can be opened up on this basis Extension is stretched, and is applied in other related art schemes in terms of natural language processing, has very wide application prospect.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit and essential characteristics. Therefore, from whatever point of view, the present embodiments are to be considered illustrative and not restrictive, the scope of the present invention being defined by the appended claims rather than by the above description. It is intended that all changes falling within the meaning and scope of equivalency of the claims be embraced within the present invention, and any reference signs in the claims shall not be construed as limiting the claims concerned.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted merely for clarity; those skilled in the art should regard the specification as a whole, and the technical solutions in the various embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (10)

1. A neural-network-based pronunciation feature fusion method, characterized by comprising the following steps:
S1, a dataset acquisition step: obtain text data;
S2, a data preprocessing step: preprocess the collected text data, remove the noise contained in the text data, and then convert the resulting text data into phonetic-symbol text or pinyin text;
S3, a text vectorization step: train the text word vectors and pronunciation vectors of the text data, and convert the text data obtained in the previous step and the corresponding phonetic-symbol or pinyin text into vectorized data;
S4, a neural network training and fusion step: feed the vectorized data obtained in the previous step into two neural networks for training, so that each network learns the sequence information of the text, and finally fuse the learned textual word feature information with the pronunciation feature information.
2. The neural-network-based pronunciation feature fusion method according to claim 1, characterized in that the dataset acquisition step in S1 specifically comprises: downloading public datasets or crawling network text with a web crawler, forming multiple datasets from the downloaded or crawled data, and merging all the datasets to form the text data.
3. The neural-network-based pronunciation feature fusion method according to claim 2, characterized in that the data preprocessing step in S2 specifically comprises:
S21, selecting a dataset to be processed from the text data and executing the subsequent steps;
S22, judging whether the text data in the dataset to be processed are Chinese text or English text; if Chinese text, executing S23; if English text, jumping to S26;
S23, performing noise removal on the Chinese text in the dataset to be processed, the noise removal including format standardization and removal of special characters and punctuation marks;
S24, performing word segmentation and stop-word removal on the Chinese text after noise removal;
S25, annotating the segmented and stop-word-filtered Chinese text with pinyin, obtaining the pinyin text of the original text data;
S26, performing noise removal on the English text in the dataset to be processed, the noise removal including format standardization and removal of special characters and punctuation marks;
S27, performing lemmatization on the English text after noise removal;
S28, annotating the lemmatized English text with phonetic symbols, obtaining the phonetic-symbol text of the original text data.
4. The neural-network-based pronunciation feature fusion method according to claim 3, characterized in that: when performing the word segmentation in S24, the segmentation tool is jieba, SnowNLP, or THULAC; when performing the stop-word removal in S24, the stop-word list is the Harbin Institute of Technology stop-word list, the Baidu stop-word list, or the Sichuan University Machine Intelligence Laboratory stop-word dictionary.
5. The neural-network-based pronunciation feature fusion method according to claim 3, characterized in that: when performing the phonetic-symbol annotation in S28, the annotation is carried out by dictionary matching against an English dictionary, or by using a web crawler with an online English dictionary.
6. The neural-network-based pronunciation feature fusion method according to claim 3, characterized in that the text vectorization step in S3 specifically comprises:
S31, selecting the preprocessed text data;
S32, training the text word vectors and pronunciation vectors;
S33, using the text word vectors and pronunciation vectors obtained in S32 to convert the preprocessed text data selected in S31 and the corresponding phonetic-symbol or pinyin text into data in matrix form, the matrix-form data comprising a text word-vector matrix and a pronunciation vector matrix.
7. The neural-network-based pronunciation feature fusion method according to claim 6, characterized in that: the text word vectors in S32 are publicly shared word-vector resources, or word vectors trained with Word2Vec or GloVe on a downloaded large-scale corpus combined with the experimental dataset; the pronunciation vectors in S32 are trained with Word2Vec or GloVe; and the text word vectors and the pronunciation vectors have the same dimension.
8. The neural-network-based pronunciation feature fusion method according to claim 6, characterized in that the neural network training and fusion step in S4 specifically comprises:
S41, reading the text word-vector matrix of the text data and feeding it into a neural network for training;
S42, reading the pronunciation vector matrix of the text data and feeding it into a neural network for training;
S43, fusing the textual word feature information and the pronunciation feature information obtained through the training of the two neural networks in S41 and S42.
9. The neural-network-based pronunciation feature fusion method according to claim 8, characterized in that: the neural networks used in S41 and S42 are RNN, CNN, LSTM, or Bi-LSTM; and the two neural networks used in S41 and S42 have identical parameters.
10. The neural-network-based pronunciation feature fusion method according to claim 8, characterized in that the fusion in S43 is performed by concatenation, element-wise addition, or averaging.
CN201910327655.2A 2019-04-23 2019-04-23 Neural-network-based pronunciation feature fusion method Pending CN110147550A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910327655.2A CN110147550A (en) 2019-04-23 2019-04-23 Neural-network-based pronunciation feature fusion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910327655.2A CN110147550A (en) 2019-04-23 2019-04-23 Neural-network-based pronunciation feature fusion method

Publications (1)

Publication Number Publication Date
CN110147550A true CN110147550A (en) 2019-08-20

Family

ID=67593857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910327655.2A Pending CN110147550A (en) 2019-04-23 2019-04-23 Neural-network-based pronunciation feature fusion method

Country Status (1)

Country Link
CN (1) CN110147550A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078887A (en) * 2019-12-20 2020-04-28 厦门市美亚柏科信息股份有限公司 Text classification method and device
CN111191463A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Emotion analysis method and device, electronic equipment and storage medium
CN112652291A (en) * 2020-12-15 2021-04-13 携程旅游网络技术(上海)有限公司 Speech synthesis method, system, device and storage medium based on neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874765A (en) * 2017-05-15 2018-11-23 阿里巴巴集团控股有限公司 Term vector processing method and processing device
US20180342233A1 (en) * 2017-05-23 2018-11-29 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for correcting speech recognition error based on artificial intelligence, and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874765A (en) * 2017-05-15 2018-11-23 阿里巴巴集团控股有限公司 Term vector processing method and processing device
US20180342233A1 (en) * 2017-05-23 2018-11-29 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for correcting speech recognition error based on artificial intelligence, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENHAO ZHU ET AL.: "Improve word embedding using both writing and pronunciation", 《PLOS ONE》 *
YU BENGONG ET AL.: "Research on Chinese short text classification based on CP-CNN", 《Application Research of Computers》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078887A (en) * 2019-12-20 2020-04-28 厦门市美亚柏科信息股份有限公司 Text classification method and device
CN111078887B (en) * 2019-12-20 2022-04-29 厦门市美亚柏科信息股份有限公司 Text classification method and device
CN111191463A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Emotion analysis method and device, electronic equipment and storage medium
CN112652291A (en) * 2020-12-15 2021-04-13 携程旅游网络技术(上海)有限公司 Speech synthesis method, system, device and storage medium based on neural network
CN112652291B (en) * 2020-12-15 2024-04-05 携程旅游网络技术(上海)有限公司 Speech synthesis method, system, equipment and storage medium based on neural network

Similar Documents

Publication Publication Date Title
Abdullah et al. SEDAT: sentiment and emotion detection in Arabic text using CNN-LSTM deep learning
Deng et al. Deep learning in natural language processing
CN108334891B (en) Task type intention classification method and device
CN108984530A (en) A kind of detection method and detection system of network sensitive content
CN106096664B (en) A kind of sentiment analysis method based on social network data
CN110110054A (en) A method of obtaining question and answer pair in the slave non-structured text based on deep learning
CN109948158A (en) Emotional orientation analytical method based on environment member insertion and deep learning
CN110147550A (en) Pronunciation character fusion method neural network based
CN111797898A (en) Online comment automatic reply method based on deep semantic matching
CN107102980A (en) The extracting method and device of emotion information
Zhang et al. Effective subword segmentation for text comprehension
Savelieva et al. Abstractive summarization of spoken and written instructions with BERT
CN109325120A (en) A kind of text sentiment classification method separating user and product attention mechanism
Mika et al. Learning to tag and tagging to learn: A case study on wikipedia
CN115294427A (en) Stylized image description generation method based on transfer learning
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
Sanaullah et al. A real-time automatic translation of text to sign language.
CN112711666B (en) Futures label extraction method and device
Roque Towards a computational approach to literary text analysis
Walha et al. A Lexicon approach to multidimensional analysis of tweets opinion
CN106776568A (en) Based on the rationale for the recommendation generation method that user evaluates
Karpagam et al. Deep learning approaches for answer selection in question answering system for conversation agents
Tarnpradab et al. Improving online forums summarization via hierarchical unified deep neural network
Wei et al. Comparative studies of AIML
CN113704472A (en) Hate and offensive statement identification method and system based on topic memory network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190820

RJ01 Rejection of invention patent application after publication