CN110222329A - A kind of Chinese word cutting method and device based on deep learning - Google Patents
- Publication number
- CN110222329A (application CN201910322127.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- convolutional neural
- neural networks
- random field
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
An embodiment of the invention provides a Chinese word segmentation method and device based on deep learning, in the field of artificial intelligence. The method comprises: converting training corpus data into character-level data; converting the character-level data into sequence data; cutting the sequence data at predetermined symbols to obtain multiple subsequence data, and grouping the subsequence data by length to obtain K data sets; training K temporal convolutional network-conditional random field (TCN-CRF) models according to the K data sets; and inputting the processed data of the target corpus data into at least one of the K trained TCN-CRF models to obtain the word segmentation result of the target corpus data. The technical solution provided by the embodiment of the invention thereby solves the problem of low Chinese word segmentation accuracy in the prior art.
Description
[technical field]
The present invention relates to the field of artificial intelligence, and in particular to a Chinese word segmentation method and device based on deep learning.
[background technique]
Current deep-learning Chinese word segmentation technology is based mainly on recurrent neural network models, represented by long short-term memory (LSTM) networks, and their derivatives. However, the ability of an LSTM model to process sequence data declines as the sequence length increases, so the accuracy of Chinese word segmentation is low.
[summary of the invention]
In view of this, embodiments of the invention provide a Chinese word segmentation method and device based on deep learning, to solve the problem of low Chinese word segmentation accuracy in the prior art.
In one aspect, an embodiment of the invention provides a Chinese word segmentation method based on deep learning, the method comprising: converting training corpus data into character-level data; converting the character-level data into sequence data; cutting the sequence data at predetermined symbols to obtain multiple subsequence data, and grouping the subsequence data by length to obtain K data sets, where the subsequences in each of the K data sets have equal length and K is a natural number greater than 1; extracting multiple subsequence data from the i-th data set, inputting the extracted subsequence data into the i-th temporal convolutional network-conditional random field (TCN-CRF) model, and training that model to obtain the i-th trained TCN-CRF model, with i taking each natural number from 1 to K in turn, so that K trained TCN-CRF models are obtained in total; and converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained TCN-CRF models to obtain the word segmentation result of the target corpus data.
Further, converting the character-level data into sequence data comprises: converting the character-level data into the sequence data by a preset encoding mode, the preset encoding mode being any one of the following: one-hot encoding or word-to-vector encoding.
Further, inputting the extracted subsequence data into the i-th TCN-CRF model and training it to obtain the i-th trained TCN-CRF model comprises: S1, inputting the extracted subsequence data into the i-th temporal convolutional network for forward propagation to obtain first output data, the i-th temporal convolutional network being the TCN in the i-th TCN-CRF model; S2, calculating the value of a loss function from the first output data and the input subsequence data; S3, if the value of the loss function is greater than a preset value, inputting the subsequence data into the i-th temporal convolutional network for backpropagation and optimizing the network parameters of the i-th temporal convolutional network; S4, repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value; S5, if the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the i-th trained temporal convolutional network; S6, inputting the data output by the i-th trained temporal convolutional network into the i-th conditional random field and training that conditional random field, obtaining the i-th trained TCN-CRF model, the i-th conditional random field being the CRF in the i-th TCN-CRF model.
Further, training the i-th conditional random field comprises: calculating the conditional probability of the output data of the i-th conditional random field from the data output by the i-th trained temporal convolutional network; and obtaining the maximum of that conditional probability by maximum likelihood estimation.
Further, inputting the second data into at least one of the K trained TCN-CRF models to obtain the word segmentation result of the target corpus data comprises: cutting the second data at predetermined symbols to obtain multiple sequence data; grouping the sequence data by length to obtain L data sets, where the sequences in each of the L data sets have equal length and L is a natural number with 1 ≤ L ≤ K; according to the lengths of the subsequence data used in training, filtering out L trained TCN-CRF models from the K trained TCN-CRF models, namely the L1-th to the LL-th trained TCN-CRF models; inputting all sequence data of the j-th data set into the Lj-th trained TCN-CRF model to obtain multiple word segmentation results, where the length of the subsequence data used in training the Lj-th TCN-CRF model equals the length of the sequence data in the j-th data set, j takes each natural number from 1 to L in turn, and Lj is a natural number from 1 to K; and splicing the multiple word segmentation results to obtain the word segmentation result of the target corpus data.
In one aspect, an embodiment of the invention provides a Chinese word segmentation device based on deep learning, the device comprising: a first converting unit, for converting training corpus data into character-level data; a second converting unit, for converting the character-level data into sequence data; a first cutting unit, for cutting the sequence data at predetermined symbols to obtain multiple subsequence data and grouping the subsequence data by length to obtain K data sets, where the subsequences in each of the K data sets have equal length and K is a natural number greater than 1; a first determination unit, for extracting multiple subsequence data from the i-th data set, inputting the extracted subsequence data into the i-th TCN-CRF model, and training that model to obtain the i-th trained TCN-CRF model, with i taking each natural number from 1 to K in turn, so that K trained TCN-CRF models are obtained; and a second determination unit, for converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained TCN-CRF models to obtain the word segmentation result of the target corpus data.
Further, the second converting unit comprises: a conversion subunit, for converting the character-level data into the sequence data by a preset encoding mode, the preset encoding mode being any one of the following: one-hot encoding or word-to-vector encoding.
Further, the first determination unit is configured to execute: S1, inputting the extracted subsequence data into the i-th temporal convolutional network for forward propagation to obtain first output data, the i-th temporal convolutional network being the TCN in the i-th TCN-CRF model; S2, calculating the value of a loss function from the first output data and the input subsequence data; S3, if the value of the loss function is greater than a preset value, inputting the subsequence data into the i-th temporal convolutional network for backpropagation and optimizing its network parameters; S4, repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value; S5, if the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the i-th trained temporal convolutional network; S6, inputting the data output by the i-th trained temporal convolutional network into the i-th conditional random field and training it, obtaining the i-th trained TCN-CRF model, the i-th conditional random field being the CRF in the i-th TCN-CRF model.
Further, the first determination unit comprises: a first computation subunit, for calculating the conditional probability of the output data of the i-th conditional random field from the data output by the i-th trained temporal convolutional network; and a first determination subunit, for obtaining the maximum of that conditional probability by maximum likelihood estimation.
Further, the second determination unit comprises: a cutting subunit, for cutting the second data at predetermined symbols to obtain multiple sequence data; a grouping subunit, for grouping the sequence data by length to obtain L data sets, where the sequences in each of the L data sets have equal length and L is a natural number with 1 ≤ L ≤ K; a second determination subunit, for filtering out, according to the lengths of the subsequence data used in training, L trained TCN-CRF models from the K trained models, namely the L1-th to the LL-th trained TCN-CRF models, and inputting all sequence data of the j-th data set into the Lj-th trained TCN-CRF model to obtain multiple word segmentation results, where the length of the subsequence data used in training the Lj-th model equals the length of the sequence data in the j-th data set, j takes each natural number from 1 to L in turn, and Lj is a natural number from 1 to K; and a splicing subunit, for splicing the multiple word segmentation results to obtain the word segmentation result of the target corpus data.
In one aspect, an embodiment of the invention provides a storage medium comprising a stored program, wherein, when the program runs, the device on which the storage medium is located is controlled to execute the above Chinese word segmentation method based on deep learning.
In one aspect, an embodiment of the invention provides a computer device comprising a memory and a processor, the memory being used to store information including program instructions, and the processor being used to control the execution of the program instructions, which, when loaded and executed by the processor, implement the steps of the above Chinese word segmentation method based on deep learning.
In the embodiment of the invention, target corpus data are converted into character-level data; the character-level data are converted into sequence data; and the sequence data are input into trained TCN-CRF models to obtain the word segmentation result of the target corpus data. A temporal convolutional network can expand its receptive field at an exponential rate by adding network layers, so it can process sequence data of greater length, or data of otherwise complex characteristics, and improve the accuracy of its encoding, thereby improving the accuracy of Chinese word segmentation.
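The exponential growth of the receptive field mentioned above can be made concrete for a stack of dilated causal convolutions of the kind typically used in temporal convolutional networks: with kernel size k and the dilation doubling at each layer (1, 2, 4, ...), the receptive field after n layers is 1 + (k - 1)(2^n - 1). The sketch below illustrates this; the kernel size and layer counts are illustrative assumptions, not values taken from the patent.

```python
def tcn_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of a stack of dilated causal convolutions whose
    dilation doubles at every layer (1, 2, 4, ...): each layer with
    dilation d widens the field by (kernel_size - 1) * d positions."""
    return 1 + (kernel_size - 1) * sum(2 ** layer for layer in range(num_layers))

# The receptive field doubles (roughly) with every added layer:
for layers in (1, 2, 4, 8):
    print(layers, tcn_receptive_field(3, layers))
# 1 3 / 2 7 / 4 31 / 8 511
```

With kernel size 3, eight layers already cover 511 characters, which is why depth, rather than recurrence, lets the TCN handle long sequences.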
[Detailed description of the invention]
In order to illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed for the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from these drawings without any creative labor.
Fig. 1 is a flowchart of an optional Chinese word segmentation method based on deep learning according to an embodiment of the invention;
Fig. 2 is a schematic diagram of an optional Chinese word segmentation device based on deep learning according to an embodiment of the invention;
Fig. 3 is a schematic diagram of an optional computer device provided by an embodiment of the invention.
[specific embodiment]
For a better understanding of the technical solution of the invention, the embodiments of the invention are described in detail below with reference to the drawings.
It should be clear that the described embodiments are only a part of the embodiments of the invention, not all of them. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the invention.
The terms used in the embodiments of the invention are for the purpose of describing particular embodiments only and are not intended to limit the invention. The singular forms "a", "said" and "the" used in the embodiments of the invention and in the appended claims are also intended to include the plural forms, unless the context clearly indicates another meaning.
It should be understood that the term "and/or" used herein only describes an association relationship between associated objects and indicates that three kinds of relationship may exist; for example, "A and/or B" can indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following objects.
Fig. 1 is a flowchart of an optional Chinese word segmentation method based on deep learning according to an embodiment of the invention. As shown in Fig. 1, the method comprises:
Step S102, converting training corpus data into character-level data.
Step S104, converting the character-level data into sequence data.
Step S106, cutting the sequence data at predetermined symbols to obtain multiple subsequence data, and grouping the subsequence data by length to obtain K data sets, where the subsequences in each of the K data sets have equal length and K is a natural number greater than 1. A predetermined symbol is a punctuation mark used for sentence breaks, such as a full stop, question mark, exclamation mark, comma, enumeration comma, semicolon or colon.
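Step S106 can be sketched as follows, shown on raw character strings rather than encoded sequences for readability; the input sentence and the function name are illustrative assumptions.

```python
import re
from collections import defaultdict

# Punctuation used for sentence breaks (full stop, question mark, exclamation
# mark, comma, enumeration comma, semicolon, colon), per step S106.
PREDETERMINED_SYMBOLS = "。？！，、；："

def cut_and_group(sequence_data):
    """Cut each sequence at the predetermined symbols, then group the
    resulting subsequences by length, yielding the K data sets of S106."""
    groups = defaultdict(list)  # subsequence length -> list of subsequences
    for seq in sequence_data:
        for sub in re.split("[" + PREDETERMINED_SYMBOLS + "]", seq):
            if sub:  # drop empty pieces produced by trailing punctuation
                groups[len(sub)].append(sub)
    return dict(groups)

groups = cut_and_group(["我爱北京天安门，你好。"])
# Two subsequences of lengths 7 and 2, hence K = 2 data sets here.
```

Grouping by length means every model in step S108 is trained only on inputs of one fixed length, which is what later allows a model to be selected by sequence length at inference time.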
Step S108, extracting multiple subsequence data from the i-th data set, inputting the extracted subsequence data into the i-th temporal convolutional network-conditional random field model, and training that model to obtain the i-th trained TCN-CRF model, with i taking each natural number from 1 to K in turn, so that K trained TCN-CRF models are obtained in total.
Step S110, converting the target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained TCN-CRF models to obtain the word segmentation result of the target corpus data.
Corpus data are linguistic data that have actually occurred in the real use of language; they are a basic resource that carries linguistic knowledge, with the electronic computer as the carrier.
A temporal convolutional network-conditional random field model (TCN-CRF) is a combined model of a temporal convolutional network (TCN) and a conditional random field (CRF). The temporal convolutional network is a deep-learning convolutional network for time series; the conditional random field is a typical discriminative model. The conditional random field treats word segmentation as a position-of-character classification problem, and the position of a character within a word is usually defined as: word beginning, usually denoted B; word middle, usually denoted M; word end, usually denoted E; single-character word, usually denoted S. CRF segmentation then labels each character with its position and forms a word from the characters between each B and the following E, with each S character forming a single-character word. For example, for the sentence to be segmented "我爱北京天安门" ("I love Beijing Tiananmen"), the labelling is: 我/S 爱/S 北/B 京/E 天/B 安/M 门/E, and the word segmentation result is "我/爱/北京/天安门".
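The recovery of words from the B/M/E/S labels described above can be sketched as follows; the sentence and tags are taken from the example, while the function name is an assumption of ours.

```python
def labels_to_words(chars, tags):
    """Form words from position-of-character labels: the characters from a B
    through the following E make one word, each S is a single-character word,
    and M marks the middle characters of a word."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            current = ch
        elif tag == "M":
            current += ch
        else:  # "E": word end, close the current word
            words.append(current + ch)
            current = ""
    return words

print("/".join(labels_to_words("我爱北京天安门", "SSBEBME")))
# 我/爱/北京/天安门
```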
In the embodiment of the invention, target corpus data are converted into character-level data; the character-level data are converted into sequence data; and the sequence data are input into the trained TCN-CRF models to obtain the word segmentation result of the target corpus data. A temporal convolutional network can expand its receptive field at an exponential rate by adding network layers, so it can process sequence data of greater length, or data of otherwise complex characteristics, and improve the accuracy of its encoding, thereby improving the accuracy of Chinese word segmentation.
Moreover, the neuron weights on the same feature map of a temporal convolutional network are identical, so it can learn in parallel and process data quickly; for this reason a TCN-CRF model can also be realized in a distributed system.
Optionally, converting the character-level data into sequence data comprises: converting the character-level data into sequence data by a preset encoding mode, the preset encoding mode being any one of the following: one-hot encoding or word-to-vector encoding.
One-hot encoding, also known as one-of-N encoding, uses an N-bit status register to encode N states; each state has its own independent register bit, and at any time only one of them is active. For example, suppose the feature of a group of data is colour, with the values yellow, red and green: after one-hot encoding, yellow becomes [100], red becomes [010] and green becomes [001]. Sequence data one-hot encoded in this way correspond to vectors and can be used in a neural network model.
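The colour example above can be reproduced with a small one-hot encoder; this is a minimal sketch, and a real pipeline would normally use a library implementation.

```python
def one_hot(values):
    """Encode each distinct value as an N-bit vector with exactly one bit
    set, where N is the number of distinct states."""
    states = list(dict.fromkeys(values))  # distinct states, first-seen order
    n = len(states)
    return {s: [1 if i == states.index(s) else 0 for i in range(n)]
            for s in states}

codes = one_hot(["yellow", "red", "green"])
print(codes["yellow"], codes["red"], codes["green"])
# [1, 0, 0] [0, 1, 0] [0, 0, 1]
```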
Word-to-vector encoding can be word2vec, an efficient algorithmic model that represents words as real-valued vectors; through training, the processing of text content can be reduced to vector operations in a K-dimensional vector space. The word vectors output by word2vec can be used for much natural language processing (NLP) work, for example clustering, finding synonyms and part-of-speech analysis. For instance, word2vec maps the character-level data, as features, into a K-dimensional vector space, obtaining sequence data represented by feature vectors.
Optionally, inputting the extracted subsequence data into the i-th TCN-CRF model and training it to obtain the i-th trained TCN-CRF model comprises: S1, inputting the extracted subsequence data into the i-th temporal convolutional network for forward propagation to obtain first output data, the i-th temporal convolutional network being the TCN in the i-th TCN-CRF model; S2, calculating the value of a loss function from the first output data and the input subsequence data; S3, if the value of the loss function is greater than a preset value, inputting the subsequence data into the i-th temporal convolutional network for backpropagation and optimizing the network parameters of the i-th temporal convolutional network; S4, repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value; S5, if the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the i-th trained temporal convolutional network; S6, inputting the data output by the i-th trained temporal convolutional network into the i-th conditional random field and training it, obtaining the i-th trained TCN-CRF model, the i-th conditional random field being the CRF in the i-th TCN-CRF model.
Here, training the i-th temporal convolutional network is based on the value of the loss function and specifically comprises: initializing the network parameters of the i-th temporal convolutional network and training it iteratively by the stochastic gradient descent method, computing the value of the loss function once per iteration, until after repeated iterations the value of the loss function reaches its minimum, at which point the i-th trained temporal convolutional network and the corresponding converged network parameters are obtained.
The specific formula for calculating the loss function can be as follows, where Loss denotes the value of the loss function, N denotes the number of subsequence data input into the i-th temporal convolutional network, y^(i) denotes the i-th subsequence data input into the i-th temporal convolutional network, and ŷ^(i) denotes the data output after the i-th subsequence data are input into the i-th temporal convolutional network.
Optionally, training the i-th conditional random field comprises: calculating the conditional probability of the output data of the i-th conditional random field from the data output by the i-th trained temporal convolutional network; and obtaining the maximum of that conditional probability by maximum likelihood estimation.
A conditional random field is a Markov random field of a random variable Y under the condition of a given random variable X. In a Markov random field, any random variable is related only to the random variables adjacent to it and is unrelated to the non-adjacent ones.
In the conditional probability model P(Y|X), Y is the output variable, representing the label sequence (also called the state sequence), and X is the input variable, representing the observation sequence to be labelled. During training, the training data are used to obtain the conditional probability model by maximum likelihood estimation; the model is then used for prediction: for a given input sequence X, the output sequence Y is the one for which the conditional probability is maximal. Commonly a linear-chain conditional random field is used: for an input sequence X = (X1, X2, ..., Xn) and an output sequence of random variables Y = (Y1, Y2, ..., Yn) represented by a linear chain, if, under the condition of the given random variable sequence X, the conditional probability distribution P(Y|X) of the random variable sequence Y constitutes a Markov random field, this constitutes a conditional random field.
Maximum likelihood estimation refers to observing the results of several trials and using those results to find the parameter value that maximizes the probability of the observed sample. It provides a method of evaluating model parameters from given observation data, i.e., "the model is determined, the parameters are unknown". Given sample data X = (X1, X2, ..., Xn), where n is the number of sample data, a parameter t is to be estimated; the likelihood function of t relative to X is the product of p(Xi; t) over i = 1 to n, that is, f(t) = p(X1; t) × p(X2; t) × ... × p(Xn; t). If t' is the value of t in the parameter space that maximizes the likelihood function f(t), then t' is the "most probable" parameter value, and t' is the maximum likelihood estimate of t.
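As an illustrative sketch of this idea (not part of the claimed method), the maximum likelihood estimate of a Bernoulli parameter t from 0/1 observations has a closed form — the sample mean — and it indeed attains a higher (log-)likelihood than other candidate values:

```python
import math

def log_likelihood(t, xs):
    """Log of f(t): the sum of log p(Xi; t) for Bernoulli observations xs."""
    return sum(x * math.log(t) + (1 - x) * math.log(1 - t) for x in xs)

def mle_bernoulli(xs):
    """The maximizing t' has a closed form here: the sample mean."""
    return sum(xs) / len(xs)

xs = [1, 1, 0, 1]
t_prime = mle_bernoulli(xs)  # the "most probable" parameter for this sample
```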
Optionally, inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain the word segmentation result of the target corpus data comprises: cutting the second data according to the predetermined symbols to obtain multiple sequence data; grouping the multiple sequence data according to sequence-data length to obtain L data sets, where all sequence data in each of the L data sets are of equal length and L is a natural number with 1 ≤ L ≤ K; according to the subsequence-data lengths used during training, selecting L trained models from the K trained timing convolutional neural network-conditional random field models, namely the L1-th to LL-th trained models; inputting all sequence data contained in the j-th data set into the Lj-th trained timing convolutional neural network-conditional random field model to obtain multiple word segmentation results, where the subsequence-data length used in training the Lj-th model equals the length of the sequence data contained in the j-th data set, j successively takes the natural numbers 1 to L, and Lj is a natural number from 1 to K; and splicing the multiple word segmentation results to obtain the word segmentation result of the target corpus data.
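For illustration only, the select-by-length routing and final splicing described above can be sketched as follows; the per-length models here are stand-in callables, not the trained timing convolutional neural network-conditional random field models.

```python
def route_and_splice(sequences, models_by_length):
    """Send each sequence to the model trained on its length,
    then splice the per-sequence word lists together in order."""
    spliced = []
    for seq in sequences:
        model = models_by_length[len(seq)]  # select by matching length
        spliced.extend(model(seq))
    return spliced

# Stand-in "models": each tags its input to show which model was selected.
models = {2: lambda s: [("len2", s)], 3: lambda s: [("len3", s)]}
result = route_and_splice(["ab", "xyz"], models)
```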
For example, assume the value of K is 5, and the subsequence lengths used in training the 5 trained timing convolutional neural network-conditional random field models are 10, 20, 30, 40 and 50 respectively. After the second data are cut, 2 sequence data are obtained, of lengths 20 and 50. According to the subsequence lengths 20 and 50 used during training, 2 trained models are selected from the 5 trained timing convolutional neural network-conditional random field models: the subsequence length used in training the 1st selected model is 20, and the subsequence length used in training the 2nd selected model is 50. The sequence data of length 20 are then input into the 1st selected trained model to obtain multiple word segmentation results; the sequence data of length 50 are input into the 2nd selected trained model to obtain multiple word segmentation results; and the word segmentation results output by the two trained models are spliced to obtain the word segmentation result of the target corpus data.
Fig. 2 is a schematic diagram of an optional Chinese word segmentation device based on deep learning according to an embodiment of the present invention. The device is configured to execute the above Chinese word segmentation method based on deep learning. As shown in Fig. 2, the device comprises: a first converting unit 10, a second converting unit 20, a first cutting unit 30, a first determination unit 40 and a second determination unit 50.
The first converting unit 10 is configured to convert training corpus data into data of character level.
The second converting unit 20 is configured to convert the data of character level into sequence data.
The first cutting unit 30 is configured to cut the sequence data according to predetermined symbols to obtain multiple subsequence data, and to group the multiple subsequence data according to subsequence-data length to obtain K data sets, where the subsequence data contained in each of the K data sets are of equal length and K is a natural number greater than 1. The predetermined symbols are punctuation marks used to break sentences, such as the full stop, question mark, exclamation mark, comma, enumeration comma, semicolon and colon.
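The cutting and grouping performed by the first cutting unit can be sketched as below; the exact punctuation set is an assumption drawn from the examples listed above.

```python
import re

# Cut points: full stop, question/exclamation mark, comma, enumeration
# comma, semicolon, colon (Chinese forms plus ASCII equivalents).
CUT = re.compile(r"[。？！，、；：.?!,;:]")

def cut_on_punct(text):
    """Cut a sequence at the predetermined symbols, dropping empty pieces."""
    return [piece for piece in CUT.split(text) if piece]

def group_by_length(pieces):
    """Group subsequences so each group holds pieces of equal length."""
    groups = {}
    for p in pieces:
        groups.setdefault(len(p), []).append(p)
    return groups
```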
The first determination unit 40 is configured to extract multiple subsequence data from the i-th data set, input the extracted multiple subsequence data into the i-th timing convolutional neural network-conditional random field model, and train the i-th timing convolutional neural network-conditional random field model to obtain the trained i-th model; i successively takes the natural numbers 1 to K, so that K trained timing convolutional neural network-conditional random field models are obtained in total.
The second determination unit 50 is configured to convert target corpus data into data of character level to obtain first data, convert the first data into sequence data to obtain second data, and input the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain the word segmentation result of the target corpus data.
Corpus data are the basic resource carrying linguistic knowledge, with the electronic computer as the carrier; they are linguistic material that has actually occurred in real language use.
The timing convolutional neural network-conditional random field model (TCN-CRF) is a combined model of a temporal convolutional network (TCN) and a conditional random field (CRF). The timing convolutional neural network is a deep-learning time convolutional network; the conditional random field is a typical discriminative model. The conditional random field treats word segmentation as a character-position classification problem. The character positions within a word are usually defined as: word beginning, usually denoted B; word middle, usually denoted M; word end, usually denoted E; and single-character word, usually denoted S. The CRF segmentation process labels the character positions and then composes words from the characters between each B and E together with the S single characters. For example, for the sentence to be segmented "我爱北京天安门" ("I love Beijing Tiananmen"), the labeling is: 我/S 爱/S 北/B 京/E 天/B 安/M 门/E, and the word segmentation result is: "我/爱/北京/天安门" ("I / love / Beijing / Tiananmen").
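The B/M/E/S recombination step can be sketched as follows; this is an illustrative decoding of already-given tags, not the CRF inference itself.

```python
def bmes_to_words(chars, tags):
    """Compose words from character-position tags: each S is a word by
    itself; the characters from a B through the next E form one word."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = ch
        elif tag == "M":
            buf += ch
        else:  # "E": close the current word
            words.append(buf + ch)
            buf = ""
    return words

# "我爱北京天安门" with tags S S B E B M E -> 我 / 爱 / 北京 / 天安门
words = bmes_to_words("我爱北京天安门", ["S", "S", "B", "E", "B", "M", "E"])
```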
In the embodiment of the present invention, target corpus data are converted into data of character level; the data of character level are converted into sequence data; and the sequence data are input into the trained timing convolutional neural network-conditional random field model to obtain the word segmentation result of the target corpus data. The timing convolutional neural network can expand its receptive field at an exponential rate by increasing the number of network layers, so it can process sequence data of longer length, or data of otherwise complex characteristics, and improve the accuracy of the encoded result, thereby improving the accuracy of Chinese word segmentation.
Moreover, the neuron weights on the same feature mapping plane of the timing convolutional neural network are identical, so the network can learn in parallel and process data quickly; the timing convolutional neural network-conditional random field model can therefore also be implemented in a distributed system.
Optionally, the second converting unit 20 comprises a conversion subunit. The conversion subunit is configured to convert the data of character level into sequence data by a preset encoding mode, the preset encoding mode being any one of the following: one-hot encoding or word-to-vector (word2vec) encoding.
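Of the two preset encodings, one-hot encoding can be sketched as follows; building the vocabulary from the text itself is an illustrative simplification.

```python
def one_hot_encode(text):
    """Map each character to a unit vector over the text's own vocabulary."""
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    return [[1 if j == index[ch] else 0 for j in range(len(vocab))]
            for ch in text]

vectors = one_hot_encode("aba")  # vocabulary here is ['a', 'b']
```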
Optionally, the first determination unit 40 is configured to execute the following steps. S1: input the extracted multiple subsequence data into the i-th timing convolutional neural network for forward propagation to obtain first output data, the i-th timing convolutional neural network being the timing convolutional neural network in the i-th timing convolutional neural network-conditional random field model. S2: calculate the value of the loss function according to the first output data and the input multiple subsequence data. S3: if the value of the loss function is greater than a preset value, input the multiple subsequence data into the i-th timing convolutional neural network for backpropagation, and optimize the network parameters of the i-th timing convolutional neural network. S4: repeat steps S1 to S3 until the value of the loss function is less than or equal to the preset value. S5: if the value of the loss function is less than or equal to the preset value, determine that training is complete, obtaining the trained i-th timing convolutional neural network. S6: input the data output by the trained i-th timing convolutional neural network into the i-th conditional random field, and train the i-th conditional random field to obtain the trained i-th timing convolutional neural network-conditional random field model, the i-th conditional random field being the conditional random field in the i-th timing convolutional neural network-conditional random field model.
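Steps S1 to S5 can be sketched generically as below; the toy one-parameter "network" and its update rule are stand-ins for the timing convolutional neural network's actual forward propagation and backpropagation.

```python
def train_until_preset(forward, backward, params, batch, preset, max_iters=1000):
    """S1: forward pass; S2: loss; S3: backpropagate while loss > preset;
    S4: repeat; S5: stop once loss <= preset."""
    loss = float("inf")
    for _ in range(max_iters):
        out = forward(params, batch)                          # S1
        loss = sum((o - b) ** 2 for o, b in zip(out, batch))  # S2
        if loss <= preset:                                    # S5
            break
        params = backward(params, batch)                      # S3/S4
    return params, loss

# Toy "network": output = w * x; learning w -> 1 reproduces the input.
forward = lambda w, xs: [w[0] * x for x in xs]
backward = lambda w, xs: [w[0] + 0.1 * (1.0 - w[0])]  # nudge w toward 1
params, loss = train_until_preset(forward, backward, [0.0], [1.0, 2.0], 1e-4)
```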
Optionally, the first determination unit comprises a first computation subunit and a first determining subunit. The first computation subunit is configured to calculate the conditional probability of the output data of the i-th conditional random field according to the data output by the trained i-th timing convolutional neural network. The first determining subunit is configured to obtain the maximum value of the conditional probability of the output data of the i-th conditional random field through maximum likelihood estimation training.
Optionally, the second determination unit 50 comprises a cutting subunit, a grouping subunit, a second determining subunit and a splicing subunit. The cutting subunit is configured to cut the second data according to the predetermined symbols to obtain multiple sequence data. The grouping subunit is configured to group the multiple sequence data according to sequence-data length to obtain L data sets, where all sequence data in each of the L data sets are of equal length and L is a natural number with 1 ≤ L ≤ K. The second determining subunit is configured to select, according to the subsequence-data lengths used during training, L trained models from the K trained timing convolutional neural network-conditional random field models, namely the L1-th to LL-th trained models, and to input all sequence data contained in the j-th data set into the Lj-th trained timing convolutional neural network-conditional random field model to obtain multiple word segmentation results, where the subsequence-data length used in training the Lj-th model equals the length of the sequence data contained in the j-th data set, j successively takes the natural numbers 1 to L, and Lj is a natural number from 1 to K. The splicing subunit is configured to splice the multiple word segmentation results to obtain the word segmentation result of the target corpus data.
In one aspect, an embodiment of the present invention provides a storage medium. The storage medium comprises a stored program, wherein when the program runs, the device on which the storage medium resides is controlled to execute the following steps: converting training corpus data into data of character level; converting the data of character level into sequence data; cutting the sequence data according to predetermined symbols to obtain multiple subsequence data, and grouping the multiple subsequence data according to subsequence-data length to obtain K data sets, where the subsequence data contained in each of the K data sets are of equal length and K is a natural number greater than 1; extracting multiple subsequence data from the i-th data set, inputting the extracted multiple subsequence data into the i-th timing convolutional neural network-conditional random field model, and training the i-th timing convolutional neural network-conditional random field model to obtain the trained i-th model, where i successively takes the natural numbers 1 to K, so that K trained timing convolutional neural network-conditional random field models are obtained in total; and converting target corpus data into data of character level to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain the word segmentation result of the target corpus data.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to execute the following step: converting the data of character level into sequence data by a preset encoding mode, the preset encoding mode being any one of the following: one-hot encoding or word-to-vector (word2vec) encoding.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to execute the following steps. S1: input the extracted multiple subsequence data into the i-th timing convolutional neural network for forward propagation to obtain first output data, the i-th timing convolutional neural network being the timing convolutional neural network in the i-th timing convolutional neural network-conditional random field model. S2: calculate the value of the loss function according to the first output data and the input multiple subsequence data. S3: if the value of the loss function is greater than a preset value, input the multiple subsequence data into the i-th timing convolutional neural network for backpropagation, and optimize the network parameters of the i-th timing convolutional neural network. S4: repeat steps S1 to S3 until the value of the loss function is less than or equal to the preset value. S5: if the value of the loss function is less than or equal to the preset value, determine that training is complete, obtaining the trained i-th timing convolutional neural network. S6: input the data output by the trained i-th timing convolutional neural network into the i-th conditional random field, and train the i-th conditional random field to obtain the trained i-th timing convolutional neural network-conditional random field model, the i-th conditional random field being the conditional random field in the i-th timing convolutional neural network-conditional random field model.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to execute the following steps: calculating the conditional probability of the output data of the i-th conditional random field according to the data output by the trained i-th timing convolutional neural network; and obtaining the maximum value of the conditional probability of the output data of the i-th conditional random field through maximum likelihood estimation training.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to execute the following steps: cutting the second data according to the predetermined symbols to obtain multiple sequence data; grouping the multiple sequence data according to sequence-data length to obtain L data sets, where all sequence data in each of the L data sets are of equal length and L is a natural number with 1 ≤ L ≤ K; according to the subsequence-data lengths used during training, selecting L trained models from the K trained timing convolutional neural network-conditional random field models, namely the L1-th to LL-th trained models; inputting all sequence data contained in the j-th data set into the Lj-th trained timing convolutional neural network-conditional random field model to obtain multiple word segmentation results, where the subsequence-data length used in training the Lj-th model equals the length of the sequence data contained in the j-th data set, j successively takes the natural numbers 1 to L, and Lj is a natural number from 1 to K; and splicing the multiple word segmentation results to obtain the word segmentation result of the target corpus data.
In one aspect, an embodiment of the present invention provides a computer device comprising a memory and a processor. The memory is configured to store information including program instructions, and the processor is configured to control the execution of the program instructions. When loaded and executed by the processor, the program instructions implement the following steps: converting training corpus data into data of character level; converting the data of character level into sequence data; cutting the sequence data according to predetermined symbols to obtain multiple subsequence data, and grouping the multiple subsequence data according to subsequence-data length to obtain K data sets, where the subsequence data contained in each of the K data sets are of equal length and K is a natural number greater than 1; extracting multiple subsequence data from the i-th data set, inputting the extracted multiple subsequence data into the i-th timing convolutional neural network-conditional random field model, and training the i-th timing convolutional neural network-conditional random field model to obtain the trained i-th model, where i successively takes the natural numbers 1 to K, so that K trained timing convolutional neural network-conditional random field models are obtained in total; and converting target corpus data into data of character level to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain the word segmentation result of the target corpus data.
Optionally, when loaded and executed by the processor, the program instructions also implement the following step: converting the data of character level into sequence data by a preset encoding mode, the preset encoding mode being any one of the following: one-hot encoding or word-to-vector (word2vec) encoding.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps. S1: input the extracted multiple subsequence data into the i-th timing convolutional neural network for forward propagation to obtain first output data, the i-th timing convolutional neural network being the timing convolutional neural network in the i-th timing convolutional neural network-conditional random field model. S2: calculate the value of the loss function according to the first output data and the input multiple subsequence data. S3: if the value of the loss function is greater than a preset value, input the multiple subsequence data into the i-th timing convolutional neural network for backpropagation, and optimize the network parameters of the i-th timing convolutional neural network. S4: repeat steps S1 to S3 until the value of the loss function is less than or equal to the preset value. S5: if the value of the loss function is less than or equal to the preset value, determine that training is complete, obtaining the trained i-th timing convolutional neural network. S6: input the data output by the trained i-th timing convolutional neural network into the i-th conditional random field, and train the i-th conditional random field to obtain the trained i-th timing convolutional neural network-conditional random field model, the i-th conditional random field being the conditional random field in the i-th timing convolutional neural network-conditional random field model.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: calculating the conditional probability of the output data of the i-th conditional random field according to the data output by the trained i-th timing convolutional neural network; and obtaining the maximum value of the conditional probability of the output data of the i-th conditional random field through maximum likelihood estimation training.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: cutting the second data according to the predetermined symbols to obtain multiple sequence data; grouping the multiple sequence data according to sequence-data length to obtain L data sets, where all sequence data in each of the L data sets are of equal length and L is a natural number with 1 ≤ L ≤ K; according to the subsequence-data lengths used during training, selecting L trained models from the K trained timing convolutional neural network-conditional random field models, namely the L1-th to LL-th trained models; inputting all sequence data contained in the j-th data set into the Lj-th trained timing convolutional neural network-conditional random field model to obtain multiple word segmentation results, where the subsequence-data length used in training the Lj-th model equals the length of the sequence data contained in the j-th data set, j successively takes the natural numbers 1 to L, and Lj is a natural number from 1 to K; and splicing the multiple word segmentation results to obtain the word segmentation result of the target corpus data.
Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present invention. As shown in Fig. 3, the computer device 50 of this embodiment comprises: a processor 51, a memory 52, and a computer program 53 stored in the memory 52 and executable on the processor 51. When executed by the processor 51, the computer program 53 implements the Chinese word segmentation method based on deep learning of the embodiment, which is not repeated here to avoid repetition. Alternatively, when executed by the processor 51, the computer program implements the functions of each model/unit of the Chinese word segmentation device based on deep learning of the embodiment, which are likewise not repeated here.
The computer device 50 may be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device. The computer device may include, but is not limited to, the processor 51 and the memory 52. Those skilled in the art will understand that Fig. 3 is only an example of the computer device 50 and does not constitute a limitation on the computer device 50, which may include more or fewer components than illustrated, combine certain components, or use different components; for example, the computer device may also include input/output devices, network access devices, buses, etc.
The processor 51 may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 52 may be an internal storage unit of the computer device 50, such as the hard disk or internal memory of the computer device 50. The memory 52 may also be an external storage device fitted to the computer device 50, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. Further, the memory 52 may include both an internal storage unit of the computer device 50 and an external storage device. The memory 52 is used to store the computer program and other programs and data needed by the computer device, and may also be used to temporarily store data that has been output or will be output.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices and methods may be realized in other ways. For example, the device embodiments described above are merely exemplary: the division into units is only a logical functional division, and other division manners are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be realized either in the form of hardware or in the form of hardware plus a software functional unit.
The above integrated unit realized in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disk.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent substitution, improvement and the like made within the spirit and principle of the present invention shall be included within the scope of protection of the present invention.
Claims (10)
1. A Chinese word segmentation method based on deep learning, characterized in that the method comprises:
converting training corpus data into data of character level;
converting the data of the character level into sequence data;
cutting the sequence data according to predetermined symbols to obtain multiple subsequence data, and grouping the multiple subsequence data according to the length of the subsequence data to obtain K data sets, wherein the subsequence data contained in each of the K data sets are of equal length and K is a natural number greater than 1;
extracting multiple subsequence data from the i-th data set, inputting the extracted multiple subsequence data into the i-th timing convolutional neural network-conditional random field model, and training the i-th timing convolutional neural network-conditional random field model to obtain the trained i-th timing convolutional neural network-conditional random field model, wherein i successively takes the natural numbers 1 to K, so that K trained timing convolutional neural network-conditional random field models are obtained in total; and
converting target corpus data into data of character level to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain a word segmentation result of the target corpus data.
2. The method according to claim 1, characterized in that converting the data of the character level into sequence data comprises:
converting the data of the character level into the sequence data by a preset encoding mode, the preset encoding mode being any one of the following: one-hot encoding or word-to-vector (word2vec) encoding.
3. The method according to claim 1, wherein inputting the extracted multiple sub-sequence data into the i-th temporal convolutional neural network-conditional random field model and training the i-th temporal convolutional neural network-conditional random field model to obtain the i-th trained temporal convolutional neural network-conditional random field model comprises:
S1: inputting the extracted multiple sub-sequence data into the i-th temporal convolutional neural network for forward propagation to obtain first output data, the i-th temporal convolutional neural network being the temporal convolutional neural network in the i-th temporal convolutional neural network-conditional random field model;
S2: calculating the value of a loss function from the first output data and the input multiple sub-sequence data;
S3: if the value of the loss function is greater than a preset value, inputting the multiple sub-sequence data into the i-th temporal convolutional neural network for backpropagation and optimizing the network parameters of the i-th temporal convolutional neural network;
S4: repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value;
S5: if the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the i-th trained temporal convolutional neural network;
S6: inputting the output data of the i-th trained temporal convolutional neural network into the i-th conditional random field and training the i-th conditional random field to obtain the i-th trained temporal convolutional neural network-conditional random field model, the i-th conditional random field being the conditional random field in the i-th temporal convolutional neural network-conditional random field model.
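The S1 to S5 loop of claim 3 (forward pass, loss check against a preset threshold, backward pass, repeat) can be sketched with a deliberately tiny stand-in model; the quadratic "network" with one scalar parameter replaces the temporal convolutional neural network, and all names and values are illustrative assumptions.

```python
# Hedged sketch of the claim-3 training loop: iterate forward propagation,
# loss evaluation, and backpropagation until the loss reaches the preset
# threshold. A one-parameter linear model stands in for the network.
def train_until_threshold(x, target, lr=0.1, preset=1e-4, max_steps=1000):
    w = 0.0  # the network parameter being optimized
    for _ in range(max_steps):
        out = w * x                    # S1: forward propagation
        loss = (out - target) ** 2     # S2: value of the loss function
        if loss <= preset:             # S5: training is complete
            return w, loss
        grad = 2 * (out - target) * x  # S3: backpropagation (gradient)
        w -= lr * grad                 #     optimize the network parameter
    return w, loss                     # S4 is the loop itself

w, loss = train_until_threshold(x=1.0, target=2.0)
# the loop exits once the loss is at or below the preset value
```

Only after this loop converges does S6 pass the network's outputs on to the conditional random field for its own training stage.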
4. The method according to claim 3, wherein training the i-th conditional random field comprises:
calculating the conditional probability of the output data of the i-th conditional random field from the output data of the i-th trained temporal convolutional neural network;
maximizing the conditional probability of the output data of the i-th conditional random field by maximum likelihood estimation training.
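The conditional probability that claim 4 maximizes can be illustrated for a linear-chain conditional random field: p(y|x) is the path score normalized by the log-partition function, computed here with the forward algorithm. The toy emission and transition scores are illustrative assumptions, not learned values from the patent.

```python
# Sketch of claim 4's quantity: log p(tags | emissions) for a linear-chain
# CRF, i.e. the conditional probability whose maximum likelihood is sought.
import numpy as np

def crf_log_prob(emissions, transitions, tags):
    """Log conditional probability of a tag path under a linear-chain CRF."""
    # Score of the given tag path
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # Log-partition over all paths (forward algorithm)
    alpha = emissions[0]
    for t in range(1, emissions.shape[0]):
        alpha = emissions[t] + np.logaddexp.reduce(
            alpha[:, None] + transitions, axis=0)
    log_z = np.logaddexp.reduce(alpha)
    return score - log_z

emissions = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
transitions = np.zeros((2, 2))  # uniform transitions for the toy example
p = np.exp(crf_log_prob(emissions, transitions, [0, 1]))
# with uniform transitions p(y|x) factorizes, giving p = 0.7 * 0.6
```

Maximum likelihood training then adjusts the emission and transition scores to raise this probability for the observed tag sequences.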
5. The method according to any one of claims 1 to 4, wherein inputting the second data into at least one of the K trained temporal convolutional neural network-conditional random field models to obtain the word segmentation result of the target corpus data comprises:
cutting the second data at the predetermined symbol to obtain multiple sequence data;
grouping the multiple sequence data by sequence length to obtain L data sets, wherein all sequence data in each of the L data sets are of equal length and L is a natural number with 1 ≤ L ≤ K;
filtering out, from the K trained temporal convolutional neural network-conditional random field models and according to the sub-sequence lengths used in training, L trained models, namely the L1-th to the LL-th trained temporal convolutional neural network-conditional random field models; inputting all sequence data of the j-th data set into the Lj-th trained temporal convolutional neural network-conditional random field model to obtain multiple word segmentation results, wherein the length of the sub-sequence data used in training the Lj-th trained model equals the length of the sequence data in the j-th data set, j takes each natural number from 1 to L in turn, and Lj is a natural number from 1 to K;
splicing the multiple word segmentation results to obtain the word segmentation result of the target corpus data.
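The inference flow of claim 5 (cut the target sequence, group pieces by length, route each group to the trained model whose training length matches, then splice the results back together) can be sketched as below; the `models` dictionary keyed by length stands in for the K trained temporal convolutional neural network-conditional random field models, and all names are illustrative.

```python
# Sketch of claim 5's inference: each piece of the target sequence is sent
# to the model trained on sub-sequences of the same length, and the
# per-piece segmentation results are spliced back in the original order.
def segment(sequence, delimiter, models):
    pieces = [s for s in sequence.split(delimiter) if s]
    # route each piece to the model trained on a matching sequence length
    segmented = [models[len(p)](p) for p in pieces]
    return delimiter.join(segmented)  # splice the results back together

# Toy stand-ins for trained models: one per input length.
models = {2: lambda s: "/".join(s), 3: lambda s: s}
out = segment("ab.cde.fg", ".", models)
# out == "a/b.cde.f/g"
```

Keying the models by training length is what lets the filtering step of claim 5 pick only the L models that are actually needed for the target corpus.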
6. A Chinese word segmentation device based on deep learning, characterized in that the device comprises:
a first conversion unit, configured to convert training corpus data into character-level data;
a second conversion unit, configured to convert the character-level data into sequence data;
a first cutting unit, configured to cut the sequence data at a predetermined symbol to obtain multiple sub-sequence data and to group the multiple sub-sequence data by sub-sequence length to obtain K data sets, wherein the sub-sequence data in each of the K data sets are of equal length and K is a natural number greater than 1;
a first determination unit, configured to extract multiple sub-sequence data from the i-th data set, input the extracted multiple sub-sequence data into the i-th temporal convolutional neural network-conditional random field model, and train the i-th temporal convolutional neural network-conditional random field model to obtain the i-th trained temporal convolutional neural network-conditional random field model, wherein i takes each natural number from 1 to K in turn, yielding K trained temporal convolutional neural network-conditional random field models;
a second determination unit, configured to convert target corpus data into character-level data to obtain first data, convert the first data into sequence data to obtain second data, and input the second data into at least one of the K trained temporal convolutional neural network-conditional random field models to obtain a word segmentation result for the target corpus data.
7. The device according to claim 6, wherein the second conversion unit comprises:
a conversion subunit, configured to convert the character-level data into the sequence data by a preset encoding scheme, the preset encoding scheme being any one of the following: one-hot encoding or word-to-vector (word2vec) encoding.
8. The device according to claim 6, wherein the first determination unit is configured to execute:
S1: inputting the extracted multiple sub-sequence data into the i-th temporal convolutional neural network for forward propagation to obtain first output data, the i-th temporal convolutional neural network being the temporal convolutional neural network in the i-th temporal convolutional neural network-conditional random field model;
S2: calculating the value of a loss function from the first output data and the input multiple sub-sequence data;
S3: if the value of the loss function is greater than a preset value, inputting the multiple sub-sequence data into the i-th temporal convolutional neural network for backpropagation and optimizing the network parameters of the i-th temporal convolutional neural network;
S4: repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value;
S5: if the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the i-th trained temporal convolutional neural network;
S6: inputting the output data of the i-th trained temporal convolutional neural network into the i-th conditional random field and training the i-th conditional random field to obtain the i-th trained temporal convolutional neural network-conditional random field model, the i-th conditional random field being the conditional random field in the i-th temporal convolutional neural network-conditional random field model.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to perform the Chinese word segmentation method based on deep learning according to any one of claims 1 to 5.
10. A computer device, comprising a memory and a processor, the memory being configured to store information including program instructions and the processor being configured to control execution of the program instructions, characterized in that the program instructions, when loaded and executed by the processor, implement the steps of the Chinese word segmentation method based on deep learning according to any one of claims 1 to 5.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910322127.8A CN110222329B (en) | 2019-04-22 | 2019-04-22 | Chinese word segmentation method and device based on deep learning |
SG11202111464WA SG11202111464WA (en) | 2019-04-22 | 2019-11-14 | Method, device, storage medium, and computing device for segmenting chinese word based on deep learning |
JP2021563188A JP7178513B2 (en) | 2019-04-22 | 2019-11-14 | Chinese word segmentation method, device, storage medium and computer equipment based on deep learning |
PCT/CN2019/118259 WO2020215694A1 (en) | 2019-04-22 | 2019-11-14 | Chinese word segmentation method and apparatus based on deep learning, and storage medium and computer device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910322127.8A CN110222329B (en) | 2019-04-22 | 2019-04-22 | Chinese word segmentation method and device based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110222329A true CN110222329A (en) | 2019-09-10 |
CN110222329B CN110222329B (en) | 2023-11-24 |
Family
ID=67819927
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910322127.8A Active CN110222329B (en) | 2019-04-22 | 2019-04-22 | Chinese word segmentation method and device based on deep learning |
Country Status (4)
Country | Link |
---|---|
JP (1) | JP7178513B2 (en) |
CN (1) | CN110222329B (en) |
SG (1) | SG11202111464WA (en) |
WO (1) | WO2020215694A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020215694A1 (en) * | 2019-04-22 | 2020-10-29 | 平安科技(深圳)有限公司 | Chinese word segmentation method and apparatus based on deep learning, and storage medium and computer device |
CN113341919A (en) * | 2021-05-31 | 2021-09-03 | 中国科学院重庆绿色智能技术研究院 | Computing system fault prediction method based on time sequence data length optimization |
TWI771841B (en) * | 2020-05-08 | 2022-07-21 | 南韓商韓領有限公司 | Systems and methods for word segmentation based on a competing neural character language model |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112528648A (en) * | 2020-12-10 | 2021-03-19 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for predicting polyphone pronunciation |
CN112884087A (en) * | 2021-04-07 | 2021-06-01 | 山东大学 | Biological enhancer and identification method for type thereof |
CN114863995B (en) * | 2022-03-30 | 2024-05-07 | 安徽大学 | Silencer prediction method based on bidirectional gating cyclic neural network |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104182423A (en) * | 2013-05-27 | 2014-12-03 | 华东师范大学 | Conditional random field-based automatic Chinese personal name recognition method |
CN107977354A (en) * | 2017-10-12 | 2018-05-01 | 北京知道未来信息技术有限公司 | A kind of mixing language material segmenting method based on Bi-LSTM-CNN |
CN108268444A (en) * | 2018-01-10 | 2018-07-10 | 南京邮电大学 | A kind of Chinese word cutting method based on two-way LSTM, CNN and CRF |
US20180329897A1 (en) * | 2016-10-26 | 2018-11-15 | Deepmind Technologies Limited | Processing text sequences using neural networks |
US20190018836A1 (en) * | 2016-04-12 | 2019-01-17 | Huawei Technologies Co., Ltd. | Word Segmentation method and System for Language Text |
CN109255119A (en) * | 2018-07-18 | 2019-01-22 | 五邑大学 | A kind of sentence trunk analysis method and system based on the multitask deep neural network for segmenting and naming Entity recognition |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU4869601A (en) * | 2000-03-20 | 2001-10-03 | Robert J. Freeman | Natural-language processing system using a large corpus |
JP2008140117A (en) | 2006-12-01 | 2008-06-19 | National Institute Of Information & Communication Technology | Apparatus for segmenting chinese character sequence to chinese word sequence |
CN103020034A (en) | 2011-09-26 | 2013-04-03 | 北京大学 | Chinese words segmentation method and device |
CN104268200A (en) * | 2013-09-22 | 2015-01-07 | 中科嘉速(北京)并行软件有限公司 | Unsupervised named entity semantic disambiguation method based on deep learning |
CN108536679B (en) * | 2018-04-13 | 2022-05-20 | 腾讯科技(成都)有限公司 | Named entity recognition method, device, equipment and computer readable storage medium |
CN109086267B (en) * | 2018-07-11 | 2022-07-26 | 南京邮电大学 | Chinese word segmentation method based on deep learning |
CN110222329B (en) * | 2019-04-22 | 2023-11-24 | 平安科技(深圳)有限公司 | Chinese word segmentation method and device based on deep learning |
2019
- 2019-04-22 CN CN201910322127.8A patent/CN110222329B/en active Active
- 2019-11-14 JP JP2021563188A patent/JP7178513B2/en active Active
- 2019-11-14 SG SG11202111464WA patent/SG11202111464WA/en unknown
- 2019-11-14 WO PCT/CN2019/118259 patent/WO2020215694A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2020215694A1 (en) | 2020-10-29 |
JP7178513B2 (en) | 2022-11-25 |
JP2022530447A (en) | 2022-06-29 |
CN110222329B (en) | 2023-11-24 |
SG11202111464WA (en) | 2021-11-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110222329A (en) | A kind of Chinese word cutting method and device based on deep learning | |
CN107392973B (en) | Pixel-level handwritten Chinese character automatic generation method, storage device and processing device | |
CN108509413A (en) | Digest extraction method, device, computer equipment and storage medium | |
CN110287961A (en) | Chinese word cutting method, electronic device and readable storage medium storing program for executing | |
CN106951825A (en) | A kind of quality of human face image assessment system and implementation method | |
CN108733644B (en) | A kind of text emotion analysis method, computer readable storage medium and terminal device | |
CN106997474A (en) | A kind of node of graph multi-tag sorting technique based on deep learning | |
CN108595585A (en) | Sample data sorting technique, model training method, electronic equipment and storage medium | |
CN110222184A (en) | A kind of emotion information recognition methods of text and relevant apparatus | |
CN107480143A (en) | Dialogue topic dividing method and system based on context dependence | |
CN109829162A (en) | A kind of text segmenting method and device | |
CN110134961A (en) | Processing method, device and the storage medium of text | |
CN107958230A (en) | Facial expression recognizing method and device | |
CN109829478A (en) | One kind being based on the problem of variation self-encoding encoder classification method and device | |
CN110245353B (en) | Natural language expression method, device, equipment and storage medium | |
CN110472268A (en) | A kind of bridge monitoring data modality recognition methods and device | |
CN113220876A (en) | Multi-label classification method and system for English text | |
CN113011532A (en) | Classification model training method and device, computing equipment and storage medium | |
CN109101984A (en) | A kind of image-recognizing method and device based on convolutional neural networks | |
CN109033078B (en) | The recognition methods of sentence classification and device, storage medium, processor | |
CN113722477B (en) | Internet citizen emotion recognition method and system based on multitask learning and electronic equipment | |
CN106855852A (en) | The determination method and device of sentence emotion | |
CN110245332A (en) | Chinese character code method and apparatus based on two-way length memory network model in short-term | |
CN110489552A (en) | A kind of microblog users suicide risk checking method and device | |
CN115906861A (en) | Statement emotion analysis method and device based on interaction aspect information fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||