CN110334110A - Natural language classification method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN110334110A (application CN201910449416.4A)
- Authority
- CN
- China
- Prior art keywords
- natural language
- text data
- input
- word
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
An embodiment of the invention discloses a natural language classification method, device, computer equipment and storage medium, wherein the method includes: acquiring natural language data input by a user, and converting the natural language data into corresponding text data; segmenting the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words; taking the words in the word segmentation result as input, training the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word; and inputting the word vector training result into a pre-trained neural network model for natural language classification to obtain a classification result for the natural language data. The invention provides a natural language classification method based on a detection model, which can accurately classify natural language queries, provide diversified database query modes, and improve the user experience.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to a natural language classification method, device, computer equipment and storage medium.
Background technique
At present, converting a spoken natural language query into a query statement that a computer can recognize typically means converting the natural language query into one specific computer query statement. As a result, some databases cannot recognize the converted statement. For example, if the query is converted into an SQL (Structured Query Language) query statement, a relational database can recognize the SQL query statement, but a graph database cannot. Traditional natural language query conversion therefore cannot meet market demand.
Summary of the invention
In view of this, embodiments of the present invention provide a natural language classification method, device, computer equipment and storage medium, which can accurately classify natural language queries, provide diversified database query modes, and improve the user experience.
In one aspect, an embodiment of the invention provides a natural language classification method, the method comprising:
acquiring natural language data input by a user, and converting the natural language data into corresponding text data;
segmenting the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words;
taking the words in the word segmentation result as input, training the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word; and
inputting the word vector training result into a pre-trained neural network model for natural language classification to obtain a classification result for the natural language data.
In another aspect, an embodiment of the invention provides a natural language classification device, the device comprising:
a conversion unit, configured to acquire natural language data input by a user and convert the natural language data into corresponding text data;
a segmentation unit, configured to segment the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words;
a training unit, configured to take the words in the word segmentation result as input and train the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word; and
a classification unit, configured to input the word vector training result into a pre-trained neural network model for natural language classification to obtain a classification result for the natural language data.
In another aspect, an embodiment of the invention also provides a computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the natural language classification method described above when executing the computer program.
In yet another aspect, an embodiment of the invention also provides a computer-readable storage medium storing one or more computer programs, the one or more computer programs being executable by one or more processors to implement the natural language classification method described above.
Embodiments of the present invention provide a natural language classification method, device, computer equipment and storage medium, wherein the method includes: acquiring natural language data input by a user, and converting the natural language data into corresponding text data; segmenting the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words; taking the words in the word segmentation result as input, training the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word; and inputting the word vector training result into a pre-trained neural network model for natural language classification to obtain a classification result for the natural language data. The invention provides a natural language classification method based on a detection model, which can accurately classify natural language queries, provide diversified database query modes, and improve the user experience.
Detailed description of the invention
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application scenario of a natural language classification method provided by an embodiment of the present invention;
Fig. 2 is a schematic flow diagram of a natural language classification method provided by an embodiment of the present invention;
Fig. 3 is another schematic flow diagram of a natural language classification method provided by an embodiment of the present invention;
Fig. 4 is another schematic flow diagram of a natural language classification method provided by an embodiment of the present invention;
Fig. 5 is a schematic block diagram of a natural language classification device provided by an embodiment of the present invention;
Fig. 6 is another schematic block diagram of a natural language classification device provided by an embodiment of the present invention;
Fig. 7 is another schematic block diagram of a natural language classification device provided by an embodiment of the present invention;
Fig. 8 is another schematic block diagram of a natural language classification device provided by an embodiment of the present invention;
Fig. 9 is a schematic diagram of the structure of a computer device provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described below clearly and completely in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
It should be understood that when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in the description of the invention and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to Fig. 1 and Fig. 2. Fig. 1 is a schematic diagram of an application scenario of a natural language classification method provided by an embodiment of the present invention, and Fig. 2 is a schematic flow diagram of a natural language classification method provided by an embodiment of the present invention. The natural language classification method is applied in a server or a terminal, where the terminal can be an electronic device with a communication function, such as a smart phone, tablet computer, laptop, desktop computer, personal digital assistant or wearable device. As one application, as shown in Fig. 1, the natural language classification method is applied in a server 10, which can be a server in a distributed service platform; the server 10 executes a natural language classification instruction and feeds the execution result back to a terminal 20.
It should be noted that only one terminal 20 is illustrated in Fig. 1; in actual operation, the server 10 can also feed the execution result back to multiple terminals 20.
Please refer to Fig. 2, a schematic flow diagram of a natural language classification method provided by an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps S101 to S104.
S101: acquire natural language data input by a user, and convert the natural language data into corresponding text data.
In the embodiment of the present invention, the natural language data refers to a spoken natural language query against a database, for example the spoken query: "What is the net profit of insurance this year?". More specifically, the natural language data input by the user can be collected through a microphone in the terminal, and the collected natural language data can then be converted into corresponding text data.
Further, as shown in Fig. 3, the step of converting the natural language data into corresponding text data specifically includes steps S201 to S204:
S201: collect the natural language data input by the user using a microphone;
S202: digitize the natural language data to obtain a speech signal;
S203: extract the acoustic features of the speech signal;
S204: input the acoustic features into a predetermined acoustic model for decoding, so as to generate the text data.
In this embodiment, when converting the natural language data into corresponding text data, the natural language data is a speech signal, and speech information is an analog signal; it is therefore necessary to process the analog speech signal, digitize it, and extract the acoustic features of the speech signal. Methods such as Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC) or the Multimedia Content Description Interface (MPEG-7) can be used to extract the acoustic features. The acoustic features can then be input into an acoustic model for decoding to obtain the text data corresponding to the speech signal. This is the process of converting the natural language data into corresponding text data.
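The S201 to S204 pipeline can be sketched as follows. This is an illustrative stand-in rather than the patent's implementation: a real system would use MFCC or LPCC extraction and a trained acoustic model, while here the feature step is approximated by a framed log power spectrum over a simulated waveform, and the decoding step is omitted.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a digitized waveform into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def log_spectral_features(signal, n_coeffs=13):
    """Toy stand-in for MFCC extraction: log power spectrum per frame,
    truncated to the first n_coeffs coefficients."""
    frames = frame_signal(signal)
    windowed = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    return np.log(power + 1e-10)[:, :n_coeffs]

# Simulated 1-second utterance sampled at 16 kHz (S202: digitized speech signal)
rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)
feats = log_spectral_features(waveform)   # S203: acoustic features
print(feats.shape)  # (98, 13): one 13-dim feature vector per 10 ms frame
```

In the full pipeline, `feats` would be passed to the predetermined acoustic model (S204), which decodes the frame sequence into text.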
S102: segment the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words.
In the embodiment of the present invention, segmenting the text data comprises: segmenting the text data using a segmentation method based on a probability statistics model. For example, let C = C1C2...Cm be the Chinese character string corresponding to the text data to be segmented, let W = W1W2...Wn be a segmentation result, and let Wa, Wb, ..., Wk be all possible segmentation schemes of C. The segmentation model based on probability statistics then seeks the target word string W satisfying P(W|C) = MAX(P(Wa|C), P(Wb|C), ..., P(Wk|C)), that is, the word string W whose estimated probability under the model is maximal, and takes W as the word segmentation result of the text data. For example, for the text data "What is the net profit of insurance this year?", the word segmentation result obtained by the above segmentation model is: "this year", "insurance", "net profit", "is", "how much", "?".
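The maximum-probability criterion above can be illustrated with a short dynamic-programming sketch. The word list and probability values below are hypothetical, and English compounds stand in for the Chinese character string C; the patent itself does not specify a dictionary or how the probabilities are estimated.

```python
import math

# Hypothetical unigram log-probabilities standing in for corpus estimates of P(W).
LOGP = {w: math.log(p) for w, p in {
    "this": 0.04, "year": 0.05, "thisyear": 0.08,
    "net": 0.03, "profit": 0.03, "netprofit": 0.06,
    "of": 0.10, "insurance": 0.05,
}.items()}

def segment(chars, max_word=12):
    """Return the segmentation W maximizing the unigram score, a proxy for
    P(W|C) = MAX(P(Wa|C), ..., P(Wk|C)), via dynamic programming."""
    n = len(chars)
    best = [(-math.inf, [])] * (n + 1)   # best[i] = (score, words) for chars[:i]
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word), end):
            word = chars[start:end]
            if word in LOGP and best[start][0] > -math.inf:
                score = best[start][0] + LOGP[word]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[n][1]

print(segment("netprofitofinsurance"))  # ['netprofit', 'of', 'insurance']
```

The longer entries win here because one word with probability 0.06 scores higher than two words with probability 0.03 each, which is exactly the MAX criterion at work.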
S103: taking the words in the word segmentation result as input, train the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word.
In the embodiment of the present invention, the preset word vector model is based on the word2vec deep learning model. In this embodiment, the specific training process is to train the word segmentation result of the text data using the word2vec deep learning model in the Gensim toolkit for Python, taking the words in the word segmentation result as input and the word vector training result as output, the word vector result including the vector representation corresponding to each word.
Further, as shown in Fig. 4, step S103 includes steps S301 to S302:
S301: input the word segmentation result of the text data into the Python toolkit Gensim;
S302: train the word segmentation result of the text data using the word2vec-based deep learning model in the Python toolkit Gensim, so as to obtain the output result.
In this embodiment, the following parameters are set for the word2vec deep learning model in the Python toolkit Gensim:
After training with the word2vec model in the Python toolkit Gensim is completed, a vectors.bin file is obtained; vectors.bin contains each word of the text data and the word vector corresponding to each word. In this embodiment, the dimension of the word vectors is preset using the size parameter in the Python toolkit Gensim.
S104: input the word vector training result into the pre-trained neural network model for natural language classification, and obtain a classification result for the natural language data.
In the embodiment of the present invention, the neural network model is:

Ot = g(V·St)
St = f(U·Xt + St-1)

where Xt is the value of the input layer of the recurrent neural network, St and St-1 are the values of the hidden layer of the recurrent neural network at times t and t-1, Ot is the value of the output layer of the recurrent neural network, U is the first weight matrix (input layer to hidden layer), V is the second weight matrix (hidden layer to output layer), g(·) is a nonlinear activation function, and f(·) is the softmax function.
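The recurrence can be exercised with a small NumPy forward pass. The shapes and random weights below are arbitrary stand-ins; the sketch follows the equations as written, applying f(·) to the hidden state and g(·) to the output, with tanh chosen as the unspecified nonlinear activation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_forward(X, U, V, g=np.tanh, f=softmax):
    """Forward pass of the recurrent classifier:
       S_t = f(U X_t + S_{t-1}),  O_t = g(V S_t)."""
    S = np.zeros(U.shape[0])
    outputs = []
    for x_t in X:
        S = f(U @ x_t + S)        # hidden state update
        outputs.append(g(V @ S))  # output for this time step
    return np.array(outputs)

rng = np.random.default_rng(1)
U = rng.standard_normal((4, 6)) * 0.1   # first weight matrix: input -> hidden
V = rng.standard_normal((3, 4)) * 0.1   # second weight matrix: hidden -> output
X = rng.standard_normal((5, 6))         # five 6-dimensional word vectors
out = rnn_forward(X, U, V)
print(out.shape)  # (5, 3): one output per input word vector
```

In a conventional RNN classifier the roles are usually reversed (tanh on the hidden state, softmax on the output); swapping `g` and `f` in the call would give that variant.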
It should be noted that before step S104, the neural network model for natural language classification needs to be trained in advance. The training process is as follows: input historical word vector data into a pre-constructed screening model for part-of-speech tagging, and obtain the part-of-speech probability corresponding to each historical word vector; if the part-of-speech probability corresponding to a historical word vector is greater than or equal to a preset first probability, mark the corresponding historical word vector as a word vector of the target part of speech; if the part-of-speech probability corresponding to a word vector is greater than or equal to a preset second probability, mark the corresponding historical word vector as a word vector of the condition part of speech; if the part-of-speech probability corresponding to a word vector is greater than or equal to a preset third probability, mark the corresponding historical word vector as a word vector of the time part of speech. More specifically, in this embodiment, the screening model is constructed by model training on the historical word vectors according to the naive Bayes algorithm; the screening model is used to judge whether an input word vector is a word vector of the target part of speech, the condition part of speech or the time part of speech.
When constructing the screening model for part-of-speech tagging, multiple word vectors contained in the training set are taken as the input of the screening model, and the part of speech corresponding to each word vector is taken as the output of the screening model, so that the screening model is obtained by training. The naive Bayes model used is:

P(ck) = Nck / N,  P(tj | ck) = (Tjk + 1) / (Σtj∈V Tjk + |V|)

where Nck denotes the number of documents of class ck in the training set, N denotes the total number of word vectors in the training set, Tjk denotes the number of occurrences of term tj in class ck, and V is the term set over all classes. Using the above screening model as a classifier for the part of speech of word vectors, it can be judged whether an input word vector is a word vector of the target part of speech, the condition part of speech or the time part of speech. For example, each word vector is input into the naive Bayes model; when the probability that the data belongs to the vector class of the target part of speech is greater than or equal to 50% (i.e. the first probability is set to 50%), the data is regarded as a vector of the target part of speech; when the part-of-speech probability of a word vector for the condition part of speech class is greater than or equal to 50% (i.e. the second probability is set to 50%), the word vector is marked as a word vector of the condition part of speech; when the part-of-speech probability of a word vector for the time part of speech class is greater than or equal to 50% (i.e. the third probability is set to 50%), the word vector is marked as a word vector of the time part of speech.
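The naive Bayes screening model with the 50% thresholds can be sketched as follows. The tiny training set, token strings, and class names are hypothetical stand-ins for the historical word vector data described above; counts play the roles of Nck, Tjk and V, with Laplace smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical labelled tokens for the three part-of-speech classes in the text.
train = [
    (["net profit", "premium"], "target"),
    (["revenue", "net profit"], "target"),
    (["insurance", "life insurance"], "condition"),
    (["this year", "last year"], "time"),
    (["this year"], "time"),
]

class_count = Counter(label for _, label in train)          # N_ck per class
term_count = defaultdict(Counter)                           # T_jk per class
for tokens, label in train:
    term_count[label].update(tokens)
vocab = {t for tokens, _ in train for t in tokens}          # V

def posterior(token):
    """P(ck | tj) via Bayes' rule with Laplace smoothing, normalized over classes."""
    scores = {}
    for c in class_count:
        prior = class_count[c] / sum(class_count.values())          # N_ck / N
        likelihood = (term_count[c][token] + 1) / (
            sum(term_count[c].values()) + len(vocab))               # (T_jk+1)/(ΣT+|V|)
        scores[c] = prior * likelihood
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def label(token, threshold=0.5):
    """Tag the token with the class whose posterior clears the 50% threshold."""
    probs = posterior(token)
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None

print(label("net profit"), label("this year"))  # target time
```

A token whose best posterior falls below the threshold is left untagged (`None`), matching the screening behaviour where only vectors clearing the preset probability are marked.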
The word vector result after part-of-speech tagging is taken as the input of the neural network, and the corresponding word vector classification result is taken as the output of the recurrent neural network, which is trained to obtain the neural network model. By taking the word segmentation result of the historical word vectors after part-of-speech tagging as the input of the neural network, and the corresponding word vector classification results as the output of the recurrent neural network, the first weight matrix and the second weight matrix can be obtained by training, and the neural network model thus obtained serves as the model for subsequent word vector classification. After the pre-trained neural network model is obtained, the user's word vector training result is input into the pre-trained neural network model, and the user's word vectors are classified quickly and intelligently according to the preset neural network model. For example, for the text data "What is the net profit of insurance this year?", 6-dimensional word vectors are obtained after segmentation and vector representation; after the 6-dimensional word vectors are input into the pre-trained neural network model, the output classification results are Account - net profit (Account indicates a target word), Entity - life insurance (Entity indicates a condition word) and NTR - this year (a time word).
As can be seen from the above, the embodiment of the present invention acquires natural language data input by a user and converts the natural language data into corresponding text data; segments the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words; takes the words in the word segmentation result as input and trains the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word; and inputs the word vector training result into a pre-trained neural network model for natural language classification to obtain a classification result for the natural language data. The invention provides a natural language classification method based on a detection model, which can accurately classify natural language queries, provide diversified database query modes, and improve the user experience.
Please refer to Fig. 5. Corresponding to the above natural language classification method, an embodiment of the present invention also proposes a natural language classification device. The device 100 includes: a conversion unit 101, a segmentation unit 102, a training unit 103 and a classification unit 104.
The conversion unit 101 is configured to acquire natural language data input by a user and convert the natural language data into corresponding text data;
the segmentation unit 102 is configured to segment the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words;
the training unit 103 is configured to take the words in the word segmentation result as input and train the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word;
the classification unit 104 is configured to input the word vector training result into a pre-trained neural network model for natural language classification to obtain a classification result for the natural language data.
As can be seen from the above, the embodiment of the present invention acquires natural language data input by a user and converts the natural language data into corresponding text data; segments the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words; takes the words in the word segmentation result as input and trains the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word; and inputs the word vector training result into a pre-trained neural network model for natural language classification to obtain a classification result for the natural language data. The invention provides a natural language classification method based on a detection model, which can accurately classify natural language queries, provide diversified database query modes, and improve the user experience.
Please refer to Fig. 6. The conversion unit 101 comprises:
a collection unit 101a, configured to collect the natural language data input by the user using a microphone;
a processing unit 101b, configured to digitize the natural language data to obtain a speech signal;
an extraction unit 101c, configured to extract the acoustic features of the speech signal;
a generation unit 101d, configured to input the acoustic features into a predetermined acoustic model for decoding, so as to generate the text data.
Please refer to Fig. 7. The segmentation unit 102 comprises:
a segmentation subunit 102a, configured to segment the text data using a segmentation method based on a probability statistics model.
Please refer to Fig. 8. The training unit 103 comprises:
an input unit 103a, configured to input the word segmentation result of the text data into the Python toolkit Gensim;
a training subunit 103b, configured to train the word segmentation result of the text data using the word2vec-based deep learning model in the Python toolkit Gensim, so as to obtain the output result.
The above natural language classification device corresponds one-to-one with the above natural language classification method; its specific principle and process are the same as those of the above embodiments and are not repeated here.
The above natural language classification device can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Fig. 9.
Fig. 9 is a schematic diagram of the structure of a computer device of the present invention. The device can be a terminal or a server, where the terminal can be an electronic device with a communication function and a speech input function, such as a smart phone, tablet computer, laptop, desktop computer, personal digital assistant or wearable device, and the server can be an independent server or a server cluster composed of multiple servers. Referring to Fig. 9, the computer device 500 includes a processor 502, a non-volatile storage medium 503, an internal memory 504 and a network interface 505 connected through a system bus 501. The non-volatile storage medium 503 of the computer device 500 can store an operating system 5031 and a computer program 5032; when the computer program 5032 is executed, the processor 502 can be made to execute a natural language classification method. The processor 502 of the computer device 500 is used to provide computing and control capability and supports the operation of the entire computer device 500. The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503; when the computer program is executed by the processor, the processor 502 can be made to execute a natural language classification method. The network interface 505 of the computer device 500 is used for network communication. Those skilled in the art can understand that the structure shown in Fig. 9 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.
When the processor 502 executes the computer program, the following operations are implemented:
acquiring natural language data input by a user, and converting the natural language data into corresponding text data;
segmenting the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words;
taking the words in the word segmentation result as input, training the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word;
inputting the word vector training result into a pre-trained neural network model for natural language classification, and obtaining a classification result for the natural language data.
In one embodiment, acquiring the natural language data input by the user and converting the natural language data into corresponding text data comprises:
collecting the natural language data input by the user using a microphone;
digitizing the natural language data to obtain a speech signal;
extracting the acoustic features of the speech signal;
inputting the acoustic features into a predetermined acoustic model for decoding, so as to generate the text data.
In one embodiment, segmenting the text data comprises:
segmenting the text data using a segmentation method based on a probability statistics model.
In one embodiment, training the word segmentation result of the text data using a preset word vector model to obtain a word vector training result comprises:
inputting the word segmentation result of the text data into the Python toolkit Gensim;
training the word segmentation result of the text data using the word2vec-based deep learning model in the Python toolkit Gensim, so as to obtain the output result.
In one embodiment, the neural network model is:

Ot = g(V·St)
St = f(U·Xt + St-1)

where Xt is the value of the input layer of the recurrent neural network, St and St-1 are the values of the hidden layer of the recurrent neural network at times t and t-1, Ot is the value of the output layer of the recurrent neural network, U is the first weight matrix (input layer to hidden layer), V is the second weight matrix (hidden layer to output layer), g(·) is a nonlinear activation function, and f(·) is the softmax function.
Those skilled in the art will understand that the embodiment of the computer device shown in Fig. 9 does not constitute a limitation on the specific composition of the computer device; in other embodiments, the computer device may include more or fewer components than illustrated, combine certain components, or have a different component arrangement. For example, in some embodiments, the computer device only includes a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in Fig. 9 and are not repeated here.
The present invention provides a computer-readable storage medium storing one or more computer programs, the one or more computer programs being executable by one or more processors to perform the following steps:
acquiring natural language data input by a user, and converting the natural language data into corresponding text data;
segmenting the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words;
taking the words in the word segmentation result as input, training the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including the vector representation corresponding to each word;
inputting the word vector training result into a pre-trained neural network model for natural language classification, and obtaining a classification result for the natural language data.
In one embodiment, the acquiring of the natural language data input by the user and converting the natural language data into corresponding text data comprises:
collecting the natural language data input by the user using a microphone;
performing digitization processing on the natural language data to obtain a speech signal;
extracting acoustic features of the speech signal;
inputting the acoustic features into a predetermined acoustic model for decoding, so as to generate the text data.
In one embodiment, the segmenting of the text data comprises:
segmenting the text data using a word segmentation method based on a probability statistics model.
In one embodiment, the training of the word segmentation result of the text data using a preset word vector model to obtain a word vector training result comprises:
inputting the word segmentation result of the text data into the Python toolkit Gensim;
training the word segmentation result of the text data using the word2vec deep learning model in the Python toolkit Gensim, so as to obtain the output result.
In one embodiment, the neural network model is:
O_t = g(V · S_t)
S_t = f(U · X_t + S_{t-1});
where X_t is the value of the input layer of the recurrent neural network, S_t and S_{t-1} are values of the hidden layer of the recurrent neural network, O_t is the value of the output layer of the recurrent neural network, U is a first weight matrix from the input layer to the hidden layer, V is a second weight matrix from the hidden layer to the output layer, g(·) is a nonlinear activation function, and f(·) is the softmax function.
The aforementioned storage medium of the present invention includes various media capable of storing program code, such as a magnetic disk, an optical disc, and a read-only memory (Read-Only Memory, ROM).
The units in all embodiments of the present invention may be implemented by a general-purpose integrated circuit, such as a CPU (Central Processing Unit), or by an ASIC (Application Specific Integrated Circuit).
The steps in the natural language classification method of the embodiments of the present invention may be reordered, merged, or deleted according to actual needs. The units in the natural language classification apparatus of the embodiments of the present invention may be merged, divided, or deleted according to actual needs.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and such modifications or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A natural language classification method, characterized in that the method comprises:
acquiring natural language data input by a user, and converting the natural language data into corresponding text data;
segmenting the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words;
taking the words in the word segmentation result as input, training the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including a vector representation corresponding to each word;
inputting the word vector training result into a pre-trained neural network model for natural language classification to obtain a classification result for the natural language data.
2. The method of claim 1, characterized in that the acquiring of the natural language data input by the user and converting the natural language data into corresponding text data comprises:
collecting the natural language data input by the user using a microphone;
performing digitization processing on the natural language data to obtain a speech signal;
extracting acoustic features of the speech signal;
inputting the acoustic features into a predetermined acoustic model for decoding, so as to generate the text data.
3. The method of claim 1, characterized in that the segmenting of the text data comprises:
segmenting the text data using a word segmentation method based on a probability statistics model.
4. The method of claim 1, characterized in that the taking of the words in the word segmentation result as input and training the word segmentation result of the text data using a preset word vector model to obtain an output result including a vector representation corresponding to each word comprises:
inputting the word segmentation result of the text data into the Python toolkit Gensim;
training the word segmentation result of the text data using the word2vec deep learning model in the Python toolkit Gensim, so as to obtain the output result.
5. The method of claim 1, characterized in that the neural network model is:
O_t = g(V · S_t)
S_t = f(U · X_t + S_{t-1});
where X_t is the value of the input layer of the recurrent neural network, S_t and S_{t-1} are values of the hidden layer of the recurrent neural network, O_t is the value of the output layer of the recurrent neural network, U is a first weight matrix from the input layer to the hidden layer, V is a second weight matrix from the hidden layer to the output layer, g(·) is a nonlinear activation function, and f(·) is the softmax function.
6. A natural language classification apparatus, characterized in that the apparatus comprises:
a conversion unit, configured to acquire natural language data input by a user and convert the natural language data into corresponding text data;
a word segmentation unit, configured to segment the text data to obtain a word segmentation result of the text data, the word segmentation result including one or more words;
a training unit, configured to take the words in the word segmentation result as input and train the word segmentation result of the text data using a preset word vector model to obtain an output result, the output result including a vector representation corresponding to each word;
a classification unit, configured to input the word vector training result into a pre-trained neural network model for natural language classification to obtain a classification result for the natural language data.
7. The apparatus of claim 6, characterized in that the conversion unit comprises:
a collection unit, configured to collect the natural language data input by the user using a microphone;
a processing unit, configured to perform digitization processing on the natural language data to obtain a speech signal;
an extraction unit, configured to extract acoustic features of the speech signal;
a generation unit, configured to input the acoustic features into a predetermined acoustic model for decoding, so as to generate the text data.
8. The apparatus of claim 6, characterized in that the word segmentation unit comprises:
a word segmentation subunit, configured to segment the text data using a word segmentation method based on a probability statistics model.
9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the natural language classification method of any one of claims 1-5.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more computer programs, the one or more computer programs being executable by one or more processors to implement the natural language classification method of any one of claims 1-5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910449416.4A CN110334110A (en) | 2019-05-28 | 2019-05-28 | Natural language classification method, device, computer equipment and storage medium |
PCT/CN2019/118236 WO2020238061A1 (en) | 2019-05-28 | 2019-11-14 | Natural language classification method and apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910449416.4A CN110334110A (en) | 2019-05-28 | 2019-05-28 | Natural language classification method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110334110A true CN110334110A (en) | 2019-10-15 |
Family
ID=68140162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910449416.4A Pending CN110334110A (en) | 2019-05-28 | 2019-05-28 | Natural language classification method, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110334110A (en) |
WO (1) | WO2020238061A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111177370A (en) * | 2019-12-03 | 2020-05-19 | 北京工商大学 | Algorithm for natural language processing |
CN111191449A (en) * | 2019-12-26 | 2020-05-22 | 航天信息股份有限公司 | Tax feedback information processing method and device |
CN111209297A (en) * | 2019-12-31 | 2020-05-29 | 深圳云天励飞技术有限公司 | Data query method and device, electronic equipment and storage medium |
CN112000803A (en) * | 2020-07-28 | 2020-11-27 | 北京小米松果电子有限公司 | Text classification method and device, electronic equipment and computer readable storage medium |
WO2020238061A1 (en) * | 2019-05-28 | 2020-12-03 | 平安科技(深圳)有限公司 | Natural language classification method and apparatus, computer device, and storage medium |
CN112350908A (en) * | 2020-11-10 | 2021-02-09 | 珠海格力电器股份有限公司 | Control method and device of intelligent household equipment |
CN113283232A (en) * | 2021-05-31 | 2021-08-20 | 支付宝(杭州)信息技术有限公司 | Method and device for automatically analyzing private information in text |
CN111209297B (en) * | 2019-12-31 | 2024-05-03 | 深圳云天励飞技术有限公司 | Data query method, device, electronic equipment and storage medium |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112735376A (en) * | 2020-12-29 | 2021-04-30 | 竹间智能科技(上海)有限公司 | Self-learning platform |
CN113051875B (en) * | 2021-03-22 | 2024-02-02 | 北京百度网讯科技有限公司 | Training method of information conversion model, and text information conversion method and device |
CN113360602A (en) * | 2021-06-22 | 2021-09-07 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for outputting information |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868184A (en) * | 2016-05-10 | 2016-08-17 | 大连理工大学 | Chinese name recognition method based on recurrent neural network |
CN106649561A (en) * | 2016-11-10 | 2017-05-10 | 复旦大学 | Intelligent question-answering system for tax consultation service |
CN107229684A (en) * | 2017-05-11 | 2017-10-03 | 合肥美的智能科技有限公司 | Statement classification method, system, electronic equipment, refrigerator and storage medium |
CN108124065A (en) * | 2017-12-05 | 2018-06-05 | 浙江鹏信信息科技股份有限公司 | A kind of method junk call content being identified with disposal |
CN109471937A (en) * | 2018-10-11 | 2019-03-15 | 平安科技(深圳)有限公司 | A kind of file classification method and terminal device based on machine learning |
CN109492157A (en) * | 2018-10-24 | 2019-03-19 | 华侨大学 | Based on RNN, the news recommended method of attention mechanism and theme characterizing method |
US20190088251A1 (en) * | 2017-09-18 | 2019-03-21 | Samsung Electronics Co., Ltd. | Speech signal recognition system and method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503236B (en) * | 2016-10-28 | 2020-09-11 | 北京百度网讯科技有限公司 | Artificial intelligence based problem classification method and device |
CN109101481B (en) * | 2018-06-25 | 2022-07-22 | 北京奇艺世纪科技有限公司 | Named entity identification method and device and electronic equipment |
CN110334110A (en) * | 2019-05-28 | 2019-10-15 | 平安科技(深圳)有限公司 | Natural language classification method, device, computer equipment and storage medium |
2019
- 2019-05-28: CN CN201910449416.4A patent/CN110334110A/en active Pending
- 2019-11-14: WO PCT/CN2019/118236 patent/WO2020238061A1/en active Application Filing
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020238061A1 (en) * | 2019-05-28 | 2020-12-03 | 平安科技(深圳)有限公司 | Natural language classification method and apparatus, computer device, and storage medium |
CN111177370A (en) * | 2019-12-03 | 2020-05-19 | 北京工商大学 | Algorithm for natural language processing |
CN111177370B (en) * | 2019-12-03 | 2023-08-11 | 北京工商大学 | Algorithm for natural language processing |
CN111191449A (en) * | 2019-12-26 | 2020-05-22 | 航天信息股份有限公司 | Tax feedback information processing method and device |
CN111209297A (en) * | 2019-12-31 | 2020-05-29 | 深圳云天励飞技术有限公司 | Data query method and device, electronic equipment and storage medium |
CN111209297B (en) * | 2019-12-31 | 2024-05-03 | 深圳云天励飞技术有限公司 | Data query method, device, electronic equipment and storage medium |
CN112000803A (en) * | 2020-07-28 | 2020-11-27 | 北京小米松果电子有限公司 | Text classification method and device, electronic equipment and computer readable storage medium |
CN112350908A (en) * | 2020-11-10 | 2021-02-09 | 珠海格力电器股份有限公司 | Control method and device of intelligent household equipment |
CN112350908B (en) * | 2020-11-10 | 2021-11-23 | 珠海格力电器股份有限公司 | Control method and device of intelligent household equipment |
CN113283232A (en) * | 2021-05-31 | 2021-08-20 | 支付宝(杭州)信息技术有限公司 | Method and device for automatically analyzing private information in text |
Also Published As
Publication number | Publication date |
---|---|
WO2020238061A1 (en) | 2020-12-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110334110A (en) | Natural language classification method, device, computer equipment and storage medium | |
WO2020232861A1 (en) | Named entity recognition method, electronic device and storage medium | |
JP7302022B2 (en) | A text classification method, apparatus, computer readable storage medium and text classification program. | |
US11321363B2 (en) | Method and system for extracting information from graphs | |
CN104485105B (en) | A kind of electronic health record generation method and electronic medical record system | |
CN109460737A (en) | A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network | |
CN107220235A (en) | Speech recognition error correction method, device and storage medium based on artificial intelligence | |
CN111222305A (en) | Information structuring method and device | |
CN109271493A (en) | A kind of language text processing method, device and storage medium | |
CN110363084A (en) | A kind of class state detection method, device, storage medium and electronics | |
CN107967250B (en) | Information processing method and device | |
CN111274797A (en) | Intention recognition method, device and equipment for terminal and storage medium | |
CN113032552B (en) | Text abstract-based policy key point extraction method and system | |
Tank et al. | Creation of speech corpus for emotion analysis in Gujarati language and its evaluation by various speech parameters. | |
CN105159927B (en) | Method and device for selecting subject term of target text and terminal | |
CN110309355A (en) | Generation method, device, equipment and the storage medium of content tab | |
CN110019556A (en) | A kind of topic news acquisition methods, device and its equipment | |
CN117313138A (en) | Social network privacy sensing system and method based on NLP | |
CN115169368B (en) | Machine reading understanding method and device based on multiple documents | |
CN110347696A (en) | Data transfer device, device, computer equipment and storage medium | |
CN110781327A (en) | Image searching method and device, terminal equipment and storage medium | |
CN113808577A (en) | Intelligent extraction method and device of voice abstract, electronic equipment and storage medium | |
CN113539234A (en) | Speech synthesis method, apparatus, system and storage medium | |
JP2000148770A (en) | Device and method for classifying question documents and record medium where program wherein same method is described is recorded | |
KR20220015129A (en) | Method and Apparatus for Providing Book Recommendation Service Based on Interactive Form |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||