CN110287236A - Data mining method, system and terminal device based on interview information

Info

Publication number: CN110287236A
Application number: CN201910553409.9A
Authority: CN
Inventors: 邓悦; 金戈; 徐亮
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-06-25
Filing date: 2019-06-25
Publication date: 2019-09-27
Anticipated expiration: 2039-06-25
Also published as: CN110287236B

Abstract

The present invention is applicable to the technical field of data processing and provides a data mining method, system and terminal device based on interview information. The method includes: obtaining a target corpus and arranging the target corpus into M sentences; establishing a convolutional neural network (CNN) model according to the sentences; obtaining the word vector matrix in the CNN model; editing the word vector matrix with position information to obtain a word vector matrix with position information, and training the CNN model with the word vector matrix with position information, so that the CNN model outputs the attribute words in the target corpus; and obtaining, according to the attribute words, the interviewees with a target attribute in the target corpus. The invention enables fast computation on, and accurate mining of, interview information, and screens out the interviewees who match the recruitment requirements.

Description

A data mining method, system and terminal device based on interview information

Technical field

The present invention relates to the technical field of data processing, and in particular to a data mining method, system and terminal device based on interview information.

Background art

In the era of big data, the management and use of data are particularly important. How effectively data is collected from many sources, and how well existing data is used, determine how much value can be obtained from a large amount of data. Most of the mass data that can be directly obtained and used is text data, and this data touches every industry in society. Faced with text data of such volume, text classification is the core means of processing it and is of great significance for the efficient management and use of text data. For example, in large-scale interviewing, the speech data of the interviewees is collected to obtain a corpus, and text classification is performed on that corpus; this can effectively extract key information and solve the problem of disorganized interview information, which helps HR retrieve the information needed and screen interviewees. Text classification tasks that involve a large corpus are usually solved with neural network algorithms.

At present, the neural networks suitable for text classification fall into two categories: CNN (Convolutional Neural Networks) and RNN (Recurrent Neural Network). A CNN runs faster than an RNN and needs fewer computing resources; however, when a corpus is trained with a CNN, the position information of the words in the corpus cannot be captured, which tends to reduce the accuracy of data mining on interview information. An RNN can capture position information, but it needs a large amount of computing resources and suffers from the vanishing gradient problem, which can introduce errors into the text classification result and likewise reduce the accuracy of data mining on interview information.

Summary of the invention

The main purpose of the present invention is to provide a data mining method, system and terminal device based on interview information, so as to solve the problem in the prior art that, when a neural network for text classification is applied to data mining of interview information, the resulting text classification has a large error, which reduces the accuracy of data mining on interview information.

To achieve the above object, a first aspect of the embodiments of the present invention provides a data mining method based on interview information, including:

obtaining a target corpus, and arranging the target corpus into M sentences, where M is a positive integer and the target corpus is interview information;

establishing a convolutional neural network (CNN) model according to the sentences;

obtaining the word vector matrix in the CNN model;

editing the word vector matrix with position information to obtain a word vector matrix with position information, and training the CNN model with the word vector matrix with position information, so that the CNN model outputs the attribute words in the target corpus, where the attribute words include words with a category attribute and words with a position attribute;

obtaining, according to the attribute words, the interviewees with a target attribute in the target corpus.

With reference to the first aspect of the present invention, in a first embodiment of the present invention, obtaining the target corpus and arranging the target corpus into M sentences includes:

setting a preset byte count;

truncating the target corpus according to the preset byte count to obtain M sentences with the same byte count.

With reference to the first aspect of the present invention, in a second embodiment of the present invention, establishing the convolutional neural network (CNN) model according to the sentences includes:

dividing the i-th sentence into N original words, and setting each original word as a K-dimensional vector, where i is a positive integer less than or equal to M, and K and N are positive integers;

establishing an N × K CNN model based on the i-th sentence.

With reference to the first embodiment and the second embodiment of the first aspect of the present invention, in a third embodiment of the present invention, when the byte count of a sentence is less than the preset byte count, the sentence is padded with 0;

when the dimension of an original word is less than K, it is padded with 0;

when the number of original words is less than N, it is padded with 0.

With reference to the first aspect of the present invention, in a fourth embodiment of the present invention, editing the word vector matrix with position information to obtain the word vector matrix with position information, and training the CNN model with the word vector matrix with position information so that the CNN model outputs the attribute words in the target corpus, includes:

according to the position information of the original words in the i-th sentence, encoding the original words into vectors and splicing the vectors onto the word vector layer, to obtain the word vector matrix with position information;

extracting, through the CNN model, the features in the word vector matrix with position information, and outputting the attribute words in the i-th sentence;

obtaining the attribute words in the target corpus according to the attribute words of the M sentences.

With reference to the fourth embodiment of the first aspect of the present invention, in a fifth embodiment of the present invention, encoding the original words into vectors according to their position information in the i-th sentence and splicing the vectors onto the word vector layer, to obtain the word vector matrix with position information, includes:

obtaining the type of the j-th original word in the i-th sentence, where j is a positive integer less than or equal to N;

when the type of the j-th original word is a verb, obtaining the position information of the j-th original word, the (j-1)-th original word and the (j+1)-th original word in the i-th sentence;

encoding the position information of the j-th original word, the (j-1)-th original word and the (j+1)-th original word in the i-th sentence into vectors and splicing the vectors onto the word vector layer, to obtain the word vector matrix with position information.

A second aspect of the embodiments of the present invention provides a data mining system based on interview information, including:

a corpus arrangement module, configured to obtain a target corpus and arrange the target corpus into M sentences, where M is a positive integer and the target corpus is interview information;

a model construction module, configured to establish a convolutional neural network (CNN) model according to the sentences;

a word vector acquisition module, configured to obtain the word vector matrix in the CNN model;

an attribute word acquisition module, configured to edit the word vector matrix with position information to obtain a word vector matrix with position information, and to train the CNN model with the word vector matrix with position information, so that the CNN model outputs the attribute words in the target corpus, where the attribute words include words with a category attribute and words with a position attribute;

a target screening module, configured to obtain, according to the attribute words, the interviewees with a target attribute in the target corpus.

With reference to the second aspect of the present invention, in a first embodiment of the present invention, the corpus arrangement module includes:

a byte count setting unit, configured to set a preset byte count;

a sentence truncation unit, configured to truncate the target corpus according to the preset byte count to obtain M sentences with the same byte count.

A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the method provided by the first aspect above are implemented.

A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the steps of the method provided by the first aspect above are implemented.

The embodiments of the present invention propose a data mining method based on interview information. The target corpus is divided into multiple sentences, a convolutional neural network (CNN) model is established from the sentences in the target corpus, and the word vector matrix of the CNN model is obtained. Position information is added to the word vector matrix to obtain a word vector matrix with position information, and the text classification task is carried out in the CNN model using that matrix, so that the CNN model outputs the attribute words in the target corpus that carry a position attribute and a category attribute. Data mining is then performed on the target corpus, that is, the interview information, according to the attribute words, so that interviewees with the target attribute matching the recruitment requirements are obtained. Because the text classification task uses the word vector matrix with position information, the position information can influence how the CNN model extracts features from words, so that position information is captured and the accuracy of text classification improves.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the data mining method based on interview information provided by Embodiment one of the present invention;

Fig. 2 is a schematic flowchart of the data mining method based on interview information provided by Embodiment two of the present invention;

Fig. 3 is a schematic flowchart of the detailed implementation of step S1042 in Fig. 2;

Fig. 4 is a schematic structural diagram of the data mining system based on interview information provided by Embodiment four of the present invention.

The realization of the object, the functional characteristics and the advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.

Detailed description of the embodiments

It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.

It should be noted that, in this document, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

Herein, suffixes such as "module", "component" or "unit" used to denote elements are only intended to facilitate the description of the present invention and have no specific meaning in themselves. Therefore, "module" and "component" may be used interchangeably.

In the following description, the serial numbers of the embodiments are for description only and do not represent the relative merits of the embodiments.

Embodiment one

As shown in Fig. 1, this embodiment of the present invention provides a data mining method based on interview information, which enables fast computation on and accurate mining of interview information and screens out the interviewees who match the recruitment requirements. In this embodiment, the data mining method based on interview information may include:

S101, obtaining a target corpus, and arranging the target corpus into M sentences.

Here, M is a positive integer, and the target corpus is interview information.

In step S101, the target corpus is a basic unit of a corpus collection; such a collection is usually represented as text data, so the target corpus also takes the form of text data.

In a specific application, the interview information may be based on the speech records of the interviewees, and may be obtained as follows: the interview of each interviewee is recorded, and the target corpus in text form is then obtained from the recording file; the target corpus includes the speech record of at least one interviewee.

In one embodiment, one implementation of step S101 may be:

setting a preset byte count;

truncating the target corpus according to the preset byte count to obtain M sentences with the same byte count.

In this implementation, optimizing how the corpus is arranged so that every sentence has the same byte count can improve the efficiency and accuracy of text classification.
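The fixed-byte truncation described above can be sketched as follows. This is a minimal illustration only: it assumes UTF-8 encoding and zero bytes as padding, and the function name `split_into_fixed_byte_sentences` and the sample text are hypothetical, not taken from the patent. Cutting on byte boundaries can split a multi-byte character, a case the patent text does not address.

```python
def split_into_fixed_byte_sentences(corpus: str, byte_len: int) -> list:
    """Cut the corpus into M chunks of exactly byte_len bytes each."""
    if byte_len <= 0:
        raise ValueError("byte_len must be a positive integer")
    # Assumption: UTF-8; the source does not name an encoding.
    data = corpus.encode("utf-8")
    chunks = [data[i:i + byte_len] for i in range(0, len(data), byte_len)]
    # Pad a short final chunk with zero bytes so every sentence has the same length.
    return [chunk.ljust(byte_len, b"\x00") for chunk in chunks]


# Hypothetical sample text: 35 bytes cut into sentences of 16 bytes, so M = 3.
sentences = split_into_fixed_byte_sentences("I study at the school of management", 16)
```

Here `len(sentences)` plays the role of M, and only the last sentence carries padding.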

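S102, establishing a convolutional neural network (CNN) model according to the sentences.

In one embodiment, one implementation of step S102 may be:

dividing the i-th sentence into N original words, and setting each original word as a K-dimensional vector, where i is a positive integer less than or equal to M, and K and N are positive integers;

establishing an N × K CNN model based on the i-th sentence.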

Here, the original words come from the sentence, that is, they are words extracted from the target corpus.

In a specific application, each original word is set as a K-dimensional vector, but among the N original words the dimension of some may be less than K and the dimension of others greater than K. In a CNN model, a hidden layer can project the initially encoded words into a lower-dimensional space, which reduces the dimension of the original words. The value of K can therefore be set to a maximum value, which keeps the matrix uniform without the dimension becoming so high that it slows computation; here, the maximum value means a value such that the dimension of any original word is less than K.

Similarly, to keep the sentences uniform, every sentence is divided into N original words, and the value of N is set to a maximum value, so that the number of original words obtained from the i-th sentence is always less than N.

Combining the above implementations of step S101 and step S102, this embodiment also proposes an implementation that unifies the dimension of the original words, the number of original words in a sentence and the preset byte count. The implementation is:

when the byte count of a sentence is less than the preset byte count, the sentence is padded with 0;

when the dimension of an original word is less than K, it is padded with 0;

when the number of original words is less than N, it is padded with 0.

In a specific application, missing dimensions, missing words in a sentence and missing bytes are padded with 0 so that the matrices are aligned. This makes it easier to establish a unified CNN model, reduces the computing resources needed for text classification and improves computational efficiency.
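The zero padding of word dimensions and word counts can be sketched as below. The helper `pad_to_matrix`, its parameter names and the toy vectors are hypothetical; the sketch assumes each original word already arrives as a list of at most K numbers.

```python
def pad_to_matrix(word_vectors, n_words, k_dims):
    """Return an n_words x k_dims matrix, filling every missing entry with 0.0."""
    if len(word_vectors) > n_words:
        raise ValueError("sentence has more than N original words")
    matrix = []
    for vec in word_vectors:
        if len(vec) > k_dims:
            raise ValueError("word vector has more than K dimensions")
        # A word whose dimension is less than K is padded with 0.
        matrix.append(list(vec) + [0.0] * (k_dims - len(vec)))
    # A sentence with fewer than N words gets all-zero rows.
    matrix.extend([0.0] * k_dims for _ in range(n_words - len(word_vectors)))
    return matrix


# Toy input: two words of dimension 2 and 1, padded to N = 3, K = 4.
matrix = pad_to_matrix([[0.1, 0.2], [0.3]], n_words=3, k_dims=4)
```

Every sentence then yields a matrix of the same N × K shape, which is what allows one CNN model to process all M sentences.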

S103, obtaining the word vector matrix in the CNN model.

The CNN model in steps S102 and S103 can be applied to image feature extraction and also to text classification. This embodiment performs a text classification task on the sentences in the target corpus, so a CNN model for text classification is established that can process each sentence; the word vector matrix that the CNN model outputs differs from sentence to sentence.

S104, editing the word vector matrix with position information to obtain a word vector matrix with position information, and training the CNN model with the word vector matrix with position information, so that the CNN model outputs the attribute words in the target corpus.

In step S104, the attribute words include words with a category attribute and words with a position attribute; the position information is the positional relationship between the words in a sentence, that is, the relationship between the original words described below.

If the target corpus includes the interview records of multiple interviewees, the attribute words in the target corpus come from multiple interviewees.

S105, obtaining, according to the attribute words, the interviewees with a target attribute in the target corpus.

In step S105, because the attribute words carry both a category attribute and a position attribute, they can accurately express the relevant information of an interviewee and reduce errors in semantic analysis. Therefore, when the target corpus is screened according to the attribute words for interviewees who match the recruitment requirements, the interviewees with the target attribute can be found accurately.
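The screening in step S105 can be pictured as a simple filter over the attribute words. The sketch below is an assumption about how such a filter might look: the `AttributeWord` record, the function `select_interviewees`, the category labels and the sample entries are all hypothetical, since the patent does not specify a data structure.

```python
from typing import NamedTuple


class AttributeWord(NamedTuple):
    interviewee: str  # hypothetical identifier of the speaker
    word: str
    category: str     # category attribute, e.g. "education"
    position: int     # position attribute: 1-based index in the sentence


def select_interviewees(attribute_words, target_category):
    """Return the interviewees having at least one attribute word in target_category."""
    selected = []
    for aw in attribute_words:
        if aw.category == target_category and aw.interviewee not in selected:
            selected.append(aw.interviewee)
    return selected


# Hypothetical attribute words output for two interviewees.
words = [
    AttributeWord("A", "school of management", "education", 3),
    AttributeWord("B", "welding", "vocational skills", 2),
]
matches = select_interviewees(words, "education")
```

With these sample entries, only interviewee A is returned for the target attribute "education".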

The data mining method based on interview information provided by this embodiment divides the target corpus into multiple sentences, establishes a convolutional neural network (CNN) model from the sentences in the target corpus, and obtains the word vector matrix of the CNN model. Position information is added to the word vector matrix to obtain a word vector matrix with position information, and the text classification task is carried out in the CNN model using that matrix, so that the CNN model outputs the attribute words in the target corpus that carry a position attribute and a category attribute. Data mining is then performed on the target corpus, that is, the interview information, according to the attribute words, so that interviewees with the target attribute matching the recruitment requirements are obtained. Because the text classification task uses the word vector matrix with position information, the position information can influence how the CNN model extracts features from words, so that position information is captured and the accuracy of text classification improves.

Embodiment two

As shown in Fig. 2, this embodiment describes the detailed implementation of step S104 in Embodiment one. One implementation of step S104 is:

S1041, according to the position information of the original words in the i-th sentence, encoding the original words into vectors and splicing the vectors onto the word vector layer, to obtain the word vector matrix with position information.

In step S1041, the word vector matrix is part of the output of the CNN model: when the i-th sentence is fed directly into the CNN model for training, each word vector in the resulting word vector matrix corresponds to one original word.

This embodiment also provides the process of encoding the position information of an original word as a vector and splicing it onto the word vector layer:

When the position information of an original word is encoded as a vector and spliced onto the word vector layer, the weights of the position encoding are all set to 1 and there is no bias term.

The process of encoding an original word as a vector is illustrated below:

On the basis of the traditional textcnn, the position information of each original word is encoded as a 100-dimensional vector and spliced onto the word vector layer, with the weights of the position encoding all set to 1:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where pos is the position of the word in the sentence, i indexes the dimensions of the position vector, and d is the preset dimension (the position vector above has 100 dimensions).
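The two formulas can be evaluated directly. The sketch below computes the position vector and splices it onto each word vector by plain concatenation, which is one reading of "weights all 1, no bias term" (the encoding is appended unscaled and unshifted). The function names, the 1-based position convention and the toy word vectors are assumptions made for illustration.

```python
import math


def position_encoding(pos, d=100):
    """Sinusoidal encoding of position pos as a d-dimensional vector (d must be even)."""
    if d % 2 != 0:
        raise ValueError("d must be even")
    vec = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        vec.append(math.sin(angle))  # dimension 2i
        vec.append(math.cos(angle))  # dimension 2i + 1
    return vec


def splice_positions(word_matrix, d=100):
    """Concatenate each word vector with the encoding of its 1-based position."""
    return [list(vec) + position_encoding(pos, d)
            for pos, vec in enumerate(word_matrix, start=1)]


# Toy example: two 2-dimensional word vectors, position vectors of dimension 4.
spliced = splice_positions([[0.1, 0.2], [0.3, 0.4]], d=4)
```

Each row of `spliced` then has K + d entries, so an N × K sentence matrix becomes N × (K + d) after splicing.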

S1042, extracting, through the CNN model, the features in the word vector matrix with position information, and outputting the attribute words in the i-th sentence.

In step S1042, an attribute word is a word that, after training of the CNN model is complete, has a category attribute and a position attribute; it reflects both the category and the position of the word, and the position attribute of a word influences its category attribute.

In a specific application, if the CNN model were trained directly on the word vector matrix formed from the original words, training could only obtain the position of each original word in the training matrix; it could not directly obtain the position information of the original word in the sentence, that is, the positional relationship between the words in the sentence.
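The text does not spell out how the CNN model extracts features. As a rough sketch only, assuming the usual TextCNN pattern (a kernel as wide as the matrix slides over windows of consecutive words, followed by ReLU and max-over-time pooling), one feature could be computed as below. The function `conv_max_pool`, the kernel values and the toy matrix are hypothetical and are not taken from the patent.

```python
def conv_max_pool(matrix, kernel, bias=0.0):
    """Slide a kernel over consecutive rows of matrix, apply ReLU, then max-pool."""
    h = len(kernel)  # number of consecutive words covered by the kernel
    if h > len(matrix):
        raise ValueError("kernel is taller than the sentence matrix")
    activations = []
    for start in range(len(matrix) - h + 1):
        window = matrix[start:start + h]
        total = bias + sum(
            w * x
            for k_row, m_row in zip(kernel, window)
            for w, x in zip(k_row, m_row)
        )
        activations.append(max(0.0, total))  # ReLU
    # Max-over-time pooling keeps the strongest response in the sentence.
    return max(activations)


# Toy 3-word sentence matrix and a kernel spanning 2 words.
matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_pool(matrix, kernel)
```

Because each row of the input already contains the spliced position vector, the kernel weights see position and word content together, which is how position can influence the extracted feature in this reading.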

S1043, obtaining the attribute words in the target corpus according to the attribute words of the M sentences.

Step S1043 is equivalent to repeating step S1042 M times, which yields at most M × N attribute words in the target corpus.

As shown in Fig. 3, this embodiment also shows an implementation of step S1042, which may include:

S10421, obtaining the type of the j-th original word in the i-th sentence, where j is a positive integer less than or equal to N.

In step S10421, in text classification technology the type of each original word in a sentence can be obtained directly; for example, a noun is marked /n, a verb /v, an adjective /adj and a preposition /vj. During CNN text classification, text that does not need to be classified, such as prepositions and adjectives, is usually filtered out automatically to reduce unnecessary text data.

S10422, when the type of the j-th original word is a verb, obtaining the position information of the j-th original word, the (j-1)-th original word and the (j+1)-th original word in the i-th sentence.

In step S10422, if the positions of the nouns before and after a verb are unclear and the sentence is trained directly through the CNN model, semantic errors result. For example, "school of management" is a noun in both "in / school of management / study" and "in / study / school of management", but its meaning is not the same in the two.

S10423, encoding the position information of the j-th original word, the (j-1)-th original word and the (j+1)-th original word in the i-th sentence into vectors and splicing the vectors onto the word vector layer, to obtain the word vector matrix with position information.

Through steps S10421 to S10423, the text data is screened before the CNN model extracts the features in the word vector matrix with position information, which improves data mining efficiency.
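The selection in steps S10421 and S10422 can be sketched as follows: find every verb and collect its position together with the positions of its immediate neighbours, skipping neighbours that do not exist. The function name `verb_context_positions`, the 1-based indexing and the pronoun tag "r" are assumptions; the tags "n", "v" and "vj" follow the markings listed above.

```python
def verb_context_positions(tagged_words):
    """Return the sorted 1-based positions of each verb and its immediate neighbours."""
    n = len(tagged_words)
    positions = set()
    for j, (_, tag) in enumerate(tagged_words, start=1):
        if tag == "v":
            for p in (j - 1, j, j + 1):
                if 1 <= p <= n:  # skip neighbours that do not exist
                    positions.add(p)
    return sorted(positions)


# The example sentence of this embodiment, in its original word order.
sentence = [("I", "r"), ("in", "vj"), ("school of management", "n"), ("study", "v")]
selected = verb_context_positions(sentence)
```

For this sentence the verb is the last word, so only positions 3 and 4 are selected; only those positions would then be encoded and spliced in step S10423.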

This embodiment also uses an interview scenario as an example to illustrate the effect in practice of the data mining method based on interview information provided by Embodiment one and by this embodiment.

Assume that the sentence "I study in the school of management" is sorted out from the corpus of interviewee A.

First, in step S1041, the sentence "I study in the school of management" is divided into 4 original words, in their original order: "I", "in", "school of management" and "study".

Then, step S1042 and the specific implementation provided by this embodiment are executed: the type of each word is obtained. "study" is a verb and there is no (j+1)-th original word, so the position information a of "school of management" and the position information b of "study" are obtained directly, giving a = 3 and b = 4. Position encoding is added to the original words in the word vector matrix according to the position information, after which the word vector matrix includes the original words with position information; assume they are written "school of management₃" and "study₄".

Finally, step S1043 is carried out: the CNN model extracts the features of the sentence "I study in the school of management". After the original word with position information, "school of management₃", is converted into an attribute word, its position attribute influences its category data. During feature extraction, "school of management₃" is not treated as a feature of "study" but as a feature of "school"; in the final classification result, "school of management₃" is classified as educational background rather than as vocational skills.

Embodiment three

As shown in Fig. 4, this embodiment provides a data mining system 40 based on interview information, which carries out the text classification task in the CNN model using the word vector matrix with position information, so that the position information influences how the CNN model extracts features from words; position information is thereby captured and the accuracy of text classification is improved. The data mining system 40 includes:

a corpus arrangement module 41, configured to obtain a target corpus and arrange the target corpus into M sentences, where M is a positive integer and the target corpus is interview information;

a model construction module 42, configured to establish a convolutional neural network (CNN) model according to the sentences;

a word vector acquisition module 43, configured to obtain the word vector matrix in the CNN model;

an attribute word acquisition module 44, configured to edit the word vector matrix with position information to obtain a word vector matrix with position information, and to train the CNN model with the word vector matrix with position information, so that the CNN model outputs the attribute words in the target corpus, where the attribute words include words with a category attribute and words with a position attribute;

a target screening module 45, configured to obtain, according to the attribute words, the interviewees with a target attribute in the target corpus.

In one embodiment, the corpus arrangement module 41 may include:

a byte count setting unit, configured to set a preset byte count;

a sentence truncation unit, configured to truncate the target corpus according to the preset byte count to obtain M sentences with the same byte count.

In one embodiment, the model construction module 42 may include:

an original word division unit, configured to divide the i-th sentence into N original words and set each original word as a K-dimensional vector, where i is a positive integer less than or equal to M, and K and N are positive integers;

a CNN model construction unit, configured to establish an N × K CNN model based on the i-th sentence.

An embodiment of the present invention also provides a terminal device including a memory, a processor and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the data mining method based on interview information described in Embodiment one are implemented.

An embodiment of the present invention also provides a storage medium. The storage medium is a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the data mining method based on interview information described in Embodiment one are implemented.

The embodiments described above are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims

1. A data mining method based on interview information, characterized in that it comprises:

obtaining a target corpus, and arranging the target corpus into M sentences, where M is a positive integer and the target corpus is interview information;

establishing a convolutional neural network (CNN) model according to the sentences;

obtaining the word vector matrix in the CNN model;

editing the word vector matrix with position information to obtain a word vector matrix with position information, and training the CNN model with the word vector matrix with position information, so that the CNN model outputs the attribute words in the target corpus, where the attribute words include words with a category attribute and words with a position attribute;

obtaining, according to the attribute words, the interviewees with a target attribute in the target corpus.

2. The data mining method based on interview information according to claim 1, characterized in that obtaining the target corpus and arranging the target corpus into M sentences comprises:

setting a preset byte count;

truncating the target corpus according to the preset byte count to obtain M sentences with the same byte count.

3. The data mining method based on interview information according to claim 1, characterized in that establishing the convolutional neural network (CNN) model according to the sentences comprises:

dividing the i-th sentence into N original words, and setting each original word as a K-dimensional vector, where i is a positive integer less than or equal to M, and K and N are positive integers;

establishing an N × K CNN model based on the i-th sentence.

4. The data mining method based on interview information according to claim 2 or 3, characterized in that, when the byte count of a sentence is less than the preset byte count, the sentence is padded with 0;

when the dimension of an original word is less than K, it is padded with 0;

when the number of original words is less than N, it is padded with 0.

5. The data mining method based on interview information according to claim 1, characterized in that editing the word vector matrix with position information to obtain the word vector matrix with position information, and training the CNN model with the word vector matrix with position information so that the CNN model outputs the attribute words in the target corpus, comprises:

according to the position information of the original words in the i-th sentence, encoding the original words into vectors and splicing the vectors onto the word vector layer, to obtain the word vector matrix with position information;

extracting, through the CNN model, the features in the word vector matrix with position information, and outputting the attribute words in the i-th sentence;

obtaining the attribute words in the target corpus according to the attribute words of the M sentences.

6. The data mining method based on interview information according to claim 5, characterized in that encoding the original words into vectors according to their position information in the i-th sentence and splicing the vectors onto the word vector layer, to obtain the word vector matrix with position information, comprises:

obtaining the type of the j-th original word in the i-th sentence, where j is a positive integer less than or equal to N;

when the type of the j-th original word is a verb, obtaining the position information of the j-th original word, the (j-1)-th original word and the (j+1)-th original word in the i-th sentence;

encoding the position information of the j-th original word, the (j-1)-th original word and the (j+1)-th original word in the i-th sentence into vectors and splicing the vectors onto the word vector layer, to obtain the word vector matrix with position information.

7. A data mining system based on interview information, characterized in that it comprises:

a corpus arrangement module, configured to obtain a target corpus and arrange the target corpus into M sentences, where M is a positive integer and the target corpus is interview information;

a model construction module, configured to establish a convolutional neural network (CNN) model according to the sentences;

a word vector acquisition module, configured to obtain the word vector matrix in the CNN model;

an attribute word acquisition module, configured to edit the word vector matrix with position information to obtain a word vector matrix with position information, and to train the CNN model with the word vector matrix with position information, so that the CNN model outputs the attribute words in the target corpus, where the attribute words include words with a category attribute and words with a position attribute;

a target screening module, configured to obtain, according to the attribute words, the interviewees with a target attribute in the target corpus.

8. The data mining system based on interview information according to claim 6, characterized in that the corpus arrangement module comprises:

a byte count setting unit, configured to set a preset byte count;

a sentence truncation unit, configured to truncate the target corpus according to the preset byte count to obtain M sentences with the same byte count.

9. A terminal device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the steps of the data mining method based on interview information according to any one of claims 1 to 6 are implemented.

10. A storage medium, the storage medium being a computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the data mining method based on interview information according to any one of claims 1 to 6 are implemented.