CN109272262B - Method for analyzing natural language features - Google Patents


Info

Publication number
CN109272262B
CN109272262B (application CN201811422169.0A)
Authority
CN
China
Prior art keywords
sentence
determining
word
information
vector
Prior art date
Legal status
Active
Application number
CN201811422169.0A
Other languages
Chinese (zh)
Other versions
CN109272262A (en)
Inventor
蒋万强
龙诗娥
侯健
成鸿丰
Current Assignee
Guangzhou Noobie Internet Technology Co ltd
Original Assignee
Guangzhou Noobie Internet Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Noobie Internet Technology Co ltd filed Critical Guangzhou Noobie Internet Technology Co ltd
Priority to CN201811422169.0A priority Critical patent/CN109272262B/en
Publication of CN109272262A publication Critical patent/CN109272262A/en
Application granted granted Critical
Publication of CN109272262B publication Critical patent/CN109272262B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/105Human resources
    • G06Q10/1053Employment or hiring


Abstract

The invention discloses a method for analyzing natural language features, and relates to the field of language analysis. The method comprises: obtaining natural language information of a tested person, and processing the natural language information to obtain processed language information; determining the number of sentences included in the language information; and determining the dimension information corresponding to each sentence by a natural language feature analysis method.

Description

Method for analyzing natural language features
Technical Field
The invention relates to the field of language analysis, in particular to a method for analyzing natural language features.
Background
Natural language is the means by which human beings communicate with each other, and natural language processing (NLP) is broadly defined as the automatic analysis, processing and manipulation of natural language, such as speech and text, by software. The most common natural language processing applications include text reading, speech synthesis, speech recognition, automatic Chinese word segmentation, part-of-speech tagging, syntactic analysis, natural language generation, text classification, information retrieval, information extraction, text proofreading, question-answering systems, machine translation, automatic summarization, textual entailment, and the like.
Talent evaluation analysis is in essence a structured theory and data model, while the basis of natural language processing is data and algorithms. By applying natural language processing analysis to the theoretical model of talent evaluation, talent evaluation analysis can be completed under the drive of big data.
Traditional talent evaluation analysis generally uses written examinations, expert interviews, performance appraisal, scales, situation simulation, system simulation and the like to complete the evaluation of a person. The written examination draws up questions according to the nature of the work, the condition requirements and the necessary theoretical knowledge of the post the candidate is to take up, and the tested person answers them in writing. The expert interview comprises conversation, question answering and the like: the chief examiner faces the examinee directly and forms the evaluation through spoken expression or actual operation. Performance appraisal reflects the ability of a talent through actual performance; the appraised person also writes a report, and a democratic review is performed by the superiors and peers whose work is most relevant to the appraised person. The scale decomposes a person's quality into a number of elements to form a standard evaluation system, asks superior leaders, peer employees and the person himself to score against the standard, and forms the evaluation through summary analysis. Situation simulation places the tested person in a simulated working situation, and observes and evaluates his psychology and ability there with various evaluation techniques. System simulation places the tested person in a computer-generated dynamic model close to an actual system, lets him play a certain role and work in human-computer dialogue mode, and the computer predicts his various potentials from his working behavior and actual results within a specified total time. Video off-line analysis records the working video of the tested person in the working scene, and experts carry out manual labeling analysis on the recorded video afterwards to form the evaluation of the person.
All of the examinations above are carried out on the premise that both the tester and the tested person know the purpose of the test, so both come to depend excessively on the test method.
Disclosure of Invention
The embodiment of the invention provides a natural language feature analysis method, which is used to solve the problem in the prior art that testers and tested persons depend excessively on the test method.
The embodiment of the invention provides a method for analyzing natural language features, which comprises the following steps: acquiring natural language information of a tested person, and processing the natural language information to obtain processed language information;
and determining the number of sentences included in the language information, and determining the dimension information corresponding to each sentence by a natural language feature analysis method.
Preferably, the dimension information includes an intelligibility degree;
the determining the intelligibility degree of each sentence through a natural language feature method comprises the following steps:
performing word segmentation or character segmentation on a first sentence included in the language information;
determining the Shannon information amount S_i of each valid participle or character included in the first sentence by the formula S_i = -log P_i;
determining the average Shannon information amount of the valid participles of the first sentence by the formula S = (S_1 + S_2 + ... + S_N)/N;
wherein log is a logarithmic function based on 2, 10, or the natural number e, P_i is the probability of occurrence of each valid participle or character, N is the total number of valid participles or characters of the first sentence, S_1 + S_2 + ... + S_N is the Shannon information amount of all valid participles or characters included in the first sentence, and S represents the intelligibility degree of the first sentence.
Preferably, the dimension information includes a concentration degree;
the determining the concentration degree corresponding to each sentence through a natural language feature method comprises the following steps:
determining a first sentence included in the language information as s_t, and determining the sentences preceding the first sentence as s_0, ..., s_{t-1};
using s_0, ..., s_{t-1} as the input of a neural network model such as a CNN, RNN or Transformer, and using the output of the neural network model as a vector representation of s_0, ..., s_{t-1} as a whole;
using s_t as the input of a neural network model such as a CNN, RNN or Transformer, and using the output of the neural network model as the vector representation of s_t;
determining C as the vector representation of s_0, ..., s_{t-1} as a whole and T as the vector representation of s_t, and either inputting C and T into a CNN, RNN or Transformer to calculate their concentration degree, or determining the cosine of the angle between C and T as the concentration degree of the first sentence.
Preferably, the dimension information includes a concentration degree;
the determining the concentration degree corresponding to each sentence through a natural language feature method comprises the following steps:
determining a first sentence included in the language information as s_t, and determining the sentences preceding the first sentence as s_0, ..., s_{t-1};
establishing a corpus in which the degree of correlation of two passages of text is marked, and training a neural network model (CNN, RNN or Transformer) with the corpus;
and inputting s_t and s_0, ..., s_{t-1} as two passages of text into the trained neural network model, and taking the obtained degree of correlation of the two passages as the concentration degree of the first sentence.
Preferably, the dimension information includes a concentration degree;
the determining the concentration degree corresponding to each sentence through a natural language feature method comprises the following steps:
determining a first sentence included in the language information as s_t, and determining the sentences preceding the first sentence as s_0, ..., s_{t-1};
determining the average value of the word vectors included in the first sentence as the vector representation of the first sentence;
determining the average value of the sentence vectors of s_0, ..., s_{t-1} as the vector representation of s_0, ..., s_{t-1} as a whole;
optionally using the vector representation of s_t as the input of a neural network model such as a CNN, RNN or Transformer, and using the output of the neural network model as a new vector representation of s_t;
determining C as the vector representation of s_0, ..., s_{t-1} as a whole and T as the vector representation of s_t, and determining the cosine of the angle between C and T as the concentration degree of the first sentence.
Preferably, the dimension information includes a semantic richness degree;
the determining the semantic richness degree corresponding to each sentence through a natural language feature method comprises the following steps:
performing word segmentation on a first sentence included in the language information, confirming that the first sentence includes n valid participles, and determining the i-th valid participle as w_i;
obtaining the word vector of w_i through a common word vector model, and determining the word vector of w_i as e_i;
determining the average of the word vectors of the n valid participles included in the first sentence as mu, where mu = (e_1 + ... + e_n)/n;
determining the Euclidean distance between the word vector of each word included in the first sentence and the average word vector mu, and determining the standard deviation of these Euclidean distances as the semantic richness degree of the first sentence.
Preferably, the dimension information includes a semantic richness degree;
the determining the semantic richness degree corresponding to each sentence through a natural language feature method comprises the following steps:
performing word segmentation on a first sentence included in the language information, confirming that the first sentence includes n valid participles, and determining the i-th valid participle as w_i;
obtaining the word vector of w_i through a common word vector model, and determining the word vector of w_i as e_i;
determining the average of the word vectors of the n valid participles included in the first sentence as mu, where mu = (e_1 + ... + e_n)/n;
determining the cosine similarity between the word vector of each word included in the first sentence and the average word vector mu, and determining the standard deviation of these cosine similarities as the semantic richness degree of the first sentence.
Preferably, the dimension information includes a difficulty variation degree;
the determining the difficulty variation degree corresponding to each sentence through a natural language feature method comprises the following steps:
performing word segmentation or character segmentation on a first sentence included in the language information, obtaining n participles or characters in total;
determining the Shannon information amount S_i of each valid participle or character by the formula S_i = -log P_i;
determining the difficulty variation degree of the first sentence by the formula sqrt(var);
where log is a logarithmic function based on 2, 10, or the natural number e, P_i is the probability of occurrence of each valid participle or character, var = mean((S_i - mu)^2, i), mu = mean(S_i, i), and mean(x_i, i) denotes averaging x_1, ..., x_n.
Preferably, the language information includes voice information and text information.
The embodiment of the invention provides a method for analyzing natural language features, comprising: acquiring natural language information of a tested person, and processing the natural language information to obtain processed language information; and determining the number of sentences included in the language information, and determining the dimension information corresponding to each sentence by a natural language feature analysis method. The method obtains the natural language information of the tested person in a natural working state, without changing the person's daily working habits, and evaluates the obtained information by the natural language feature method. It has the following advantages. Data acquisition is more natural, which reduces the distortion of test and analysis results caused by the subjective factors of the tested person or by the pressure of the test environment. Data analysis is more objective, which reduces the distortion caused by the subjective factors of the experts or evaluators on whom other traditional methods rely. Data analysis is more real-time: the voice data of the tested person is collected, processed and displayed instantly by the natural language real-time analysis device, so the evaluation result is available immediately, replacing the inefficient traditional test methods that consume large amounts of manpower, material resources and time. Data analysis is more scientific: the modeling is based on big-data analysis and is continuously optimized by artificial intelligence methods, so the results become more and more accurate as the number of tested persons grows.
The method can complete the evaluation analysis of the tested person efficiently and in real time, improve the efficiency of traditional talent evaluation, reduce the dependence on expert systems, and reduce the loss of reliability and validity of evaluation results caused by the subjectivity of experts and tested persons in traditional talent evaluation methods and by factors such as the psychological pressure of the evaluation environment. This solves the problem in the prior art that testers and tested persons depend excessively on the test system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for analyzing natural language features according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 exemplarily shows a flow chart of an analysis method for natural language features provided by an embodiment of the present invention, as shown in fig. 1, the method mainly includes the following steps:
step 101, acquiring natural language information of a tested person, and processing the natural language information to obtain processed language information;
step 102, determining the number of sentences included in the language information, and determining the dimension information corresponding to each sentence through a natural language feature analysis method.
In step 101, the voice information of the tested person may be acquired through a dedicated voice recording device, and the text information of the tested person may be acquired through a text recording device. It should be noted that the embodiment of the present invention does not limit the specific method of acquiring the voice information and the text information of the tested person.
Further, the acquired natural language information is processed, that is, the text information included in the natural language information is sorted into one category and the voice information into another. The embodiment of the present invention does not limit the specific method of classifying the text information and the voice information.
In step 102, the number of sentences in the text information or the voice information is determined according to each test rule, and the dimension information corresponding to each sentence is then analyzed in turn by the natural language feature analysis method.
In order to clearly describe the analysis method of natural language features provided by the embodiment of the present invention, the analysis of a first sentence is taken as an example below. It should be noted that the "first sentence" here does not mean the opening sentence of the information expressed by the tested person; it may be any sentence in that information.
In the embodiment of the invention, the dimension information corresponding to the first sentence mainly comprises the intelligibility degree, the concentration degree, the semantic richness degree and the difficulty variation degree.
Specifically, determining the intelligibility degree of the first sentence through the natural language feature method comprises the following steps:
step 201, performing word segmentation or character segmentation on a first sentence included in the language information;
step 202, determining the Shannon information amount S_i of each valid participle or character included in the first sentence by the formula S_i = -log P_i;
step 203, determining the average Shannon information amount of the valid participles of the first sentence by the formula S = (S_1 + S_2 + ... + S_N)/N;
in steps 202 and 203, log is a logarithmic function based on 2, 10, or the natural number e, P_i is the probability of occurrence of each valid participle or character, N is the total number of valid participles or characters of the first sentence, S_1 + S_2 + ... + S_N is the Shannon information amount of all valid participles or characters included in the first sentence, and S represents the intelligibility degree of the first sentence.
For example, the intelligibility degree is calculated for each sentence included in the language information as follows:
1.1 Perform word segmentation or character segmentation on the first sentence.
1.2 Calculate the Shannon information amount S_i of each valid participle or character W_i included in the first sentence by the formula S_i = -log P_i, where log is a logarithmic function that can be based on 2, 10, or the natural number e, and P_i is the probability of occurrence of the participle or character, i.e. its word frequency.
1.3 Calculate the average Shannon information amount S of the valid participles of the first sentence by the formula S = (S_1 + S_2 + ... + S_N)/N, where N is the number of valid participles or characters of the sentence and S_1 + S_2 + ... + S_N is the sum of the Shannon information amounts of all valid participles or characters in the sentence.
1.4 S indicates the intelligibility degree of the sentence.
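Steps 1.1-1.4 can be sketched in Python as follows; the base-2 logarithm, the toy corpus and the count floor for unseen tokens are illustrative assumptions, not details fixed by the patent:

```python
import math
from collections import Counter

def intelligibility(tokens, corpus_counts, corpus_total):
    """Average Shannon information S = (S_1 + ... + S_N) / N with S_i = -log2(P_i)."""
    infos = []
    for tok in tokens:
        # P_i is the token's corpus frequency; unseen tokens get a count floor of 1 (assumption)
        p = corpus_counts.get(tok, 1) / corpus_total
        infos.append(-math.log2(p))
    return sum(infos) / len(infos)

# Toy corpus standing in for the word-frequency statistics the method presupposes.
corpus = "the cat sat on the mat the cat ran".split()
counts, total = Counter(corpus), len(corpus)

common = intelligibility(["the"], counts, total)  # frequent word: low information
rare = intelligibility(["ran"], counts, total)    # rare word: high information
```

A sentence built from rare words thus carries more information per token and scores as harder to understand than one built from frequent words.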
Specifically, the concentration degree corresponding to the first sentence is determined by the natural language feature method in one of the following ways.
The first method comprises the following steps:
step 301-1, determining a first sentence included in the language information as s_t, and determining the sentences before the first sentence as s_0, ..., s_{t-1};
step 301-2, using s_0, ..., s_{t-1} as the input of a neural network model such as a CNN, RNN or Transformer, and using the output of the neural network model as the vector representation of s_0, ..., s_{t-1} as a whole;
step 301-3, using s_t as the input of a neural network model such as a CNN, RNN or Transformer, and using the output of the neural network model as the vector representation of s_t;
step 301-4, determining C as the vector representation of s_0, ..., s_{t-1} as a whole and T as the vector representation of s_t, and either inputting C and T into a CNN, RNN or Transformer to calculate their concentration degree, or determining the cosine of the angle between C and T as the concentration degree of the first sentence.
The second method comprises the following steps:
step 302-1, determining a first sentence included in the language information as s_t, and determining the sentences before the first sentence as s_0, ..., s_{t-1};
step 302-2, establishing a corpus in which the degree of correlation of two passages of text is marked, and training a neural network model (CNN, RNN or Transformer) with the corpus;
step 302-3, inputting s_t and s_0, ..., s_{t-1} as two passages of text into the trained neural network model, and taking the obtained degree of correlation of the two passages as the concentration degree of the first sentence.
The third method comprises the following steps:
step 303-1, determining a first sentence included in the language information as s_t, and determining the sentences before the first sentence as s_0, ..., s_{t-1};
step 303-2, determining the average value of the word vectors included in the first sentence as the vector representation of the first sentence;
step 303-3, determining the average value of the sentence vectors of s_0, ..., s_{t-1} as the vector representation of s_0, ..., s_{t-1} as a whole;
step 303-4, optionally using the vector representation of s_t as the input of a neural network model such as a CNN, RNN or Transformer, and using the output of the neural network model as a new vector representation of s_t;
step 303-5, determining C as the vector representation of s_0, ..., s_{t-1} as a whole and T as the vector representation of s_t, and determining the cosine of the angle between C and T as the concentration degree of the first sentence.
For example, with the first sentence denoted s_t and the sentence or sentences preceding it denoted s_0, ..., s_{t-1}, the concentration degree is calculated as follows:
2.1 For each i in {0, ..., t}, compute the vector representation of the sentence s_i:
2.1.1 perform word segmentation on s_i, and remove stop words as actually needed;
2.1.2 use the segmented sequence as the input of a neural network model, including but not limited to a CNN, RNN or Transformer, and use the output as the vector representation of s_i; the word vectors can also simply be averaged to give the vector representation of the sentence.
2.2 Obtain a vector representation of s_0, ..., s_{t-1} as a whole, in either of two ways:
2.2.1 use the individual vector representations of s_0, ..., s_{t-1} as the inputs of a neural network model, including but not limited to a CNN, RNN or Transformer, and use its output as the vector representation of s_0, ..., s_{t-1} as a whole;
2.2.2 alternatively, simply average the vectors of s_0, ..., s_{t-1} as the vector representation of the whole.
2.3 If 2.2.1 is adopted, the vector representation of s_t also undergoes the same network transformation as in 2.2.1, and the output is used as the new vector representation of s_t; if 2.2.2 is adopted, no further operation is needed.
2.4 Calculate the similarity: with C the vector representation of s_0, ..., s_{t-1} and T the vector representation of s_t, the cosine of the angle between C and T is defined as the similarity, with the formula cos(C, T) = <C, T> / (norm(C) * norm(T)), where <x, y> denotes the inner product and norm is the L2 norm.
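A minimal sketch of the 2.2.2 variant (averaging word vectors into sentence vectors, then the cosine of step 2.4); the random 8-dimensional embedding table is a stand-in assumption for a real word-vector model:

```python
import math
import random

def sentence_vec(tokens, word_vecs):
    """Sentence vector = componentwise average of its word vectors (the 2.2.2 shortcut)."""
    dim = len(next(iter(word_vecs.values())))
    return [sum(word_vecs[t][k] for t in tokens) / len(tokens) for k in range(dim)]

def cosine(a, b):
    """cos(a, b) = <a, b> / (norm(a) * norm(b)), with <,> the inner product, norm the L2 norm."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def concentration(context_sents, current_sent, word_vecs):
    """Concentration of s_t given s_0..s_{t-1}: cosine between context C and sentence T."""
    sent_vecs = [sentence_vec(s, word_vecs) for s in context_sents]
    dim = len(sent_vecs[0])
    C = [sum(v[k] for v in sent_vecs) / len(sent_vecs) for k in range(dim)]
    T = sentence_vec(current_sent, word_vecs)
    return cosine(C, T)

# Made-up 8-dimensional embedding table; real use would load word2vec or GloVe vectors.
random.seed(0)
word_vecs = {w: [random.gauss(0, 1) for _ in range(8)]
             for w in ["budget", "plan", "cost", "lunch"]}

same = concentration([["budget", "plan"]], ["budget", "plan"], word_vecs)  # repeats context
drift = concentration([["budget", "plan"]], ["lunch"], word_vecs)          # changes topic
```

Repeating the context verbatim yields a cosine of 1.0, while an off-topic sentence falls somewhere in [-1, 1] depending on the embeddings.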
Specifically, determining the semantic richness degree corresponding to the first sentence by the natural language feature method mainly comprises the following two methods.
The first method comprises the following steps:
step 401-1, performing word segmentation on a first sentence included in the language information, determining that the first sentence includes n valid participles, and determining the i-th valid participle as w_i;
step 401-2, obtaining the word vector of w_i through a common word vector model, and determining the word vector of w_i as e_i;
step 401-3, determining the average of the word vectors of the n valid participles included in the first sentence as mu, where mu = (e_1 + ... + e_n)/n;
step 401-4, determining the Euclidean distance between the word vector of each word included in the first sentence and the average word vector mu, and determining the standard deviation of these Euclidean distances as the semantic richness degree of the first sentence.
The second method comprises the following steps:
step 402-1, performing word segmentation on a first sentence included in the language information, confirming that the first sentence includes n valid participles, and determining the i-th valid participle as w_i;
step 402-2, obtaining the word vector of w_i through a common word vector model, and determining the word vector of w_i as e_i;
step 402-3, determining the average of the word vectors of the n valid participles included in the first sentence as mu, where mu = (e_1 + ... + e_n)/n;
step 402-4, determining the cosine similarity between the word vector of each word included in the first sentence and the average word vector mu, and determining the standard deviation of these cosine similarities as the semantic richness degree of the first sentence.
For example, the semantic richness degree of the first sentence is calculated as follows:
3.1 perform word segmentation on the first sentence and remove stop words, leaving the valid participles; denote the i-th participle as w_i, with n valid participles in total;
3.2 obtain the word vector of w_i with a common word vector model, denoted e_i;
3.3 calculate the average of all word vectors of the valid participles of the first sentence, denoted mu, with mu = (e_1 + ... + e_n)/n;
3.4 calculate the distance between the word vector of each word in the first sentence and mu, which can be the Euclidean distance or the cosine similarity, and denote the result corresponding to w_i as d_i;
3.5 the standard deviation of these distances is defined as the semantic richness degree of the first sentence, with the formula sqrt(var), where var = mean((d_i - md)^2, i), md = mean(d_i, i), and mean(x_i, i) denotes averaging x_1, ..., x_n.
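Steps 3.1-3.5 with the Euclidean-distance option can be sketched as below; the 2-dimensional word vectors are hypothetical values chosen only for illustration:

```python
import math

def semantic_richness(tokens, word_vecs):
    """Std. dev. of each word vector's Euclidean distance to the sentence mean vector mu."""
    vecs = [word_vecs[t] for t in tokens]
    n, dim = len(vecs), len(vecs[0])
    mu = [sum(v[k] for v in vecs) / n for k in range(dim)]        # mu = (e_1+...+e_n)/n
    d = [math.sqrt(sum((v[k] - mu[k]) ** 2 for k in range(dim)))  # d_i = ||e_i - mu||
         for v in vecs]
    md = sum(d) / n
    var = sum((di - md) ** 2 for di in d) / n                      # var = mean((d_i - md)^2)
    return math.sqrt(var)

# Hypothetical 2-d word vectors for illustration only.
word_vecs = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "philosophy": [-3.0, 4.0]}

flat = semantic_richness(["cat", "cat", "cat"], word_vecs)         # one repeated word
varied = semantic_richness(["cat", "dog", "philosophy"], word_vecs)
```

A sentence that repeats one word has richness 0, since every distance to mu is identical; semantically scattered words raise the spread and hence the score.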
Specifically, the method for determining the difficulty variation degree corresponding to the first sentence through the natural language feature method mainly comprises the following steps:
step 501, performing word segmentation or word segmentation on a first sentence included in the language information, wherein n word segments or words are provided in total,
step 502, determining the shannon information quantity Si of each effective participle or word by a formula Si ═ logPi;
step 503, determining the difficulty variation degree of the first sentence through a formula sqrt (var);
in steps 502 and 503, log is a logarithmic function based on 2, 10, or a natural number e, Pi is the probability of occurrence of each valid participle or word, var mean ((S _ i-mu) ^2, i), mu mean (S _ i, i), mean (x _ i, i) means averaging x _ 0.
For example, the difficulty variation degree of the first sentence is calculated by the following specific steps:
4.1, segment the first sentence into words or characters, obtaining n in total;
4.2, compute the Shannon information content Si of each valid participle or character included in the first sentence by the formula Si = -log Pi, where log is a logarithmic function that may be based on 2, 10, or the natural number e, and Pi is the occurrence probability of the participle or character, i.e. its word frequency;
4.3, define the standard deviation of S_0, ..., S_{n-1} as the difficulty variation degree of the first sentence, with the formula sqrt(var), where var = mean((S_i - mu)^2, i), mu = mean(S_i, i), and mean(x_i, i) denotes the average of x_0, ..., x_{n-1}.
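As a minimal illustration of steps 4.1 to 4.3, the following sketch assumes segmentation has already been performed and that a corpus frequency table mapping each token to its occurrence probability is available; the table used in the example is hypothetical.

```python
import math

def difficulty_variation(tokens, freq_table, log_base=2):
    """Standard deviation of the Shannon information S_i = -log(P_i) of a
    sentence's tokens, per steps 4.1-4.3 above. `freq_table` maps each
    token to its occurrence probability P_i (its word frequency)."""
    info = [-math.log(freq_table[t], log_base) for t in tokens]   # S_i (step 4.2)
    mu = sum(info) / len(info)                          # mean(S_i, i)
    var = sum((s - mu) ** 2 for s in info) / len(info)  # mean((S_i - mu)^2, i)
    return math.sqrt(var)                               # sqrt(var) (step 4.3)
```

A sentence built entirely from tokens of similar frequency scores near zero, while mixing very common and very rare tokens drives the score up.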
In summary, an embodiment of the present invention provides a method for analyzing natural language features, including: acquiring natural language information of a tested person and processing it to obtain processed language information; and determining the number of sentences included in the language information and determining the dimension information corresponding to each sentence through a natural language feature analysis method. The method obtains the natural language information of the tested person in a natural working state, without changing the tested person's daily working habits, and evaluates and analyzes the obtained information through natural language feature methods. It has the following advantages. Data acquisition is more natural: distortion of test and analysis results caused by the tested person's subjective factors or by the pressure of a test environment is reduced. Data analysis is more objective: distortion caused by the subjective factors of the experts or teachers who perform the evaluation in traditional methods is reduced. Data analysis is more real-time: the tested person's voice data is collected, processed, and displayed instantly by the natural language real-time analysis device, replacing the inefficient traditional test methods that consume large amounts of manpower, material resources, and time. Data analysis is more scientific: the analysis model is built on big data and continuously optimized with artificial intelligence methods, so the analysis results become more accurate as the number of tested persons grows.
The method can efficiently complete the evaluation and analysis of the tested person in real time, improve the efficiency of traditional talent evaluation, reduce the degree of dependence on an expert system, and reduce the loss of reliability and validity of evaluation results caused by the subjectivity of experts and tested persons in traditional talent evaluation methods and by factors such as the psychological pressure of the evaluation environment. This solves the prior-art problem that testers and tested persons rely excessively on the test system.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A method for analyzing natural language features, comprising:
acquiring natural language information of a tested person, and processing the natural language information to obtain processed language information;
determining the number of sentences included in the language information, and determining the corresponding dimension information of each sentence through a natural language feature analysis method;
wherein the dimension information includes a concentration level;
the determining the concentration degree corresponding to each sentence through a natural language feature method comprises the following steps:
determining a first sentence included in the language information as s_t, and determining the sentences preceding the first sentence as s_0, ..., s_{t-1};
using s_0, ..., s_{t-1} as the input of a neural network model such as a CNN, RNN, or Transformer, and using the output of the neural network model as a vector representation of s_0, ..., s_{t-1} as a whole;
using s_t as the input of a neural network model such as a CNN, RNN, or Transformer, and using the output of the neural network model as the vector representation of s_t;
determining C as the vector representation of s_0, ..., s_{t-1} as a whole and T as the vector representation of s_t, and computing the concentration degree of the first sentence either by inputting C and T into a neural network model (CNN, RNN, or Transformer) or by determining the cosine of the angle between C and T.
2. The method of claim 1, wherein the dimensional information includes a degree of intelligibility;
the determining the intelligibility degree of each sentence through a natural language feature method comprises the following steps:
performing word or character segmentation on a first sentence included in the language information;
determining the Shannon information content Si of each valid participle or character included in the first sentence by the formula Si = -log Pi;
by the formula

S = (S_1 + S_2 + ... + S_N) / N

determining the average Shannon information content of the valid participles of the first sentence;
wherein log is a logarithmic function with base 2, 10, or the natural number e; Pi is the occurrence probability of each valid participle or character; N is the total number of valid participles or characters of the first sentence; S_1 + S_2 + ... + S_N is the Shannon information content of all valid participles or characters included in the first sentence; and S represents the intelligibility degree of the first sentence.
3. The method of claim 1, in which the dimensional information comprises a concentration level;
the determining the concentration degree corresponding to each sentence through a natural language feature method comprises the following steps:
determining a first sentence included in the language information as s_t, and determining the sentences preceding the first sentence as s_0, ..., s_{t-1};
establishing a corpus in which the degree of correlation between two text passages is labeled, and training a neural network model (CNN, RNN, or Transformer) with the corpus;
and inputting s_t and s_0, ..., s_{t-1} as two text passages into the trained neural network model, and taking the obtained degree of correlation between the two passages as the concentration degree of the first sentence.
4. The method of claim 1, in which the dimensional information comprises a concentration level;
the determining the concentration degree corresponding to each sentence through a natural language feature method comprises the following steps:
determining a first sentence included in the language information as s_t, and determining the sentences preceding the first sentence as s_0, ..., s_{t-1};
determining the average of the word vectors included in the first sentence as the vector representation of the first sentence;
determining the average of the sentence vectors of s_0, ..., s_{t-1} as the vector representation of s_0, ..., s_{t-1} as a whole;
or using s_t as the input of a neural network model such as a CNN, RNN, or Transformer, and using the output of the neural network model as the vector representation of s_t;
determining C as the vector representation of s_0, ..., s_{t-1} as a whole and T as the vector representation of s_t, and determining the cosine of the angle between C and T as the concentration degree of the first sentence.
5. The method of claim 1, wherein the dimension information comprises a semantic richness:
the determining the semantic richness degree corresponding to each sentence through a natural language feature method comprises the following steps:
segmenting a first sentence included in the language information, confirming that n effective segments are included in the first sentence, and determining the ith effective segment as w _ i;
obtaining the word vector of the w _ i through a common word vector model, and determining the word vector of the w _ i as e _ i;
determining the average word vector of the word vectors of the n valid participles included in the first sentence, and determining the average word vector as mu, wherein mu = (e_0 + ... + e_{n-1}) / n;
determining the Euclidean distance between the word vector of each word included in the first sentence and the average word vector mu, and determining the standard deviation of these Euclidean distances as the semantic richness of the first sentence.
6. The method of claim 1, wherein the dimension information comprises a semantic richness:
the determining the semantic richness degree corresponding to each sentence through a natural language feature method comprises the following steps:
segmenting a first sentence included in the language information, confirming that n effective segments are included in the first sentence, and determining the ith effective segment as w _ i;
obtaining the word vector of the w _ i through a common word vector model, and determining the word vector of the w _ i as e _ i;
determining the average word vector of the word vectors of the n valid participles included in the first sentence, and determining the average word vector as mu, wherein mu = (e_0 + ... + e_{n-1}) / n;
determining the cosine similarity between the word vector of each word included in the first sentence and the average word vector mu, and determining the standard deviation of these cosine similarities as the semantic richness of the first sentence.
7. The method of claim 1, wherein the dimensional information includes a degree of difficulty;
the determining the difficulty variation degree corresponding to each sentence through a natural language feature method comprises the following steps:
performing word or character segmentation on a first sentence included in the language information, obtaining n participles or characters in total;
determining the Shannon information content Si of each valid participle or character through the formula Si = -log Pi;
determining the difficulty variation degree of the first sentence through the formula sqrt(var);
wherein log is a logarithmic function with base 2, 10, or the natural number e; Pi is the occurrence probability of each valid participle or character; var = mean((S_i - mu)^2, i), mu = mean(S_i, i), and mean(x_i, i) denotes the average of x_0, ..., x_{n-1}.
8. The method of any one of claims 1 to 7, wherein the language information comprises voice information and text information.
CN201811422169.0A 2018-11-26 2018-11-26 Method for analyzing natural language features Active CN109272262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811422169.0A CN109272262B (en) 2018-11-26 2018-11-26 Method for analyzing natural language features

Publications (2)

Publication Number Publication Date
CN109272262A CN109272262A (en) 2019-01-25
CN109272262B true CN109272262B (en) 2022-04-01

Family

ID=65190805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811422169.0A Active CN109272262B (en) 2018-11-26 2018-11-26 Method for analyzing natural language features

Country Status (1)

Country Link
CN (1) CN109272262B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083826A (en) * 2019-03-21 2019-08-02 昆明理工大学 A kind of old man's bilingual alignment method based on Transformer model
CN113672698B (en) * 2021-08-01 2024-05-24 北京网聘信息技术有限公司 Intelligent interview method, system, equipment and storage medium based on expression analysis

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294660A (en) * 2012-02-29 2013-09-11 张跃 Automatic English composition scoring method and system
CN106446109A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 Acquiring method and device for audio file abstract
CN106997376A (en) * 2017-02-28 2017-08-01 浙江大学 A question-and-answer sentence similarity calculation method based on multi-stage features
CN107633472A (en) * 2017-10-31 2018-01-26 广州努比互联网科技有限公司 An internet learning method based on teaching event streams and real-time discussion
CN107679144A (en) * 2017-09-25 2018-02-09 平安科技(深圳)有限公司 News sentence clustering method, device and storage medium based on semantic similarity
CN108733653A (en) * 2018-05-18 2018-11-02 华中科技大学 A sentiment analysis method based on Skip-gram models fusing part-of-speech and semantic information
CN108874921A (en) * 2018-05-30 2018-11-23 广州杰赛科技股份有限公司 Method, apparatus, terminal device, and storage medium for extracting text feature words

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8306356B1 (en) * 2007-09-28 2012-11-06 Language Technologies, Inc. System, plug-in, and method for improving text composition by modifying character prominence according to assigned character information measures
JP6524008B2 (en) * 2016-03-23 2019-06-05 株式会社東芝 INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND PROGRAM


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Shared features dominate semantic richness effects for concrete concepts; Ray Grondin et al.; Journal of Memory and Language; Jan. 31, 2009; vol. 60, no. 1, pp. 1-19 *
Semantic similarity measurement based on a low-dimensional semantic vector model; Cai Yuanyuan et al.; Journal of University of Science and Technology of China; Sep. 15, 2016; vol. 46, no. 9, pp. 719-726 *
Research on Chinese text classification based on semantic similarity; Li Xiaojun; China Masters' Theses Full-text Database, Information Science and Technology; Apr. 15, 2018; no. 4; pp. I138-3522 *
Research on text representation techniques for readability assessment; Jiang Zhiwei; China Doctoral Dissertations Full-text Database, Information Science and Technology; Sep. 15, 2018; no. 9; pp. I138-48 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant