CN112487800A - Text processing method, device, server and storage medium - Google Patents

Text processing method, device, server and storage medium

Info

Publication number
CN112487800A
CN112487800A
Authority
CN
China
Prior art keywords
text
type
model
word
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910773380.5A
Other languages
Chinese (zh)
Other versions
CN112487800B (en)
Inventor
赵玲
刘国岭
柯俞嘉
王振蒙
王艺之
张英驰
董珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN201910773380.5A priority Critical patent/CN112487800B/en
Publication of CN112487800A publication Critical patent/CN112487800A/en
Application granted granted Critical
Publication of CN112487800B publication Critical patent/CN112487800B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the application discloses a text processing method, a text processing apparatus, a server, and a storage medium. The method can acquire a text to be processed; perform word segmentation on the text according to a preset strategy to obtain the words that form the text; extract features of the text according to those words to obtain a feature vector; determine the type of the text through a multi-level cascaded classification model based on the feature vector; and determine reply information corresponding to the text according to the type of the text. With this scheme, the type of the text can be determined accurately by fusing multiple cascaded classification models, and the corresponding reply information can be located quickly according to the text type, improving both the accuracy and the efficiency of replies.

Description

Text processing method, device, server and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a text processing method, apparatus, server, and storage medium.
Background
At present, in recruitment question-and-answer work, the administrative staff in charge of recruitment must deal with a large number of repeated questions from applicants about the position, the recruitment process, and the like, which greatly wastes labor. Alternatively, an intelligent customer-service terminal answers the relevant questions: it generally first analyzes the semantic information of a question and then retrieves a related answer from the corpus content of a local corpus according to that semantic information. With such simple semantic matching, some questions obtain no answer from the corpus, or the retrieved answer has low relevance to the question, i.e., low accuracy, so the answers provided by the intelligent customer-service terminal cannot fully meet user needs.
Disclosure of Invention
The embodiment of the application provides a text processing method, a text processing device, a server and a storage medium, which can improve the accuracy and efficiency of text processing.
In a first aspect, an embodiment of the present application provides a text processing method, including:
acquiring a text to be processed;
performing word segmentation processing on the text according to a preset strategy to obtain words forming the text;
extracting features of the text according to the words to obtain feature vectors;
determining the type of the text through a multi-level cascaded classification model based on the feature vector;
and determining reply information corresponding to the text according to the type of the text.
In some embodiments, the multi-level cascaded classification model includes a support vector machine model, a random forest model, a logistic regression model, a text classification model, and an extreme gradient boosting model, and the determining the type of the text through the multi-level cascaded classification model based on the feature vector includes:
respectively obtaining predicted values of the type of the text through a first-stage support vector machine model, a random forest model, a logistic regression model and a text classification model based on the feature vectors;
and determining the type of the text through an extreme gradient boosting model of a second level according to the predicted values.
In some embodiments, before obtaining the predicted value of the type of the text based on the feature vector through a support vector machine model, a random forest model, a logistic regression model, and a text classification model of a first stage, respectively, the method further includes:
obtaining a training sample;
carrying out augmentation processing of near-synonym replacement, dual translation and/or related search on the training sample to obtain a processed training sample;
performing word segmentation processing on the processed training sample according to the preset strategy to generate a word set containing a plurality of words;
obtaining a sample feature vector of the processed training sample according to the word set;
and training a support vector machine model, a random forest model, a logistic regression model, a text classification model, and an extreme gradient boosting model according to the sample feature vector.
In some embodiments, the obtaining the sample feature vector of the processed training sample from the set of words comprises:
obtaining the frequency of each word in the word set existing in each processed training sample;
screening out training samples containing the words in the word set from the processed training samples to obtain target training samples;
obtaining the reverse text frequency of the target training sample in the processed training sample;
generating a parameter corresponding to each word according to the frequency and the reverse text frequency;
and generating a sample feature vector according to the parameter corresponding to each word.
In some embodiments, performing the augmentation processing of near-synonym replacement on the training samples to obtain the processed training samples includes:
performing word segmentation processing on the training sample to obtain a plurality of words;
searching a near-synonym lexicon for words whose similarity with each word is greater than a preset threshold, to obtain candidate words;
and carrying out permutation and combination on the candidate words to generate a processed training sample.
In some embodiments, performing the augmentation processing of dual translation on the training samples to obtain the processed training samples includes:
acquiring the current language type of the training sample;
translating the training sample from the current language type to a target language type to obtain a translated training sample;
and translating the translated training sample from the target language type to the current language type to obtain a processed training sample.
In some embodiments, the determining reply information corresponding to the text according to the type of the text includes:
when the type of the text is a first type, obtaining sentence similarity and word similarity between the text and standard question-answer pairs in a preset database based on the feature vector;
determining the score of each standard question-answer pair according to the sentence similarity and the word similarity;
and taking the response of the standard question-answer pair with the highest score as the response information corresponding to the text.
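The claim above does not specify how sentence similarity and word similarity are computed or combined; the minimal sketch below assumes cosine similarity over feature vectors for sentence similarity, Jaccard overlap for word similarity, and hypothetical weights of 0.7 and 0.3 for the score:

```python
import math

def cosine(u, v):
    """Sentence similarity: cosine between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def jaccard(words_a, words_b):
    """Word similarity: overlap between the two word sets."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_answer(query_vec, query_words, qa_pairs, w_sent=0.7, w_word=0.3):
    """Score each standard question-answer pair by a weighted sum of
    sentence and word similarity, and return the top-scoring answer.
    Each pair is (question_feature_vector, question_words, answer)."""
    scores = [w_sent * cosine(query_vec, vec) + w_word * jaccard(query_words, words)
              for vec, words, _ in qa_pairs]
    return qa_pairs[max(range(len(scores)), key=scores.__getitem__)][2]

# Hypothetical standard question-answer pairs for illustration.
pairs = [([1, 0, 1], ["salary", "paid"], "Salary is paid monthly."),
         ([0, 1, 0], ["holiday"], "Holidays follow the statutory calendar.")]
answer = best_answer([1, 0, 1], ["when", "salary", "paid"], pairs)
```

The weights and the choice of Jaccard overlap are illustrative assumptions, not the patent's stated formula.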
In some embodiments, the determining reply information corresponding to the text according to the type of the text includes:
when the type of the text is a second type, acquiring the language type of the text;
when the translation is determined to be needed according to the language type of the text, translating the text to obtain a translated text;
and acquiring response information matched with the translated text from a preset corpus.
In a second aspect, an embodiment of the present application further provides a text processing apparatus, including:
the receiving module is used for acquiring a text to be processed;
the word segmentation module is used for carrying out word segmentation processing on the text according to a preset strategy to obtain words forming the text;
the extraction module is used for extracting the features of the text according to the words to obtain feature vectors;
a determination module for determining the type of the text through a multi-level cascaded classification model based on the feature vector;
and the feedback module is used for determining the reply information corresponding to the text according to the type of the text.
In some embodiments, the multi-level cascaded classification model includes a support vector machine model, a random forest model, a logistic regression model, a text classification model, and an extreme gradient boosting model, and the determination module is specifically configured to:
respectively obtaining predicted values of the type of the text through a first-stage support vector machine model, a random forest model, a logistic regression model and a text classification model based on the feature vectors;
and determining the type of the text through an extreme gradient boosting model of a second level according to the predicted values.
In some embodiments, the text processing apparatus further comprises:
the first acquisition module is used for acquiring a training sample;
the processing module is used for carrying out augmentation processing of synonym replacement, dual translation and/or related search on the training sample to obtain a processed training sample;
the generating module is used for carrying out word segmentation processing on the processed training sample according to the preset strategy to generate a word set containing a plurality of words;
the second acquisition module is used for acquiring the sample characteristic vector of the processed training sample according to the word set;
and the training module is used for training a support vector machine model, a random forest model, a logistic regression model, a text classification model, and an extreme gradient boosting model according to the sample feature vector.
In some embodiments, the second obtaining module is specifically configured to:
obtaining the frequency of each word in the word set existing in each processed training sample;
screening out training samples containing the words in the word set from the processed training samples to obtain target training samples;
obtaining the reverse text frequency of the target training sample in the processed training sample;
generating a parameter corresponding to each word according to the frequency and the reverse text frequency;
and generating a sample feature vector according to the parameter corresponding to each word.
In some embodiments, the processing module is specifically configured to:
performing word segmentation processing on the training sample to obtain a plurality of words;
searching a near-synonym lexicon for words whose similarity with each word is greater than a preset threshold, to obtain candidate words;
and carrying out permutation and combination on the candidate words to generate a processed training sample.
In some embodiments, the processing module is specifically configured to:
acquiring the current language type of the training sample;
translating the training sample from the current language type to a target language type to obtain a translated training sample;
and translating the translated training sample from the target language type to the current language type to obtain a processed training sample.
In some embodiments, the feedback module is specifically configured to:
when the type of the text is a first type, obtaining sentence similarity and word similarity between the text and standard question-answer pairs in a preset database based on the feature vector;
determining the score of each standard question-answer pair according to the sentence similarity and the word similarity;
and taking the response of the standard question-answer pair with the highest score as the response information corresponding to the text.
In some embodiments, the feedback module is specifically configured to:
when the type of the text is a second type, acquiring the language type of the text;
when the translation is determined to be needed according to the language type of the text, translating the text to obtain a translated text;
and acquiring response information matched with the translated text from a preset corpus.
In a third aspect, an embodiment of the present application further provides a server, which includes a memory and a processor, where the memory stores a computer program, and the processor executes any one of the text processing methods provided in the embodiment of the present application when calling the computer program in the memory.
In a fourth aspect, an embodiment of the present application further provides a storage medium, where the storage medium is used to store a computer program, and the computer program is loaded by a processor to execute any one of the text processing methods provided in the embodiment of the present application.
The method and apparatus of the embodiments can acquire a text to be processed, perform word segmentation on the text according to a preset strategy to obtain the words forming the text, extract features of the text according to those words to obtain a feature vector, determine the type of the text through a multi-level cascaded classification model based on the feature vector, and determine the reply information corresponding to the text according to the type of the text. With this scheme, the type of the text can be determined accurately by fusing multi-level cascaded classification models, and the corresponding reply information can be located quickly according to the text type, improving both the accuracy and the efficiency of replies in text processing.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic flowchart of a text processing method provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of data augmentation processing provided in the embodiment of the present application;
FIG. 3 is another schematic flow chart diagram of a text processing method provided in an embodiment of the present application;
FIG. 4 is another schematic flow chart diagram of a text processing method provided in the embodiment of the present application;
FIG. 5 is a schematic flow chart of an intent classification process provided by an embodiment of the present application;
FIG. 6 is a schematic flow chart of a QA question-answer provided by an embodiment of the present application;
fig. 7 is a schematic flowchart of chatting provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a text processing method according to an embodiment of the present application. The main body of the text processing method may be the text processing apparatus provided in the embodiments of the present application, or a server integrated with the text processing apparatus, where the text processing apparatus may be implemented in a hardware or software manner. The text processing method may include:
s101, obtaining a text to be processed.
For example, a text to be processed may be extracted from text data, or received from a client, where the client may run on a terminal such as a mobile phone or a computer; the text processing apparatus may receive text sent by a user through a client preset on the terminal. The text may include characters, letters, numbers, punctuation marks, and the like, and may be a question raised during recruitment, publicity, operation, a task, or shopping, a chat message, a freight consultation raised when sending an express delivery, or another type of text; the specific content is not limited here.
S102, performing word segmentation processing on the text according to a preset strategy to obtain words forming the text.
After the text is received, in order to facilitate feature extraction of the text, word segmentation processing may be performed on the text in advance, specifically, word segmentation processing may be performed on the text according to a preset policy, the preset policy may be flexibly set according to actual needs, and specific content is not limited here.
For example, the preset strategy may segment at fixed intervals of a preset number of characters, e.g., every 2 characters form one word, or every 3 characters form one word. The preset strategy may also divide the text evenly according to its total character count; for example, a text of 15 characters may be split evenly into words of 5 characters each. The preset strategy may also be random segmentation; for example, from a 15-character text, only 3 groups of 2-character words are extracted, or the text is split into one 2-character word, one 1-character word, one 9-character word, and one 3-character word. The preset strategy may also segment according to the entries of a dictionary or lexicon, for example segmenting "robot" into one word, "patent" into one word, and so on. The preset strategy may also be word segmentation based on artificial intelligence, and the like.
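Two of the strategies above can be sketched in code. This is a minimal illustration, not the patent's implementation: fixed-interval segmentation, and dictionary-based segmentation approximated here by greedy forward maximum matching:

```python
def segment_fixed(text, n):
    """Fixed-interval strategy: split text into chunks of n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]

def segment_dictionary(text, lexicon, max_len=4):
    """Dictionary/lexicon strategy, sketched as greedy forward maximum
    matching: take the longest prefix found in the lexicon; unmatched
    characters become single-character words."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Using the patent's own examples: "robot" (机器人) and "patent" (专利).
tokens = segment_dictionary("机器人专利", {"机器人", "专利"})
```

Real systems typically use a trained segmenter (e.g., a statistical or neural model) rather than pure maximum matching.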
After the text is segmented, one or more words composing the text are generated, where each word may include one or more characters.
And S103, extracting the features of the text according to the words to obtain feature vectors.
After word segmentation generates one or more words, a parameter corresponding to each word can be obtained, where each parameter identifies one word and may be a number or a character string that uniquely identifies it. For example, "we" may correspond to the parameter 0.1, "I" to 0.5, and so on.
At this time, feature extraction may be performed on the text according to words constituting the text, where the feature may be a keyword or a word with a specific meaning, and a feature vector is generated based on the extracted keyword or the word with the specific meaning, and the corresponding parameter.
And S104, determining the type of the text through a multi-stage cascading classification model based on the feature vector.
The classification models of the multi-level cascade can be flexibly set according to actual needs, for example, the classification models of the multi-level cascade include a Support Vector Machine (SVM) model, a random forest model, a logistic regression model, a text classification (FastText) model, an eXtreme Gradient Boosting (XGBoost) model, and the like, the Support Vector Machine model, the random forest model, the logistic regression model, and the text classification model form a first-level model, the eXtreme Gradient Boosting model is a second-level model, and the first-level model and the second-level model are cascaded.
In some embodiments, determining the type of the text through the multi-level cascaded classification model based on the feature vector may include: obtaining predicted values of the type of the text through the first-level support vector machine model, random forest model, logistic regression model, and text classification model, respectively, based on the feature vector; and determining the type of the text through the second-level extreme gradient boosting model according to the predicted values.
In order to improve the accuracy and efficiency of text type determination, the type of the text may be determined through a multi-level cascaded classification model, which is a trained model, wherein the type of the text may be flexibly set according to actual needs, for example, the type of the text may include an operation type, a consultation type, a question-and-answer type, a chatting type, or the like. The feature vector may be input into a first-level model, a first predicted value of a type of the text may be obtained through a first-level SVM model based on the feature vector, a second predicted value of the type of the text may be obtained through a first-level random forest model, a third predicted value of the type of the text may be obtained through a first-level logistic regression model, and a fourth predicted value of the type of the text may be obtained through a first-level FastText model, the first predicted value, the second predicted value, the third predicted value, and the fourth predicted value being included in the predicted values. After the four predicted values are obtained, the type of the text can be determined through the second-level XGboost model based on the first predicted value, the second predicted value, the third predicted value and the fourth predicted value, so that the current intention of the user can be predicted through the text. The stacking combination of different models is used in the process of intent classification, and the accuracy and generalization capability of intent classification are improved.
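The two-level cascade above can be sketched as a stacking ensemble. This is a minimal sketch under stated assumptions: scikit-learn stand-ins are used (FastText is omitted, and sklearn's `GradientBoostingClassifier` substitutes for XGBoost), synthetic vectors stand in for tf-idf features, and for brevity the meta-model is fit on in-sample first-level predictions, whereas production stacking would use out-of-fold predictions to avoid leakage:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic feature vectors; the three labels stand in for text types
# such as question-answer, chatting, and consultation.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First level: each model produces class-probability "predicted values".
first_level = [SVC(probability=True, random_state=0),
               RandomForestClassifier(random_state=0),
               LogisticRegression(max_iter=1000)]
for model in first_level:
    model.fit(X_train, y_train)
meta_train = np.hstack([m.predict_proba(X_train) for m in first_level])
meta_test = np.hstack([m.predict_proba(X_test) for m in first_level])

# Second level: a gradient-boosting model fuses the first-level predictions
# into the final text type.
meta_model = GradientBoostingClassifier(random_state=0)
meta_model.fit(meta_train, y_train)
types = meta_model.predict(meta_test)
```

`sklearn.ensemble.StackingClassifier` packages the same pattern, including out-of-fold prediction, in one estimator.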
Compared with a classification model based on rule matching, the multi-level cascaded classification model needs no large body of manually produced and maintained rules and generalizes better; compared with a deep-learning classification algorithm, it requires less data, trains faster, and occupies fewer computing resources; and compared with a single machine-learning classification model, it achieves higher accuracy and recall when classifying recruitment-related questions.
In some embodiments, before obtaining the predicted values of the type of the text through the first-level support vector machine model, random forest model, logistic regression model, and text classification model based on the feature vector, the text processing method may further include: obtaining training samples; performing augmentation processing of near-synonym replacement, dual translation, and/or related search on the training samples to obtain processed training samples; performing word segmentation on the processed training samples according to the preset strategy to generate a word set containing a plurality of words; acquiring sample feature vectors of the processed training samples according to the word set; and training the support vector machine model, random forest model, logistic regression model, text classification model, and extreme gradient boosting model according to the sample feature vectors.
In order to improve the accuracy of model detection, each model can be trained in advance. In order to enrich the training samples and improve the diversity of the training samples so as to improve the reliability of model training, the training samples can be subjected to preprocessing such as data augmentation. For example, as shown in fig. 2, one or more training samples (i.e., a small amount of original data) may be obtained, and each training sample may be subjected to augmentation processing of synonym replacement, dual translation and/or related search, respectively, to obtain processed training samples (i.e., a large amount of diversified data), so that the number of training samples and the diversity of the training samples may be increased before training the model, so as to improve the performance of each model in performing tasks such as classification or response.
In some embodiments, performing the augmentation processing of near-synonym replacement on the training samples to obtain the processed training samples may include: performing word segmentation on a training sample to obtain a plurality of words; searching a near-synonym lexicon for words whose similarity with each of those words is greater than a preset threshold, to obtain candidate words; and permuting and combining the candidate words to generate processed training samples.
For example, near-synonym replacement may proceed as follows: construct a near-synonym lexicon containing a plurality of words; segment a training sample into a plurality of words; compute the similarity between each segmented word and each word in the lexicon; and screen out the lexicon words whose similarity is greater than a preset threshold to obtain candidate words, where the preset threshold can be set flexibly according to actual needs and a similarity above it indicates that the two words are near-synonyms. The candidate words are then permuted and combined to generate the processed training samples; the rules of permutation and combination can be set flexibly according to actual needs, and the processed training samples may include a plurality of samples, i.e., they form a sentence set.
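The permutation-and-combination step can be sketched with a toy near-synonym lexicon. The lexicon entries below are hypothetical illustrations; a real system would populate the candidate lists by thresholding similarity against a synonym bank as described above:

```python
from itertools import product

# Hypothetical near-synonym candidates per word (each word's own form
# is kept as the first candidate).
SYNONYMS = {
    "salary": ["salary", "pay", "wage"],
    "paid": ["paid", "issued"],
}

def augment_by_synonyms(tokens):
    """Replace each token with its candidate near-synonyms and enumerate
    every combination, yielding the augmented sentence set."""
    candidates = [SYNONYMS.get(t, [t]) for t in tokens]
    return [" ".join(combo) for combo in product(*candidates)]

# One original sample fans out into 1 * 1 * 3 * 2 = 6 augmented samples.
samples = augment_by_synonyms(["when", "is", "salary", "paid"])
```

In practice the combinatorial fan-out is usually capped, since replacing many words at once can drift away from the original meaning.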
In some embodiments, performing the augmentation processing of dual translation on the training samples to obtain the processed training samples may include: acquiring the current language type of a training sample; translating the training sample from the current language type into a target language type to obtain a translated training sample; and translating the translated training sample from the target language type back into the current language type to obtain a processed training sample.
For example, dual translation may proceed as follows: first, obtain the current language type of a training sample, e.g., Chinese, Japanese, French, Korean, or English; then translate the training sample from the current language type into a target language type (likewise Chinese, Japanese, French, Korean, English, or the like) to obtain a translated training sample; finally, translate the translated training sample from the target language type back into the current language type to obtain the processed training sample.
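The round-trip mechanics can be sketched as follows. The translation tables are stand-ins invented for illustration; a real system would call a machine translation service for each direction, and the augmentation effect comes from near-synonyms in the source language collapsing to new surface forms on the way back:

```python
# Hypothetical per-token translation tables (zh -> en and en -> zh);
# real dual translation operates on whole sentences via an MT service.
ZH_TO_EN = {"工资": "salary", "薪水": "salary", "发放": "pay out"}
EN_TO_ZH = {"salary": "薪水", "pay out": "发放"}

def dual_translate(tokens):
    """Round-trip each token through the target language: current
    language -> target language -> current language."""
    return [EN_TO_ZH.get(ZH_TO_EN.get(t, t), t) for t in tokens]

# "工资" (salary) comes back as its near-synonym "薪水",
# yielding a paraphrased training sample.
result = dual_translate(["工资", "何时", "发放"])
```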
The manner of related search may include: augmenting the training samples through the related-search function of a search engine such as Google or Baidu; for example, the search engine can retrieve related training samples for question answering, chatting, freight consultation, shopping, and the like.
The processed training samples are obtained by carrying out augmentation processing of near-synonym replacement, dual translation and/or related search on the training samples, so that data augmentation based on a small amount of training samples is realized, a large amount of processed training samples are obtained, and the training samples for model training are greatly enriched.
After the processed training sample is obtained, word segmentation processing may be performed on the processed training sample according to a preset strategy, so as to generate a word set including a plurality of words, where the preset strategy is similar to the aforementioned preset strategy. And then acquiring a sample feature vector of the processed training sample according to the word set.
In some embodiments, obtaining the sample feature vector of the processed training samples according to the word set may include: acquiring the frequency with which each word in the word set occurs in each processed training sample; screening out, from the processed training samples, the training samples containing words in the word set to obtain target training samples; acquiring the reverse text frequency of the target training samples among the processed training samples; generating a parameter corresponding to each word according to the frequency and the reverse text frequency; and generating the sample feature vector according to the parameters corresponding to the words.
Specifically, the sample feature vector of a processed training sample can be obtained through the tf-idf weighting algorithm, a weighting technique used in information retrieval and text mining to evaluate how important a word is to a piece of text, or to one training sample among a plurality of training samples. The importance of a word increases in proportion to the number of times it appears in the text, but decreases in inverse proportion to the frequency with which it appears across the training samples.
In tf-idf, tf represents the term frequency: for a given document, the term frequency of a word is the frequency with which that word occurs in the document, that is, in this embodiment, the frequency with which a word occurs in a training sample. The raw occurrence count is normalized, since the same word is likely to occur more often in a longer document than in a shorter one regardless of its importance, and normalization prevents the measure from being biased toward longer documents. idf represents the reverse text frequency (inverse document frequency), a measure of the general importance of a word across the collection of documents.
First, the frequency with which each word in the word set occurs in each processed training sample may be obtained. For a word ti in a training sample dj, the frequency (i.e., the word frequency) of ti in dj is calculated by the following formula:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
In the above formula, tf_{i,j} represents the word frequency of the word ti in the training sample dj, n_{i,j} represents the number of times ti occurs in dj, and Σ_k n_{k,j} represents the sum of the occurrence counts of all words in dj. For example, when the training sample dj is segmented into 3 words, k ranges over those 3 words, and Σ_k n_{k,j} is the sum of the occurrence counts of these 3 words in dj.
Then, the training samples containing the words in the word set are screened out from the processed training samples to obtain target training samples, and the reverse text frequency of the target training samples among the processed training samples is obtained. Here, the reverse text frequency (inverse document frequency, idf) is a measure of the general importance of a word. For a word ti, its reverse text frequency can be obtained by dividing the total number of training samples by the number of target training samples containing ti and taking the logarithm of the resulting quotient, as follows:
idf_i = log( |D| / |{ j : ti ∈ dj }| )
where idf_i represents the reverse text frequency, |D| represents the total number of training samples, and |{ j : ti ∈ dj }| denotes the number of target training samples containing the word ti (i.e., the number of training samples for which n_{i,j} ≠ 0).
Since this denominator is zero if the word ti does not appear in any of the training samples, the following smoothed form can be used instead:
idf_i = log( |D| / (1 + |{ j : ti ∈ dj }|) )
After the frequency tf_{i,j} with which the word ti occurs in a training sample dj and the reverse text frequency idf_i are obtained, the parameter a corresponding to the word is calculated from them by the following formula: a = tf_{i,j} × idf_i.
Finally, after the frequency of each word of the word set in each training sample, and the reverse text frequency of the target training samples containing that word among the plurality of training samples, have been calculated as above, the parameter corresponding to each word can be generated from the frequency and the reverse text frequency, and the sample feature vector can then be generated from the parameters corresponding to the words.
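The tf formula and the smoothed idf formula above can be combined into a short sketch that, for each segmented training sample, produces the parameter a = tf × idf for each of its words. This is an illustrative implementation of the formulas, not the patent's actual code.

```python
import math

def tf_idf(samples):
    """For each segmented training sample (a list of words), compute the
    parameter a = tf_{i,j} * idf_i for each of its words, using the smoothed
    idf so the denominator is never zero."""
    total = len(samples)  # |D|, the total number of training samples
    vectors = []
    for sample in samples:
        counts = {}
        for word in sample:
            counts[word] = counts.get(word, 0) + 1
        vec = {}
        for word, n in counts.items():
            tf = n / len(sample)                      # tf_{i,j} = n_{i,j} / sum_k n_{k,j}
            containing = sum(1 for s in samples if word in s)
            idf = math.log(total / (1 + containing))  # idf_i = log(|D| / (1 + |{j : ti in dj}|))
            vec[word] = tf * idf
        vectors.append(vec)
    return vectors
```

Note that the smoothing term makes idf exactly zero for a word that appears in all but one of the samples; in practice the exact smoothing variant is a design choice.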
It should be noted that, besides obtaining the sample feature vector of the processed training samples through tf-idf, the sample feature vector may also be obtained through a word2vec model or in other manners; the specific manner is not limited herein.
After the sample feature vectors are obtained, the support vector machine model, the random forest model, the logistic regression model, the text classification model and the extreme gradient boosting model can be trained according to the sample feature vectors.
Firstly, the sample feature vector is input into the SVM model for prediction to obtain a first training prediction value; the sample feature vector is input into the random forest model for prediction to obtain a second training prediction value; the sample feature vector is input into the logistic regression model for prediction to obtain a third training prediction value; and the sample feature vector is input into the FastText model for prediction to obtain a fourth training prediction value. The four prediction values, namely the first, second, third and fourth training prediction values, are then concatenated into a prediction feature vector, which is input into the second-level XGBoost model for training to obtain the type prediction value corresponding to the training sample.
After training is finished, prediction values of the type of the text can be obtained, based on the feature vector, through the trained first-level SVM model, random forest model, logistic regression model and FastText model respectively, and the type of the text can then be determined, according to those prediction values, through the trained second-level XGBoost model.
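A minimal sketch of this two-level stacking scheme on synthetic data follows. FastText and XGBoost are not scikit-learn models, so the sketch omits the first-level text classifier and lets a GradientBoostingClassifier stand in for the second-level XGBoost model; it illustrates the stacking idea, not the patent's exact models.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))                 # sample feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # 0 = chatting type, 1 = QA type

# First level: each model is fitted on the sample feature vectors.
level_one = [SVC(probability=True, random_state=0),
             RandomForestClassifier(random_state=0),
             LogisticRegression()]
for model in level_one:
    model.fit(X, y)

# Concatenate the first-level prediction values into a prediction feature vector.
stacked = np.hstack([m.predict_proba(X) for m in level_one])

# Second level: the meta-model is trained on the concatenated predictions.
level_two = GradientBoostingClassifier(random_state=0)
level_two.fit(stacked, y)

def predict_type(x):
    """Run a single feature vector through both levels of the cascade."""
    feats = np.hstack([m.predict_proba(x.reshape(1, -1)) for m in level_one])
    return int(level_two.predict(feats)[0])
```

In production, the first-level models would be fitted on held-out folds before generating the stacked features, to avoid the meta-model overfitting to in-sample predictions.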
S105, determining the reply information corresponding to the text according to the type of the text.
In some implementations, determining the reply information corresponding to the text according to the type of the text may include: when the type of the text is a first type, obtaining sentence similarity and word similarity between the text and the standard question-answer pairs in a preset database based on the feature vector; determining the score of each standard question-answer pair according to the sentence similarity and the word similarity; and taking the reply of the standard question-answer pair with the highest score as the reply information corresponding to the text.
The first type may be a question-answer type (QA type). When the type of the text is the question-answer type, the sentence similarity and the word similarity between the text and the standard question-answer pairs in the preset database may be obtained based on the feature vector, where a standard question-answer pair may be a correspondence between a question and an answer (i.e., reply information).
For example, an edit distance, a Euclidean distance, a Hamming distance, or the like between the text and the question sentences of the standard question-answer pairs in the preset database may be calculated, and the sentence similarity determined according to these distances between the texts; likewise, the edit distance, Euclidean distance or Hamming distance between the words in the text and the words contained in the questions of the standard question-answer pairs may be calculated, and the word similarity determined according to these distances between the words.
The edit distance between two texts may refer to the minimum number of editing operations required to convert one of the texts into the other. The larger the edit distance, the more the two texts differ; conversely, the smaller the edit distance, the less they differ. The editing operations may include replacing one character in a text with another character, inserting a character into a text, deleting a character from a text, and so on. Determining the edit distance between the text and the question sentence of a standard question-answer pair thus means determining the minimum number of editing operations required to change the text into that question sentence, and the edit distance can be used to measure the overall similarity of the two texts.
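The edit distance described above is the standard Levenshtein distance, which can be computed with a short dynamic-programming routine; the sketch below is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and replacements needed to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace (or match)
        prev = curr
    return prev[len(b)]
```

For example, `edit_distance("kitten", "sitting")` is 3 (two replacements and one insertion).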
The Euclidean distance between texts may be the straight-line distance between the two points corresponding to a first text and a second text in Euclidean space; in the embodiments of the present application, the Euclidean distance is used to measure the degree of difference between two texts, such as the text and the question sentence of a standard question-answer pair.
The Hamming distance may refer to the number of positions at which the corresponding characters of a first text and a second text differ; in the embodiments of the present application, it may be the number of replacements required to convert the text into the question sentence of a standard question-answer pair, and it can be used to measure the position-by-position consistency of two texts, such as the text and the question sentence of a standard question-answer pair.
Note that the edit distance, Euclidean distance, Hamming distance and the like between words can be understood by analogy with the corresponding distances between texts.
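The other two distances mentioned above can be sketched just as briefly; this is an illustrative sketch, with the Euclidean distance applied to equal-length feature vectors rather than raw strings.

```python
import math

def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which the corresponding characters of the two
    texts differ; defined only for texts of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length texts")
    return sum(ca != cb for ca, cb in zip(a, b))

def euclidean_distance(u, v) -> float:
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

Because the Hamming distance requires equal lengths, it is typically applied after padding or to fixed-length representations, whereas the edit distance handles texts of different lengths directly.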
At this time, the score of each standard question-answer pair may be determined according to the sentence similarity and the word similarity: the higher the similarity, the higher the score; conversely, the lower the similarity, the lower the score. The reply of the standard question-answer pair with the highest score can then be used as the reply information corresponding to the text, and the reply information can be fed back to the client according to actual requirements, so that the correct answer is selected from the standard question-answer pairs and returned to the user corresponding to the client.
For example, the sentence similarity and the word similarity may be concatenated to obtain a combined similarity, and the score of each standard question-answer pair may then be calculated by an SVM model dedicated to scoring, which is different from the SVM model used above for determining the type. The standard question-answer pairs are then sorted in descending order of score, the reply of the highest-scoring standard question-answer pair is screened out as the reply information corresponding to the text, and the reply information is fed back to the client according to actual requirements. Determining the score of each standard question-answer pair from features such as sentence similarity and word similarity, and screening the reply information based on those scores, can greatly improve the accuracy of the reply.
In some implementations, determining the response information corresponding to the text according to the type of the text may include: when the type of the text is a second type, acquiring the language type of the text; when the translation is determined to be needed according to the language type of the text, translating the text to obtain a translated text; and acquiring response information matched with the translated text from the preset corpus.
The second type may be a chatting type. When the type of the text is the chatting type, the language type of the text can be obtained and used to judge whether the text needs to be translated. When it is determined according to the language type that translation is needed, the text can be translated to obtain a translated text; a DeepQA model is then called to acquire, from the preset corpus, the reply information matched with the translated text, and the reply information is fed back to the client. When it is determined according to the language type that translation is not needed, the DeepQA model can be called to acquire, from the preset corpus, the reply information matched with the text itself, and the reply information can be fed back to the client according to actual requirements.
In the embodiments of the present application, the text to be processed can be obtained, word segmentation can be performed on the text according to a preset strategy to obtain the words forming the text, feature extraction can be performed on the text according to the words to obtain a feature vector, the type of the text can be determined through the multi-level cascaded classification model based on the feature vector, and the reply information corresponding to the text can be determined according to the type of the text. In this scheme, the type of the text can be accurately determined through the fusion of the multi-level cascaded classification models, and the corresponding reply information can be quickly located according to the type of the text, improving the accuracy and efficiency of the reply.
The text processing method described in the above embodiments will be described in further detail below.
Referring to fig. 2, fig. 2 is another schematic flow chart of a text processing method according to an embodiment of the present disclosure. The text processing method can be applied to a server, and will be described in detail below by taking a recruitment question and answer in the field of logistics as an example, as shown in fig. 2, the flow of the text processing method can be as follows:
S201, the server receives the question sent by the client.
The client may be a client on an intelligent customer service robot, and the server may receive, through the client, questions about recruitment consultation sent by users. A question may be one or more sentences, which may include one or more words; a word may include one or more characters, such as Chinese characters, letters, numbers, or punctuation marks.
S202, the server performs word segmentation processing on the question according to a preset strategy to obtain the words forming the question.
The server can perform word segmentation on the question according to the preset strategy to obtain the words forming the question. For example, for the question "What is the pay for this work", the words obtained by segmentation may include "this", "work", "pay", "what", and so on.
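The preset segmentation strategy is not spelled out in the text; one common dictionary-based approach that could serve here, especially for Chinese input, is forward maximum matching, sketched below with a small hypothetical vocabulary.

```python
def segment(text: str, vocab: set, max_len: int = 4) -> list:
    """Forward maximum matching: at each position, take the longest
    vocabulary entry (up to max_len characters), falling back to a
    single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

VOCAB = {"这份", "工作", "薪资", "多少"}
tokens = segment("这份工作薪资多少", VOCAB)   # ["这份", "工作", "薪资", "多少"]
```

In practice a production system would more likely use a mature segmenter (jieba, HanLP, or the like) rather than this minimal matcher.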
S203, the server performs feature extraction on the question according to the words to obtain a feature vector.
After performing word segmentation on a question and generating one or more words, the server may obtain a parameter corresponding to each word, where a parameter identifies one word and may be a number or character string that uniquely identifies that word. The server may then extract the features of the question according to the words forming it, where the features may be keywords or words with specific meanings, and generate the feature vector based on the extracted keywords or meaningful words and their corresponding parameters.
S204, the server obtains prediction values of the type of the question through the first-level SVM model, random forest model, logistic regression model and FastText model respectively, based on the feature vector.
S205, the server determines the type of the question through the second-level XGBoost model according to the prediction values.
For example, as shown in fig. 4, the server may perform intent classification processing according to the question input by the user, and determine whether the question is of the QA question-answer type (referred to as the question-answer type for short) or the chatting type, so as to give a corresponding reply according to the type of the question. To improve the accuracy and efficiency of determining the question type, the server may predict the user's current intent from the question through a stacked combination of multiple models.
For example, as shown in fig. 5, the server may input the feature vector into the first-level models: through the first-level SVM model, a first prediction value indicating that the type of the question is the question-answer type is obtained based on the feature vector; through the first-level random forest model, a second prediction value indicating the chatting type is obtained; through the first-level logistic regression model, a third prediction value indicating the question-answer type is obtained; and through the first-level FastText model, a fourth prediction value indicating the question-answer type is obtained. After the four prediction values are obtained, the type of the question can be determined to be the question-answer type through the second-level XGBoost model based on the first, second, third and fourth prediction values, so that the user's current intent is accurately predicted and the classification accuracy and recall rate for recruitment-related questions are improved.
By determining the type of the question through the first-level SVM model, random forest model, logistic regression model and FastText model together with the second-level XGBoost model, compared with rule matching, there is no need to manually write and maintain a large number of rules, and the generalization ability is stronger; compared with a deep-learning classification algorithm, less data is required, training is faster, and fewer computing resources are occupied; and compared with a single machine-learning classification model, the classification accuracy and recall rate for recruitment-related questions are higher.
S206, the server judges whether the type of the question is a question-answer type; if yes, go to step S207; if not, go to step S210.
S207, the server obtains sentence similarity and word similarity between the question and the standard question-answer pair in the preset database based on the feature vector.
When the type of the question is a question-answer type, the server can calculate the edit distance, the Euclidean distance or the Hamming distance between the question and the question of the standard question-answer pair in the preset database, and determine the sentence similarity according to the edit distance, the Euclidean distance or the Hamming distance between the questions and the like; and calculating the edit distance, the Euclidean distance or the Hamming distance between the words in the question and the words contained in the question of the standard question-answer pair in the preset database, and determining the similarity of the words according to the edit distance, the Euclidean distance or the Hamming distance between the words.
S208, the server determines the score of each standard question-answer pair according to the sentence similarity and the word similarity.
The server can determine the score of each standard question-answer pair according to the sentence similarity and the word similarity: the higher the similarity, the higher the score, and conversely the lower the similarity, the lower the score. For example, a mapping relationship among sentence similarity, word similarity and score may be established, where different similarities, or different similarity intervals, correspond to different scores. After the sentence similarity and the word similarity are obtained, the score of each standard question-answer pair can be determined according to this mapping relationship.
For another example, as shown in fig. 6, the server may splice the sentence similarity and the word similarity to obtain a spliced similarity, and calculate a score of each standard question-answer pair through an SVM model for determining the score, so as to perform a corresponding reply based on the scores.
S209, the server takes the response of the standard question-answer pair with the highest score as response information corresponding to the question, and feeds the response information back to the client.
The server can sort the standard question-answer pairs in descending order of score, screen out the reply of the top-ranked (highest-scoring) standard question-answer pair as the reply information corresponding to the question, and feed the reply information back to the client. Selecting the correct answer from the standard question-answer pairs and replying to the user corresponding to the client improves the accuracy of the reply.
S210, the server acquires the language type of the question.
S211, the server judges whether translation is needed according to the language type; if yes, go to step S212; if not, go to step S214.
When the type of the question is a chatting type, the server can acquire the language type of the question and judge whether the question needs to be translated or not based on the language type.
S212, the server translates the question to obtain the translated question.
S213, the server acquires the reply information matched with the translated question from the preset corpus and feeds the reply information back to the client.
For example, as shown in fig. 7, when it is determined according to the language type of the question that translation is required, the question may be translated to obtain the translated question. For instance, when the language type of the question is French but the preset corpus only contains material in Chinese and English, the question may be translated from French into Chinese. The DeepQA model is then called to acquire, from the preset corpus, the reply information matched with the translated question, and the reply information is fed back to the client.
S214, when the fact that translation is not needed is determined according to the language type of the question, response information matched with the question is obtained from the preset corpus, and the response information is fed back to the client.
It should be noted that, in order to improve the accuracy of replies to chatting-type questions, the reply information matched with the question may be corrected according to a preset correction strategy, and the corrected reply information fed back to the client. The preset correction strategy can be set flexibly according to actual needs, and the specific content is not limited herein. For example, when the question sent by the client is "what is your name", if the reply information obtained from the preset corpus is "small A" while the real name should be "small B", then "small A" may be corrected to "small B" according to the preset correction strategy, and the reply information "small B" fed back to the client.
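In its simplest form, such a preset correction strategy could be a substitution table applied to the matched reply; the names below are the hypothetical "small A"/"small B" examples from the text.

```python
# Hypothetical substitution table implementing a preset correction strategy.
CORRECTIONS = {"small A": "small B"}

def correct_reply(reply: str) -> str:
    """Apply every correction rule to the matched reply before it is
    fed back to the client."""
    for wrong, right in CORRECTIONS.items():
        reply = reply.replace(wrong, right)
    return reply
```

Richer strategies might match on regular expressions or on the question's detected intent rather than on literal substrings.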
In the embodiments of the present application, the server can receive the question sent by the client, perform word segmentation on the question according to the preset strategy, perform feature extraction on the question according to the words obtained by segmentation to obtain the feature vector, then obtain the prediction values of the type of the question through the first-level SVM model, random forest model, logistic regression model and FastText model respectively based on the feature vector, and determine the type of the question through the second-level XGBoost model according to the prediction values. The reply information corresponding to the question can then be determined according to the type of the question and fed back to the client. In this scheme, the type of the question can be accurately determined through the fusion of the multi-level cascaded classification models, and the corresponding reply information can be quickly located according to the type of the question, which avoids manually handling a large number of repetitive applicant questions in recruitment work, prevents considerable waste of labor cost, and improves the accuracy and efficiency of the reply.
In order to better implement the text processing method provided by the embodiment of the present application, an embodiment of the present application further provides a device based on the text processing method. The terms are the same as those in the text processing method, and specific implementation details can be referred to the description in the method embodiment.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present disclosure, wherein the text processing apparatus 300 may include a receiving module 301, a word segmentation module 302, an extraction module 303, a determination module 304, a feedback module 305, and the like.
The receiving module 301 is configured to obtain a text to be processed.
And the word segmentation module 302 is configured to perform word segmentation processing on the text according to a preset policy to obtain words forming the text.
And the extraction module 303 is configured to perform feature extraction on the text according to the words to obtain feature vectors.
A determining module 304, configured to determine the type of the text through a multi-level cascaded classification model based on the feature vectors.
And a feedback module 305, configured to determine reply information corresponding to the text according to the type of the text.
In some embodiments, the multi-level cascaded classification models include a support vector machine model, a random forest model, a logistic regression model, a text classification model and an extreme gradient boosting model, and the determination module 304 is specifically configured to: obtain prediction values of the type of the text through the first-level support vector machine model, random forest model, logistic regression model and text classification model respectively, based on the feature vector; and determine the type of the text through the second-level extreme gradient boosting model according to the prediction values.
In some embodiments, the text processing apparatus 300 further includes a first obtaining module, a processing module, a generating module, a second obtaining module, a training module, and the like, which may specifically be as follows:
the first acquisition module is used for acquiring a training sample;
the processing module is used for performing near-synonym replacement, dual translation and/or related-search augmentation on the training samples to obtain processed training samples;
the generating module is used for carrying out word segmentation processing on the processed training sample according to a preset strategy to generate a word set containing a plurality of words;
the second acquisition module is used for acquiring the sample characteristic vector of the processed training sample according to the word set;
and the training module is used for training the support vector machine model, the random forest model, the logistic regression model, the text classification model and the extreme gradient boosting model according to the sample feature vectors.
In some embodiments, the second obtaining module is specifically configured to: acquiring the frequency of each word in the word set in each processed training sample; screening out training samples containing words in the word set from the processed training samples to obtain target training samples; acquiring the reverse text frequency of a target training sample in the processed training sample; generating a parameter corresponding to each word according to the frequency and the reverse text frequency; and generating a sample feature vector according to the parameter corresponding to each word.
In some embodiments, the processing module is specifically configured to: perform word segmentation processing on a training sample to obtain a plurality of words; search a near-synonym lexicon for words whose similarity with each of those words is greater than a preset threshold to obtain candidate words; and permute and combine the candidate words to generate processed training samples.
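The permutation-and-combination step of near-synonym replacement can be sketched as follows; the `SYNONYMS` table is a hypothetical stand-in for the near-synonym lexicon lookup with a similarity threshold.

```python
from itertools import product

# Hypothetical lexicon lookup result: each word maps to itself plus its
# near-synonyms whose similarity exceeds the preset threshold.
SYNONYMS = {"pay": ["pay", "salary", "wage"], "work": ["work", "job"]}

def augment(words):
    """Generate every combination of candidate words, one augmented
    sample per combination."""
    candidates = [SYNONYMS.get(w, [w]) for w in words]
    return [" ".join(combo) for combo in product(*candidates)]
```

For a segmented sample like `["this", "work", "pay"]`, two candidates for "work" and three for "pay" yield 2 × 3 = 6 augmented samples, which is how a small seed set expands into a larger training set.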
In some embodiments, the processing module is specifically configured to: acquiring a current language type of a training sample;
translating the training sample from the current language type to a target language type to obtain a translated training sample; and translating the translated training sample from the target language type to the current language type to obtain a processed training sample.
In some embodiments, the feedback module 305 is specifically configured to: when the type of the text is a first type, obtaining sentence similarity and word similarity between the text and a standard question-answer pair in a preset database based on the characteristic vector; determining the score of each standard question-answer pair according to the sentence similarity and the word similarity; and taking the response of the standard question-answer pair with the highest score as the response information corresponding to the text.
In some embodiments, the feedback module 305 is specifically configured to: when the type of the text is a second type, acquiring the language type of the text; when the translation is determined to be needed according to the language type of the text, translating the text to obtain a translated text; and acquiring response information matched with the translated text from the preset corpus.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
In the embodiment of the application, the receiving module 301 may obtain a text to be processed, the word segmentation module 302 performs word segmentation on the text according to a preset strategy to obtain words forming the text, the extracting module 303 performs feature extraction on the text according to the words to obtain feature vectors, the determining module 304 determines the type of the text through a multi-stage cascade classification model based on the feature vectors, and the feedback module 305 may determine reply information corresponding to the text according to the type of the text. According to the scheme, the type of the text can be accurately determined through fusion of the multi-stage cascade classification models, the corresponding reply information can be quickly positioned according to the type of the text, and the accuracy and efficiency of information reply of text processing are improved.
The embodiment of the present application further provides a server. As shown in fig. 9, which illustrates a schematic structural diagram of the server according to the embodiment of the present application, the server may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the server architecture shown in fig. 9 does not limit the server, which may include more or fewer components than shown, combine some components, or arrange the components differently. Wherein:
the processor 401 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the server. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the server, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The server further includes a power supply 403 for supplying power to each component. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power-consumption management functions are implemented through the power management system. The power supply 403 may also include any combination of one or more DC or AC power sources, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and the like.
The server may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the server may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the server loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402 so as to implement the following functions: acquiring a text to be processed; performing word segmentation processing on the text according to a preset strategy to obtain words forming the text; extracting features of the text according to the words to obtain a feature vector; determining the type of the text through a multi-level cascaded classification model based on the feature vector; and determining the reply information corresponding to the text according to the type of the text.
In some embodiments, the multi-level cascaded classification model includes a support vector machine model, a random forest model, a logistic regression model, a text classification model, and an extreme gradient boosting model, and when determining the type of the text through the multi-level cascaded classification model based on the feature vector, the processor 401 may further perform:
respectively obtaining predicted values of the type of the text through a first-stage support vector machine model, random forest model, logistic regression model, and text classification model based on the feature vector; and determining the type of the text through a second-stage extreme gradient boosting model according to the predicted values.
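A hedged sketch of the two-stage cascade described above: four heterogeneous first-stage classifiers each emit a predicted value, and those predictions become the input features of a second-stage boosted model that makes the final decision. scikit-learn models stand in for the concrete classifiers (`MultinomialNB` substitutes for the unspecified text classification model, and `GradientBoostingClassifier` substitutes for an XGBoost-style extreme gradient boosting model); the data is synthetic, so nothing here reflects the patent's actual training setup.

```python
# Two-stage cascade sketch: first-stage predictions are stacked into new
# feature columns for a second-stage boosted model. All models and data
# are illustrative stand-ins, not the patented configuration.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB  # stand-in text classifier

rng = np.random.default_rng(0)
X = rng.random((200, 8))                     # synthetic feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # synthetic text-type labels

first_stage = [
    SVC(probability=True, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
    MultinomialNB(),
]
for model in first_stage:
    model.fit(X, y)

# Each first-stage model contributes one predicted-value column.
stacked = np.column_stack([m.predict(X) for m in first_stage])

# Second stage: the boosting model decides the final type.
second_stage = GradientBoostingClassifier(random_state=0)
second_stage.fit(stacked, y)
final_types = second_stage.predict(stacked)
print(float((final_types == y).mean()))
```

Stacking heterogeneous base models this way lets the second stage learn which classifier to trust for which region of the feature space, which is the usual motivation for a cascaded/ensembled design like the one claimed.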
In some embodiments, before obtaining the predicted value of the type of the text through the support vector machine model, the random forest model, the logistic regression model, and the text classification model of the first stage based on the feature vector, respectively, the processor 401 may further perform:
obtaining a training sample; performing augmentation processing of near-synonym replacement, dual translation, and/or related search on the training sample to obtain a processed training sample; performing word segmentation processing on the processed training sample according to a preset strategy to generate a word set containing a plurality of words; acquiring a sample feature vector of the processed training sample according to the word set; and training a support vector machine model, a random forest model, a logistic regression model, a text classification model, and an extreme gradient boosting model according to the sample feature vector.
In some embodiments, when obtaining the sample feature vectors of the processed training samples according to the word set, the processor 401 may further perform:
acquiring the frequency of each word in the word set in each processed training sample; screening out, from the processed training samples, the training samples containing words in the word set to obtain target training samples; acquiring the inverse text frequency of the target training samples among the processed training samples; generating a parameter corresponding to each word according to the frequency and the inverse text frequency; and generating a sample feature vector according to the parameter corresponding to each word.
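The per-word parameter built from the two quantities above is essentially a TF-IDF weight: the word's frequency within a sample multiplied by its inverse document (text) frequency across all samples. A minimal sketch with hypothetical sample data; the smoothing in the IDF term is one common choice, not necessarily the one used in this embodiment:

```python
# TF-IDF sketch matching the described steps: per-sample word frequency,
# inverse text frequency over the samples containing each word, and a
# feature vector of the products. Sample data is illustrative.
import math

samples = [["package", "delayed"], ["package", "lost"], ["refund", "request"]]
word_set = sorted({w for s in samples for w in s})

def tf_idf_vector(sample, samples, word_set):
    n = len(samples)
    vector = []
    for word in word_set:
        tf = sample.count(word) / len(sample)       # frequency in this sample
        df = sum(1 for s in samples if word in s)   # samples containing the word
        idf = math.log(n / (1 + df)) + 1            # smoothed inverse text frequency
        vector.append(tf * idf)                     # parameter for this word
    return vector

vec = tf_idf_vector(samples[0], samples, word_set)
print(len(vec))
```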
In some embodiments, when performing augmentation processing of near-synonym replacement on the training samples to obtain processed training samples, the processor 401 may further perform:
performing word segmentation processing on the training sample to obtain a plurality of words; searching a near-synonym lexicon for words whose similarity to each word is greater than a preset threshold to obtain candidate words; and permuting and combining the candidate words to generate processed training samples.
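The three steps above can be sketched as follows. The lexicon entries and similarity scores are hypothetical; a real system would query a near-synonym resource and a trained similarity measure.

```python
# Near-synonym augmentation sketch: for each word, keep candidates whose
# similarity exceeds a threshold, then enumerate the combinations.
# The lexicon and its similarity scores are hypothetical.
from itertools import product

synonym_lexicon = {
    "parcel": [("package", 0.9), ("box", 0.6)],
    "late": [("delayed", 0.85), ("slow", 0.7)],
}
THRESHOLD = 0.8

def augment(words):
    candidates = []
    for word in words:
        similar = [w for w, score in synonym_lexicon.get(word, [])
                   if score > THRESHOLD]
        candidates.append([word] + similar)   # keep the original word too
    # Permutation-and-combination of candidate words -> augmented samples.
    return [" ".join(combo) for combo in product(*candidates)]

augmented_samples = augment(["parcel", "late"])
print(augmented_samples)
```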
In some embodiments, when performing augmentation processing of dual translation on the training samples to obtain processed training samples, the processor 401 may further perform:
acquiring a current language type of a training sample; translating the training sample from the current language type to a target language type to obtain a translated training sample; and translating the translated training sample from the target language type to the current language type to obtain a processed training sample.
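Dual translation is the back-translation round trip described above: translate to a target language and back, so that paraphrases introduced by the translator become new training samples. A real system would call a machine-translation service; the `translate()` below is a toy dictionary stub used only to show the round trip, and the word tables are invented.

```python
# Back-translation ("dual translation") augmentation sketch. The two
# word tables and translate() are toy stand-ins for an MT service.
TO_TARGET = {"where": "ou", "is": "est", "my": "mon", "parcel": "colis"}
TO_SOURCE = {"ou": "where", "est": "is", "mon": "my", "colis": "package"}
# Note: "parcel" round-trips to "package" -- that drift is the point,
# since it yields a paraphrased variant of the original sample.

def translate(words, table):
    return [table.get(w, w) for w in words]

def dual_translate(sample):
    translated = translate(sample, TO_TARGET)   # current -> target language
    return translate(translated, TO_SOURCE)     # target -> current language

augmented = dual_translate(["where", "is", "my", "parcel"])
print(augmented)
```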
In some embodiments, when determining the reply information corresponding to the text according to the type of the text, the processor 401 may further perform:
when the type of the text is a first type, obtaining sentence similarity and word similarity between the text and standard question-answer pairs in a preset database based on the feature vector; determining a score for each standard question-answer pair according to the sentence similarity and the word similarity; and taking the answer of the highest-scoring standard question-answer pair as the reply information corresponding to the text.
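A hedged sketch of this first-type scoring: cosine similarity over feature vectors stands in for the sentence similarity, Jaccard overlap stands in for the word similarity, and the two are combined with weights of our own choosing; the question-answer pairs and weights are hypothetical, not the patent's actual database or scoring formula.

```python
# Score standard question-answer pairs by a weighted combination of
# sentence-level similarity (cosine over feature vectors) and word-level
# similarity (Jaccard overlap); reply with the highest-scoring answer.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard(words_a, words_b):
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_answer(query_vec, query_words, qa_pairs, w_sent=0.6, w_word=0.4):
    scored = []
    for question_vec, question_words, answer in qa_pairs:
        score = (w_sent * cosine(query_vec, question_vec)
                 + w_word * jaccard(query_words, question_words))
        scored.append((score, answer))
    return max(scored)[1]   # answer of the highest-scoring pair

qa_pairs = [
    ([1, 0, 1], ["where", "parcel"], "Track it on the app."),
    ([0, 1, 0], ["refund", "policy"], "Refunds take 7 days."),
]
print(best_answer([1, 0, 1], ["where", "parcel"], qa_pairs))
```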
In some embodiments, when determining the reply information corresponding to the text according to the type of the text, the processor 401 may further perform:
when the type of the text is a second type, acquiring the language type of the text; when it is determined from the language type that translation is needed, translating the text to obtain a translated text; and acquiring reply information matching the translated text from a preset corpus.
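The second-type path above (detect language, translate only when needed, then match the corpus) can be sketched as follows. Language detection, the translation table, and the corpus are all toy stand-ins; real systems would use a language identifier and an MT service here.

```python
# Sketch of the second-type reply path: detect the language type,
# translate when it differs from the supported language, then look up
# a matching reply in the preset corpus. All data is hypothetical.
CORPUS = {"hello": "Hi! How can I help you?"}
TRANSLATIONS = {"bonjour": "hello"}   # toy French -> English table

def language_of(text):
    # Stand-in language detection based on the toy table.
    return "fr" if text in TRANSLATIONS else "en"

def reply_second_type(text, supported_language="en"):
    if language_of(text) != supported_language:   # translation needed
        text = TRANSLATIONS.get(text, text)
    return CORPUS.get(text, "Sorry, no matching reply.")

print(reply_second_type("bonjour"))
```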
In the above embodiments, the descriptions each have their own emphasis; for parts not described in detail in a given embodiment, refer to the detailed description of the text processing method above. Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be performed by instructions, or by hardware controlled by instructions, where the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a storage medium storing a computer program that can be loaded by a processor to execute any of the text processing methods provided by the embodiments of the present application. For example, the computer program, when loaded by a processor, may perform the following steps: acquiring a text to be processed; performing word segmentation processing on the text according to a preset strategy to obtain words forming the text; extracting features of the text according to the words to obtain a feature vector; determining the type of the text through a multi-level cascaded classification model based on the feature vector; and determining the reply information corresponding to the text according to the type of the text.
The above operations can be implemented as described in the foregoing embodiments and are not detailed here.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute any of the text processing methods provided in the embodiments of the present application, they can achieve the beneficial effects of any of those methods; for details, see the foregoing embodiments. The text processing method, apparatus, server, and storage medium provided by the embodiments of the present application are described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are intended only to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make variations to the specific embodiments and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method of text processing, comprising:
acquiring a text to be processed;
performing word segmentation processing on the text according to a preset strategy to obtain words forming the text;
extracting features of the text according to the words to obtain feature vectors;
determining the type of the text through a multi-level cascaded classification model based on the feature vector;
and determining reply information corresponding to the text according to the type of the text.
2. The method of claim 1, wherein the multi-level cascaded classification model comprises a support vector machine model, a random forest model, a logistic regression model, a text classification model, and an extreme gradient boosting model, and wherein the determining the type of the text by the multi-level cascaded classification model based on the feature vector comprises:
respectively obtaining predicted values of the type of the text through a first-stage support vector machine model, a random forest model, a logistic regression model and a text classification model based on the feature vectors;
and determining the type of the text through a second-level extreme gradient boosting model according to the predicted value.
3. The text processing method according to claim 2, wherein before obtaining the predicted value of the type of the text based on the feature vector through a support vector machine model, a random forest model, a logistic regression model, and a text classification model of a first stage, respectively, the method further comprises:
obtaining a training sample;
carrying out augmentation processing of near-synonym replacement, dual translation and/or related search on the training sample to obtain a processed training sample;
performing word segmentation processing on the processed training sample according to the preset strategy to generate a word set containing a plurality of words;
obtaining a sample feature vector of the processed training sample according to the word set;
and training a support vector machine model, a random forest model, a logistic regression model, a text classification model and an extreme gradient boosting model according to the sample feature vector.
4. The method according to claim 3, wherein the performing augmentation processing of near-synonym replacement on the training sample to obtain the processed training sample comprises:
performing word segmentation processing on the training sample to obtain a plurality of words;
searching a near-synonym lexicon for words whose similarity to each word is greater than a preset threshold to obtain candidate words;
and permuting and combining the candidate words to generate a processed training sample.
5. The method of claim 3, wherein performing augmentation processing on the training samples for dual translation to obtain processed training samples comprises:
acquiring the current language type of the training sample;
translating the training sample from the current language type to a target language type to obtain a translated training sample;
and translating the translated training sample from the target language type to the current language type to obtain a processed training sample.
6. The text processing method according to any one of claims 1 to 5, wherein the determining reply information corresponding to the text according to the type of the text includes:
when the type of the text is a first type, obtaining sentence similarity and word similarity between the text and a standard question-answer pair in a preset database based on the feature vector;
determining the score of each standard question-answer pair according to the sentence similarity and the word similarity;
and taking the answer of the standard question-answer pair with the highest score as the reply information corresponding to the text.
7. The text processing method according to any one of claims 1 to 5, wherein the determining reply information corresponding to the text according to the type of the text includes:
when the type of the text is a second type, acquiring the language type of the text;
when the translation is determined to be needed according to the language type of the text, translating the text to obtain a translated text;
and acquiring reply information matching the translated text from a preset corpus.
8. A text processing apparatus, comprising:
the receiving module is used for acquiring a text to be processed;
the word segmentation module is used for carrying out word segmentation processing on the text according to a preset strategy to obtain words forming the text;
the extraction module is used for extracting the features of the text according to the words to obtain feature vectors;
a determination module for determining the type of the text through a multi-level cascaded classification model based on the feature vector;
and the feedback module is used for determining the reply information corresponding to the text according to the type of the text.
9. A server, characterized by comprising a processor and a memory, in which a computer program is stored, the processor executing the text processing method according to any one of claims 1 to 7 when calling the computer program in the memory.
10. A storage medium for storing a computer program which is loaded by a processor to perform the text processing method of any one of claims 1 to 7.
CN201910773380.5A 2019-08-21 2019-08-21 Text processing method, device, server and storage medium Active CN112487800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910773380.5A CN112487800B (en) 2019-08-21 2019-08-21 Text processing method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910773380.5A CN112487800B (en) 2019-08-21 2019-08-21 Text processing method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN112487800A true CN112487800A (en) 2021-03-12
CN112487800B CN112487800B (en) 2023-06-09

Family

ID=74919707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910773380.5A Active CN112487800B (en) 2019-08-21 2019-08-21 Text processing method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN112487800B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145573A (en) * 2017-05-05 2017-09-08 上海携程国际旅行社有限公司 The problem of artificial intelligence customer service robot, answers method and system
CN108647239A (en) * 2018-04-04 2018-10-12 顺丰科技有限公司 Talk with intension recognizing method and device, equipment and storage medium
CN109299476A (en) * 2018-11-28 2019-02-01 北京羽扇智信息科技有限公司 Question answering method and device, electronic equipment and storage medium
CN109522393A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Intelligent answer method, apparatus, computer equipment and storage medium


Also Published As

Publication number Publication date
CN112487800B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN106649818B (en) Application search intention identification method and device, application search method and server
CN111767403B (en) Text classification method and device
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN112052356B (en) Multimedia classification method, apparatus and computer readable storage medium
CN110597978B (en) Article abstract generation method, system, electronic equipment and readable storage medium
CN111309916B (en) Digest extracting method and apparatus, storage medium, and electronic apparatus
CN113064980A (en) Intelligent question and answer method and device, computer equipment and storage medium
CN110888970B (en) Text generation method, device, terminal and storage medium
CN112597292B (en) Question reply recommendation method, device, computer equipment and storage medium
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN112506864B (en) File retrieval method, device, electronic equipment and readable storage medium
CN116049376B (en) Method, device and system for retrieving and replying information and creating knowledge
CN110287270B (en) Entity relationship mining method and equipment
CN115827815B (en) Keyword extraction method and device based on small sample learning
CN112287656A (en) Text comparison method, device, equipment and storage medium
CN116484829A (en) Method and apparatus for information processing
CN116644148A (en) Keyword recognition method and device, electronic equipment and storage medium
CN112487800B (en) Text processing method, device, server and storage medium
CN110874408A (en) Model training method, text recognition device and computing equipment
CN115577109A (en) Text classification method and device, electronic equipment and storage medium
CN112528653B (en) Short text entity recognition method and system
CN110069780B (en) Specific field text-based emotion word recognition method
CN114328820A (en) Information searching method and related equipment
CN112905752A (en) Intelligent interaction method, device, equipment and storage medium
CN113761104A (en) Method and device for detecting entity relationship in knowledge graph and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant