CN113254587B - Search text recognition method and device, computer equipment and storage medium - Google Patents

Search text recognition method and device, computer equipment and storage medium

Info

Publication number
CN113254587B
Authority
CN
China
Prior art keywords: model, search text, sub-model, word segmentation, output result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110605909.XA
Other languages
Chinese (zh)
Other versions
CN113254587A (en)
Inventor
赵海林
魏强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202110605909.XA
Publication of CN113254587A
Application granted
Publication of CN113254587B
Legal status: Active
Anticipated expiration


Classifications

    • G06F 16/3344 Query execution using natural language analysis
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 40/247 Thesauruses; Synonyms
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 Semantic analysis
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the invention relates to a search text recognition method and apparatus, computer equipment and a storage medium, using a search text recognition model that comprises a first sub-model and a second sub-model. The method comprises the following steps: determining a word segmentation set and a synonym set corresponding to a search text to be recognized; performing multidimensional feature extraction on the word segmentation set to obtain a first input set and a third input set, wherein the first input set comprises two preset feature vector sets and the third input set comprises at least the first input set; taking the word segmentation set and the synonym set as a second input set; inputting the third input set into the second sub-model to obtain a third output result; and inputting the first input set, the second input set and the third output result into the first sub-model, so that the first sub-model outputs the recognition result of the search text. Through word segmentation and word expansion, the search text is no longer semantically sparse and its semantic range is broadened, so that short, semantically sparse search texts are recognized accurately as a whole.

Description

Search text recognition method and device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to the field of text classification, and in particular to a search text recognition method and apparatus, computer equipment and a storage medium.
Background
With the development of the Internet, people's daily life and work have become closely tied to it, and large amounts of data such as text, voice, images and video are generated online. Information search is one of the most important needs of Internet users: a user enters a search text into a search engine, and the search engine performs the corresponding search operation.
Recognizing and supervising search terms helps purify the network environment, improve the user experience, and avoid policy and legal risks. Recognizing a search text is essentially a text classification task; however, because search terms are relatively short and lack semantic information, common long-text classification techniques cannot handle this recognition task effectively, and whether a search text carries vulgar attributes cannot be judged accurately.
Disclosure of Invention
In view of this, in order to solve all or some of the above technical problems, embodiments of the present invention provide a search text recognition method and apparatus, a computer device, and a storage medium.
In a first aspect, an embodiment of the present invention provides a search text recognition method applied to a search text recognition model, where the search text recognition model includes a first sub-model and a second sub-model, and the method includes:
determining a word segmentation set and a synonym set corresponding to a search text to be recognized;
performing multidimensional feature extraction on the word segmentation set to obtain a first input set and a third input set, where the first input set includes two preset feature vector sets and the third input set includes at least the first input set;
taking the word segmentation set and the synonym set as a second input set;
inputting the third input set into the second sub-model to obtain a third output result;
and inputting the first input set, the second input set and the third output result into the first sub-model, so that the first sub-model outputs a recognition result of the search text.
In one possible implementation, determining the word segmentation set and the synonym set corresponding to the search text to be recognized includes:
performing word segmentation on the search text to obtain a word segmentation set containing a plurality of words and/or phrases; and performing semantic matching on each word and/or phrase in the word segmentation set, taking one or more synonyms whose semantic similarity to a word is greater than a set threshold, and/or one or more synonym phrases whose semantic similarity to a phrase is greater than the set threshold, as the synonym set corresponding to the word segmentation set.
In one possible implementation, the first input set includes: a first vector set and a second vector set;
the third input set includes at least two of: the first vector set, the second vector set, a third vector set, a fourth vector set, or a fifth vector set;
the first vector set includes: a word-dimension vector representation corresponding to each word in the word segmentation set;
the second vector set includes: a character-dimension vector representation corresponding to each word in the word segmentation set;
the third vector set includes: a text-dimension vector representation corresponding to the search text;
the fourth vector set includes: a text-distribution-dimension vector representation corresponding to the search text;
the fifth vector set includes: a word-segmentation-semantics vector representation corresponding to each word in the word segmentation set.
In one possible implementation, the first sub-model is a Wide & Deep model including a neural network part and a linear part;
inputting the first input set, the second input set and the third output result into the first sub-model so that the first sub-model outputs the recognition result of the search text includes:
inputting the first input set and the third output result into the neural network part, so that the neural network part outputs a first output result corresponding to the search text;
inputting the second input set and the third output result into the linear part, so that the linear part outputs a second output result corresponding to the search text;
and taking the first output result and the second output result as the recognition result of the search text.
In one possible implementation, the second sub-model is an Xgboost model, and the third output result output by the Xgboost model includes a probability value corresponding to the search text and leaf node number features corresponding to the word segmentation set and the synonym set;
inputting the first input set and the third output result into the neural network part, so that the neural network part outputs the first output result corresponding to the search text, includes:
inputting the first input set into the neural network part, so that the neural network part extracts local features of the word segmentation set from the first input set, fuses the local features to obtain global features of the word segmentation set, and combines the global features with the probability value in the third output result to obtain the first output result of the search text.
In one possible implementation, inputting the second input set and the third output result into the linear part, so that the linear part outputs the second output result corresponding to the search text, includes:
inputting the second input set into the linear part, so that the linear part obtains the second output result of the search text according to the word segmentation set and the synonym set, combined with the leaf node number features in the third output result.
In one possible implementation, the search text recognition model is constructed as follows:
for any search text in a training sample, word segmentation is performed on the search text to obtain a word segmentation set corresponding to the search text, and word expansion is performed on each word in the word segmentation set to obtain a synonym set corresponding to the word segmentation set; a first input set of the first sub-model is determined according to the word segmentation set, a second input set of the first sub-model is determined according to the word segmentation set and the synonym set, and a third input set of the second sub-model is determined according to the word segmentation set; the third input set is input into the second sub-model so that the second sub-model outputs a corresponding third output result; and the first sub-model is trained using the first input set, the second input set and the third output result, where training of the search text recognition model is determined to be complete when training of the first sub-model is complete.
In one possible implementation, the first sub-model is a Wide & Deep model including a neural network part and a linear part;
training the first sub-model using the first input set, the second input set and the third output result includes:
inputting the first input set and the third output result into the neural network part to train the neural network part; and inputting the second input set and the third output result into the linear part to train the linear part.
In one possible implementation, a loss function is constructed from the first output result of the neural network part and the second output result of the linear part, and the loss function is used to assist in training the first sub-model;
the condition for completion of training of the first sub-model includes: the loss function satisfies a preset convergence condition.
In a second aspect, an embodiment of the present invention provides a search text recognition apparatus applied to a search text recognition model, where the search text recognition model includes a first sub-model and a second sub-model, and the apparatus includes:
a set determining module, configured to determine a word segmentation set and a synonym set corresponding to a search text to be recognized;
an extraction module, configured to perform multidimensional feature extraction on the word segmentation set to obtain a first input set and a third input set, where the first input set includes two preset feature vector sets and the third input set includes at least the first input set;
the set determining module being further configured to take the word segmentation set and the synonym set as a second input set;
an output determining module, configured to input the third input set into the second sub-model to obtain a third output result;
and a result determining module, configured to input the first input set, the second input set and the third output result into the first sub-model, so that the first sub-model outputs the recognition result of the search text.
In a third aspect, an embodiment of the present invention provides a computer device, including a processor and a memory, where the processor is configured to execute a program stored in the memory so as to implement the search text recognition method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a storage medium storing one or more programs, where the one or more programs are executable by one or more processors to implement the search text recognition method according to any one of the first aspect.
According to the search text recognition scheme provided by the embodiment of the present invention, word segmentation and word expansion ensure that the search text is no longer semantically sparse and broaden its semantic range; multidimensional features are extracted from the word segmentation set and the synonym set, and the model classifies and recognizes the extracted feature vectors to obtain the recognition result of the search text, so that short, semantically sparse search texts are recognized accurately as a whole.
Drawings
FIG. 1 is a schematic flow chart of a method for constructing a search text recognition model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of performing word segmentation and expansion on the search text to obtain a word segmentation set and a synonym set corresponding to the search text according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of determining an input set according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of obtaining a third output result according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of training a first sub-model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a structure of a search text recognition model according to an embodiment of the present invention;
FIG. 7 is a flowchart of a method for identifying a search text according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an apparatus for recognizing a search text according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For the purpose of facilitating an understanding of the embodiments of the present invention, reference will now be made to the following description of specific embodiments, taken in conjunction with the accompanying drawings, which are not intended to limit the embodiments of the invention.
Fig. 1 is a flow chart of a method for constructing a search text recognition model according to an embodiment of the present invention; as shown in Fig. 1, the method specifically includes:
S11, for any search text in a training sample, word segmentation and expansion are performed on the search text to obtain a word segmentation set and a synonym set corresponding to the search text.
The method for constructing the search text recognition model is applied to training the search text recognition model. The search text recognition model includes a first sub-model and a second sub-model, and the first sub-model includes a neural network part and a linear part. For example, the first sub-model may be a Wide & Deep model and the second sub-model may be an Xgboost model; the neural network part may be the Deep side of the Wide & Deep model, and the linear part may be the Wide side of the Wide & Deep model.
Further, the second sub-model may be a pre-trained model, while the first sub-model is the model to be trained; construction of the search text recognition model is completed by training the first sub-model.
The search text in this embodiment may be a search term entered into a search engine, which is typically short and semantically sparse. The search text recognition model mainly recognizes search terms carrying special meanings; for example, it may recognize search terms carrying "soft pornography" information, and correspondingly, the training samples used to train the search text recognition model may be search texts labeled "soft pornography" in advance.
Further, for any search text in the training sample, word segmentation is performed on it; specifically, a word segmentation tool (such as jieba, SnowNLP, THULAC or NLPIR) may be used to segment the search text. After word segmentation, a word segmentation set containing a plurality of words and/or phrases is obtained.
Each word and phrase obtained by segmenting the search text is then expanded. The purpose of the expansion is to broaden the semantics of the search text: by performing synonym matching on each word and phrase, a plurality of synonyms corresponding to each word and a plurality of synonym phrases corresponding to each phrase are obtained, yielding the synonym set.
S12, a first input set of the first sub-model is determined according to the word segmentation set, a second input set of the first sub-model is determined according to the word segmentation set and the synonym set, and a third input set of the second sub-model is determined according to the word segmentation set.
In this embodiment, the inputs of the first sub-model and the second sub-model are determined in turn from the word segmentation set and the synonym set obtained from the search text in the above step; since the first sub-model is divided into a neural network part and a linear part, the first sub-model has two input sets.
Further, the first input set of the first sub-model (the input of the neural network part) is determined from the word segmentation set; the second input set of the first sub-model (the input of the linear part) is determined from the word segmentation set and the synonym set; and the third input set of the second sub-model is determined from the word segmentation set.
S13, inputting the third input set into the second sub-model so that the second sub-model outputs a corresponding third output result.
S14, training the first sub-model by using the first input set, the second input set and the third output result, and determining that the training of the search text recognition model is completed when the training of the first sub-model is completed.
Specifically, the training process for the first sub-model may include: inputting the first input set and the third output result to the neural network part, and training the neural network part; and inputting the second input set and the third output result into the linear part, and training the linear part.
In an alternative of the embodiment of the present invention, a loss function is constructed according to the first output result of the neural network part and the second output result of the linear part, and the first sub-model is trained with the assistance of the loss function.
Accordingly, S14 may specifically include the following substeps:
S141 (not shown in the figure), the first sub-model is trained with the first input set and the third output result as the input of the neural network part, and with the second input set and the third output result as the input of the linear part.
The third input set is input into the second sub-model so that the second sub-model outputs the corresponding third output result; that is, from the third input set the Xgboost model outputs a predicted probability value corresponding to the search text and leaf node number features corresponding to the word segmentation set and the synonym set.
The first sub-model is trained with the first input set and the third output result as the input of the neural network part of the first sub-model, and with the second input set and the third output result as the input of the linear part of the first sub-model.
S142 (not shown in the figure), during training of the first sub-model, a loss function is constructed from the first output result of the neural network part and the second output result of the linear part of the first sub-model; the loss function is used to characterize the training result of the search text recognition model.
S143 (not shown in the figure), training of the search text recognition model is assisted by the loss function until the loss function satisfies a preset convergence condition, at which point training of the search text recognition model is determined to be complete.
In this embodiment, training of the search text recognition model is completed by training the first sub-model, so the training result of the first sub-model serves as the training result of the search text recognition model. During training of the first sub-model, a loss function characterizing its training result is constructed from the first output result of the neural network part and the second output result of the linear part; this loss function likewise characterizes the training result of the search text recognition model.
Training of the search text recognition model is assisted by the loss function: when the loss function satisfies the preset convergence condition (for example, the value of the loss function no longer decreases), training of the search text recognition model is determined to be complete; otherwise, training continues by adjusting the parameters of the first sub-model until the loss function satisfies the preset convergence condition.
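The convergence check can be made concrete with a short sketch; the following fragment is illustrative only, and the train_one_epoch callable is a hypothetical stand-in for one pass of first-sub-model training.

```python
# Illustrative sketch of the preset convergence condition: stop adjusting the
# first sub-model's parameters once the loss value no longer decreases.
def train_until_converged(model, optimizer, train_one_epoch,
                          max_epochs=100, patience=3, min_delta=1e-6):
    best_loss, bad_epochs = float("inf"), 0
    for _ in range(max_epochs):
        epoch_loss = train_one_epoch(model, optimizer)  # returns mean epoch loss
        if epoch_loss < best_loss - min_delta:
            best_loss, bad_epochs = epoch_loss, 0       # still improving
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break   # loss no longer decreases: convergence condition met
    return model
```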
According to the method for constructing a search text recognition model provided by the embodiment of the present invention, word segmentation and word expansion ensure that the search text is no longer semantically sparse and broaden its semantic range; multidimensional features are extracted from the word segmentation set and the synonym set, and the model classifies the extracted feature vectors to obtain the recognition result of the search text, so that short, semantically sparse search texts are recognized accurately.
Fig. 2 is a schematic flow chart of performing word segmentation and expansion on the search text to obtain the word segmentation set and the synonym set corresponding to the search text according to an embodiment of the present invention; as shown in Fig. 2, the process specifically includes:
S21, word segmentation is performed on the search text to obtain a word segmentation set containing a plurality of words and/or phrases.
S22, semantic matching is performed on each word and/or phrase in the word segmentation set, and one or more synonyms whose semantic similarity to a word is greater than a set threshold, and/or one or more synonym phrases whose semantic similarity to a phrase is greater than the set threshold, are taken as the synonym set corresponding to the word segmentation set.
In this embodiment, a word segmentation tool is used to segment the search text into a set of words and/or phrases (hereinafter collectively referred to as tokens). The segmentation rule may be that each resulting word or phrase is minimal and cannot be split further. For example, for the search text "延禧攻略大结局" (the grand finale of the TV series "Story of Yanxi Palace"), the corresponding word segmentation set is: "延禧", "攻略" and "大结局".
Further, for each word and/or phrase in the word segmentation set, synonym matching is performed against a thesaurus, and one or more synonyms (or synonym phrases) whose similarity to each word (or phrase) is greater than a set threshold are taken as the synonym set corresponding to the word segmentation set.
During synonym matching, a maximum number of synonyms per word or phrase may be set, and the threshold may be adjusted according to the matching results. For example, with the maximum number of synonyms per word or phrase set to 100 and the threshold set to 80%, a matching result may be: "大结局" ("grand finale") matched with "最后一集" ("the last episode"), and so on.
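As an illustration of S21-S22, the following sketch performs the segmentation with jieba and the synonym matching against a pre-trained word-embedding model. This is a minimal sketch under stated assumptions: the embedding file name and helper functions are hypothetical, and cosine similarity over word vectors stands in for whatever thesaurus the word stock uses; the 0.8 threshold and the cap of 100 candidates mirror the example above.

```python
# Hypothetical sketch of S21-S22: word segmentation plus synonym expansion.
import jieba
from gensim.models import KeyedVectors

# Illustrative path; any pre-trained Chinese word vectors would do.
wv = KeyedVectors.load("zh_word_vectors.kv")

def segment(query: str) -> list[str]:
    """S21: split the search text into a minimal, non-splittable token set."""
    return [tok for tok in jieba.lcut(query) if tok.strip()]

def expand_synonyms(tokens: list[str], threshold: float = 0.8,
                    max_syn: int = 100) -> list[str]:
    """S22: keep up to max_syn candidates per token whose similarity
    exceeds the set threshold (0.8, per the example above)."""
    synonyms = []
    for tok in tokens:
        if tok not in wv:           # out-of-vocabulary tokens get no expansion
            continue
        for cand, sim in wv.most_similar(tok, topn=max_syn):
            if sim > threshold:
                synonyms.append(cand)
    return synonyms

tokens = segment("延禧攻略大结局")   # e.g. ["延禧", "攻略", "大结局"]
print(tokens, expand_synonyms(tokens))
```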
Fig. 3 is a schematic flow chart of determining the input sets according to an embodiment of the present invention; as shown in Fig. 3, the process specifically includes:
In this embodiment, the first input set includes two preset feature vector sets, which may be a first vector set and a second vector set; the third input set includes at least the first input set.
The vector sets in the input sets may be extracted using models. In one example, the first vector set and the second vector set may be extracted using a first model, the third vector set and the fourth vector set using a second model, and the fifth vector set using a third model, where the first vector set includes the word-dimension vector representation corresponding to each word in the word segmentation set; the second vector set includes the character-dimension vector representation corresponding to each word in the word segmentation set; the third vector set includes the text-dimension vector representation corresponding to the search text; the fourth vector set includes the text-distribution-dimension vector representation corresponding to the search text; and the fifth vector set includes the word-segmentation-semantics vector representation corresponding to each word in the word segmentation set.
In one example, the first vector set may be a set of token-level embedding vectors, the second vector set a set of char-level embedding vectors, the third vector set a set of query-level embedding vectors, the fourth vector set a set of query-channel distribution vectors, and the fifth vector set a set of DSSM token embedding vectors; the first model may be the ngram2vec model, the second model the FastText model, and the third model a DSSM model. The five vector sets above may also be obtained by other models (for example, a BERT model may extract the token-level embedding vectors) or by other technical means, which is not limited in this embodiment.
In the following, extraction of the five vector sets by the first model, the second model and the third model is described as an example.
S31, determining a first vector set and a second vector set corresponding to the word segmentation set by using a first model, and taking the first vector set and the second vector set as the first input set of the first sub-model.
S32, taking the word segmentation set and the synonym set as a second input set of the first sub-model.
S33, determining a third vector set and a fourth vector set corresponding to the word segmentation set by using a second model, determining a fifth vector set corresponding to the word segmentation set by using a third model, and taking part or all of the first vector set, the second vector set, the third vector set, the fourth vector set and the fifth vector set as a third input set of the second sub-model.
The first model in this embodiment may be the ngram2vec model. Each token in the word segmentation set is input into the ngram2vec model; the ngram2vec model extracts features of each token in the word dimension to obtain the token-level embedding vector corresponding to each token, and extracts features of each token in the character dimension to obtain the char-level embedding vector corresponding to each token. All token-level embedding vectors are taken as the first vector set, all char-level embedding vectors as the second vector set, the first vector set and the second vector set together as the first input set, and all tokens together with their synonym tokens as the second input set.
Further, the second model may be the FastText model, and the third model may be a DSSM model. All tokens in the word segmentation set are input into the FastText model, which extracts features from all tokens corresponding to the search text in the text dimension to obtain the query-level embedding vector corresponding to the search text, and extracts features from all tokens in the text-distribution dimension to obtain the query-channel distribution vector corresponding to the search text; the query-level embedding vector is taken as the third vector set, and the query-channel distribution vector as the fourth vector set. All tokens in the word segmentation set are also input into the DSSM model, which extracts features from all tokens in the word-segmentation-semantics dimension to obtain the DSSM token embedding vector corresponding to each token; all DSSM token embedding vectors are taken as the fifth vector set. Part or all of the first, second, third, fourth and fifth vector sets are taken as the third input set of the second sub-model. Taking vectors of multiple dimensions as the input of the second sub-model enables it to recognize the search text from some or all of the char, token, query-level, query-channel and DSSM token dimensions, which enhances the semantics of the search text during recognition and improves recognition accuracy.
For example, the third input set may include the first, second, third, fourth and fifth vector sets, or it may include only the first and second vector sets, and so on; the vector sets actually included in the third input set may be configured according to actual requirements, which is not specifically limited in this embodiment.
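To make the flow concrete, the following sketch assembles a third input set from the five vector sets. The three lookup functions are placeholders for the ngram2vec, FastText and DSSM models described above (they return random vectors here), and mean-pooling the per-token sets into one flat feature vector is an assumption about how the sets are combined for the second sub-model.

```python
# Hypothetical assembly of the third input set (S33); all embeddings are
# random placeholders standing in for ngram2vec, FastText and DSSM outputs.
import numpy as np

def ngram2vec_embed(tokens):
    """First and second vector sets: token-level and char-level embeddings."""
    return np.random.rand(len(tokens), 128), np.random.rand(len(tokens), 128)

def fasttext_embed(tokens):
    """Third and fourth vector sets: query-level and query-channel vectors."""
    return np.random.rand(128), np.random.rand(32)

def dssm_embed(tokens):
    """Fifth vector set: word-segmentation-semantics embeddings."""
    return np.random.rand(len(tokens), 128)

def build_third_input(tokens):
    token_vecs, char_vecs = ngram2vec_embed(tokens)
    query_vec, channel_vec = fasttext_embed(tokens)
    dssm_vecs = dssm_embed(tokens)
    # Mean-pool the per-token sets and concatenate everything into one flat
    # feature vector for the Xgboost sub-model (the pooling is an assumption).
    return np.concatenate([token_vecs.mean(axis=0), char_vecs.mean(axis=0),
                           query_vec, channel_vec, dssm_vecs.mean(axis=0)])

features = build_third_input(["延禧", "攻略", "大结局"])  # shape: (544,)
```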
Fig. 4 is a schematic flow chart of obtaining a third output result according to an embodiment of the present invention; as shown in Fig. 4, the process specifically includes:
S41, at least two vector sets in the third input set are combined into a multidimensional feature vector.
Further, the dimensionality of the feature vector may be determined according to the number of vector sets in the third input set; for example, the multidimensional feature vector may be a two-dimensional feature vector or a five-dimensional feature vector, and the dimensionality may be set according to actual requirements, which is not specifically limited in this embodiment.
S42, the multidimensional feature vector is input into the second sub-model so that the second sub-model outputs the third output result, where the third output result includes a predicted probability value corresponding to the search text and leaf node number features corresponding to the word segmentation set and the synonym set.
In this embodiment, the second sub-model is an Xgboost model. The number of trees in the Xgboost model may be set as n_estimators = 2000 and the depth of the trees as max_depth = 7. The token-level embedding vector, char-level embedding vector, query-level embedding vector, query-channel distribution vector and DSSM token embedding vector corresponding to each token in each search text are used as the input of the Xgboost model, and the Xgboost model outputs the predicted probability value of the search text and the leaf node number features corresponding to the word segmentation set and the synonym set.
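A minimal sketch of the second sub-model with the xgboost scikit-learn API is shown below, using the n_estimators = 2000 and max_depth = 7 settings quoted above; the training data is random placeholder data. predict_proba supplies the probability value, and apply returns one leaf index per tree, playing the role of the leaf node number features.

```python
# Sketch of the second sub-model: an XGBoost classifier that outputs both a
# probability value and leaf-index features; data here is a random placeholder.
import numpy as np
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=2000, max_depth=7)
X_train = np.random.rand(1000, 544)       # placeholder multidimensional features
y_train = np.random.randint(0, 2, 1000)   # placeholder binary labels
xgb.fit(X_train, y_train)

X_query = np.random.rand(1, 544)
prob = xgb.predict_proba(X_query)[:, 1]   # predicted probability value
leaf_ids = xgb.apply(X_query)             # leaf node indices, one per tree
```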
Fig. 5 is a schematic flow chart of training the first sub-model according to an embodiment of the present invention; as shown in Fig. 5, the process specifically includes:
S51, the first input set is input into the neural network part, so that the neural network part extracts local features of the word segmentation set from the first input set, fuses the local features to obtain global features of the word segmentation set, and combines the global features with the probability value in the third output result to obtain the first output result of the search text.
In this embodiment, the first input set (the token-level embedding vectors) is input into the neural network part of the first sub-model; local features of the token sequence are extracted by a CNN (convolutional neural network) in the neural network part, and the local features of the token sequence are then fused by an Attention mechanism in the neural network part to obtain global features.
Further, the predicted probability value output by the Xgboost model is concatenated with the CNN features, and the first output result of the search text is finally obtained through three fully connected layers. The first output result is the prediction of the neural network part of the first sub-model for the search text.
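A minimal PyTorch sketch of this Deep side is given below: a 1-D CNN extracts local features, an additive attention layer fuses them into a global feature, the Xgboost probability is concatenated, and three fully connected layers produce the logit. All layer sizes are illustrative assumptions, not values taken from the patent.

```python
# Illustrative sketch of the neural network part (Deep side):
# CNN -> Attention fusion -> concat Xgboost probability -> three FC layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSide(nn.Module):
    def __init__(self, emb_dim=128, conv_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=3, padding=1)
        self.attn = nn.Linear(conv_dim, 1)   # additive attention scores
        self.fc = nn.Sequential(
            nn.Linear(conv_dim + 1, 64), nn.ReLU(),  # +1 for the probability
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1))

    def forward(self, token_embs, xgb_prob):
        # token_embs: (batch, seq_len, emb_dim); xgb_prob: (batch, 1)
        local = F.relu(self.conv(token_embs.transpose(1, 2)))  # local features
        local = local.transpose(1, 2)                # (batch, seq_len, conv_dim)
        weights = torch.softmax(self.attn(local), dim=1)
        global_feat = (weights * local).sum(dim=1)   # fused global feature
        return self.fc(torch.cat([global_feat, xgb_prob], dim=1))  # logit_deep
```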
S52, the second input set is input into the linear part, so that the linear part obtains the second output result of the search text according to the word segmentation set and the synonym set, combined with the leaf node number features in the third output result.
The linear part of the first sub-model concatenates the second input set (the tokens and their synonym tokens) with the leaf node number features that the Xgboost model outputs for the word segmentation set and the synonym set, and obtains the second output result of the search text. The second output result is the prediction of the linear part of the first sub-model for the search text.
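For symmetry, a matching sketch of the Wide side: a single linear layer over a multi-hot vector built from the token and synonym ids together with the Xgboost leaf indices. The feature-space sizes are illustrative assumptions.

```python
# Illustrative sketch of the linear part (Wide side): one linear layer over
# sparse multi-hot features for tokens, synonyms and Xgboost leaf indices.
import torch
import torch.nn as nn

class WideSide(nn.Module):
    def __init__(self, n_token_features=50000, n_leaf_features=2000 * 128):
        super().__init__()
        # one weight per token/synonym id plus one per (tree, leaf) id
        self.linear = nn.Linear(n_token_features + n_leaf_features, 1)

    def forward(self, multi_hot):
        # multi_hot: (batch, n_token_features + n_leaf_features), 0/1 entries
        return self.linear(multi_hot)  # logit_wide
```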
S53, the first sub-model is trained according to each search text in the training samples.
S54, determining a fourth output result of the search text recognition model according to the first output result and the second output result.
The fourth output result of the first sub-model is the sum of the first output result and the second output result; the fourth output result of the first sub-model is also the fourth output result of the search text recognition model:
logit = logit_wide + logit_deep
where logit is the fourth output result, logit_deep is the first output result, and logit_wide is the second output result.
S55, constructing the loss function according to the first output result, the second output result and the fourth output result.
S56, training of the first sub-model is assisted by the loss function until the loss function satisfies a preset convergence condition, at which point training of the first sub-model is determined to be complete.
The loss function may be:
Loss = CE(logit_wide) + CE(logit_deep) + CE(logit)
where CE denotes the cross entropy.
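Written out in code (assuming binary labels and binary cross-entropy on the logits, which is an assumption about the exact form of CE), the loss might look like this:

```python
# Sketch of the joint loss: Loss = CE(logit_wide) + CE(logit_deep) + CE(logit).
import torch.nn.functional as F

def wide_deep_loss(logit_wide, logit_deep, labels):
    logit = logit_wide + logit_deep            # fourth output result
    ce = F.binary_cross_entropy_with_logits    # CE under a binary-label assumption
    return ce(logit_wide, labels) + ce(logit_deep, labels) + ce(logit, labels)
```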
According to the method for constructing a search text recognition model provided by the embodiment of the present invention, word segmentation and expansion ensure that the search text is no longer semantically sparse and broaden its semantic range; the generalization ability of the neural network part of the first sub-model is combined with the memorization ability of its linear part, and the linear part of the first sub-model is strengthened by the second sub-model, so that short, semantically sparse search texts are recognized accurately.
Fig. 7 is a flow chart of a method for recognizing a search text according to an embodiment of the present invention; as shown in Fig. 7, the method specifically includes:
S71, a word segmentation set and a synonym set corresponding to the search text to be recognized are determined.
S72, multidimensional feature extraction is performed on the word segmentation set to obtain a first input set and a third input set, where the first input set includes two preset feature vector sets and the third input set includes at least the first input set.
In this embodiment, the first input set includes two preset feature vector sets, which may be the word-dimension vector representation corresponding to each word in the word segmentation set and the character-dimension vector representation corresponding to each word in the word segmentation set.
In addition to the first input set, the third input set may further include part or all of the following: the text-dimension vector representation corresponding to the search text, the text-distribution-dimension vector representation corresponding to the search text, and the word-segmentation-semantics vector representation corresponding to each word in the word segmentation set (where "part" may be none).
Further, the first input set and the third input set are similar to those in Fig. 3; for details, refer to the related description of Fig. 3, which is not repeated here.
S73, the word segmentation set and the synonym set are taken as a second input set.
S74, inputting the third input set into the second sub-model to obtain a third output result.
S75, inputting the first input set, the second input set and the third output result into the first sub-model so that the first sub-model outputs the recognition result of the search text.
In this embodiment, the search text recognition model constructed in Figs. 1 to 5 is used to recognize the search text; the recognition process is partially similar to the construction process of the search text recognition model.
In an alternative of the embodiment of the present invention, S71 may include the following sub-steps:
S711 (not shown in the figure), word segmentation is performed on the search text to obtain a word segmentation set containing a plurality of words and/or phrases.
S712 (not shown in the figure), semantic matching is performed on each word and/or phrase in the word segmentation set, and one or more synonyms whose semantic similarity to a word is greater than a set threshold, and/or one or more synonym phrases whose semantic similarity to a phrase is greater than the set threshold, are taken as the synonym set corresponding to the word segmentation set.
In an alternative of the embodiment of the present invention, S75 may include the following sub-steps:
S751 (not shown in the figure), inputting the first input set and the third output result into the neural network part, so that the neural network part outputs the first output result corresponding to the search text.
S752 (not shown in the figure), inputting the second input set and the third output result into the linear part, so that the linear part outputs the second output result corresponding to the search text.
S753 (not shown in the figure), taking the first output result and the second output result as the recognition result of the search text.
S71 (S711-S712) is similar to S21-S22 in Fig. 2; S72-S73 are similar to S12 in Fig. 1 and S31-S33 in Fig. 3; S74 is similar to S41-S42 in Fig. 4; and S75 (S751-S753) is similar to S51-S54 in Fig. 5. For details, refer to the related descriptions of Figs. 2 to 5, which are not repeated here.
According to the search text recognition scheme provided by the embodiment of the present invention, word segmentation and word expansion ensure that the search text is no longer semantically sparse and broaden its semantic range; multidimensional features are extracted from the word segmentation set and the synonym set, and the model classifies and recognizes the extracted feature vectors to obtain the recognition result of the search text, so that short, semantically sparse search texts are recognized accurately as a whole.
Fig. 8 is a schematic structural diagram of a search text recognition apparatus according to an embodiment of the present invention. As shown in Fig. 8, the apparatus is applied to a search text recognition model, where the search text recognition model includes a first sub-model and a second sub-model, and the apparatus specifically includes:
a set determining module 81, configured to determine a word segmentation set and a synonym set corresponding to a search text to be recognized;
an extraction module 82, configured to perform multidimensional feature extraction on the word segmentation set to obtain a first input set and a third input set, where the first input set includes two preset feature vector sets and the third input set includes at least the first input set;
the set determining module 81 being further configured to take the word segmentation set and the synonym set as a second input set;
an output determining module 83, configured to input the third input set into the second sub-model to obtain a third output result;
and a result determining module 84, configured to input the first input set, the second input set and the third output result into the first sub-model, so that the first sub-model outputs the recognition result of the search text.
The search text recognition apparatus provided in this embodiment may be the apparatus shown in Fig. 8 and may perform all steps of the search text recognition method shown in Fig. 7, thereby achieving its technical effects; for details, refer to the description of Fig. 7, which is omitted here for brevity.
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device 900 shown in Fig. 9 includes: at least one processor 901, a memory 902, at least one network interface 904 and other user interfaces 903. The components of the computer device 900 are coupled together by a bus system 905. It will be appreciated that the bus system 905 is used to enable communications among these components. In addition to a data bus, the bus system 905 includes a power bus, a control bus and a status signal bus; however, for clarity of illustration, the various buses are all labeled as the bus system 905 in Fig. 9.
The user interface 903 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen, etc.).
It will be appreciated that the memory 902 in embodiments of the invention can be volatile memory or nonvolatile memory, or can include both. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 902 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 902 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system 9021 and application programs 9022.
The operating system 9021 includes various system programs, such as a framework layer, a core library layer and a driver layer, for implementing various basic services and processing hardware-based tasks. The application 9022 includes various application programs, such as a media player and a browser, for realizing various application services. A program implementing the method of the embodiment of the present invention may be included in the application 9022.
In the embodiment of the present invention, by calling a program or an instruction stored in the memory 902, specifically, a program or an instruction stored in the application program 9022, the processor 901 is configured to execute method steps provided by each method embodiment, for example, including:
determining a word segmentation set and a synonym set corresponding to a search text to be recognized; performing multidimensional feature extraction on the word segmentation set to obtain a first input set and a third input set, where the first input set includes two preset feature vector sets and the third input set includes at least the first input set; taking the word segmentation set and the synonym set as a second input set; inputting the third input set into the second sub-model to obtain a third output result; and inputting the first input set, the second input set and the third output result into the first sub-model, so that the first sub-model outputs the recognition result of the search text.
In one possible implementation, word segmentation is performed on the search text to obtain a word segmentation set containing a plurality of words and/or phrases; semantic matching is performed on each word and/or phrase in the word segmentation set, and one or more synonyms whose semantic similarity to a word is greater than a set threshold, and/or one or more synonym phrases whose semantic similarity to a phrase is greater than the set threshold, are taken as the synonym set corresponding to the word segmentation set.
In one possible implementation, the first input set includes a first vector set and a second vector set; the third input set includes at least two of: the first vector set, the second vector set, a third vector set, a fourth vector set, or a fifth vector set; the first vector set includes the word-dimension vector representation corresponding to each word in the word segmentation set; the second vector set includes the character-dimension vector representation corresponding to each word in the word segmentation set; the third vector set includes the text-dimension vector representation corresponding to the search text; the fourth vector set includes the text-distribution-dimension vector representation corresponding to the search text; and the fifth vector set includes the word-segmentation-semantics vector representation corresponding to each word in the word segmentation set.
In one possible implementation, the first sub-model is a Wide & Deep model including a neural network part and a linear part;
the first input set and the third output result are input into the neural network part, so that the neural network part outputs the first output result corresponding to the search text; the second input set and the third output result are input into the linear part, so that the linear part outputs the second output result corresponding to the search text; and the first output result and the second output result are taken as the recognition result of the search text.
In one possible implementation, the second sub-model is an Xgboost model, and the third output result output by the Xgboost model includes a probability value corresponding to the search text and leaf node number features corresponding to the word segmentation set and the synonym set;
the first input set is input into the neural network part, so that the neural network part extracts local features of the word segmentation set from the first input set, fuses the local features to obtain global features of the word segmentation set, and combines the global features with the probability value in the third output result to obtain the first output result of the search text.
In one possible implementation, the second input set is input into the linear part, so that the linear part obtains the second output result of the search text according to the word segmentation set and the synonym set, combined with the leaf node number features in the third output result.
In one possible implementation, for any search text in a training sample, word segmentation is performed on the search text to obtain a word segmentation set corresponding to the search text, and word expansion is performed on each word in the word segmentation set to obtain a synonym set corresponding to the word segmentation set; a first input set of the first sub-model is determined according to the word segmentation set, a second input set of the first sub-model is determined according to the word segmentation set and the synonym set, and a third input set of the second sub-model is determined according to the word segmentation set; the third input set is input into the second sub-model so that the second sub-model outputs a corresponding third output result; and the first sub-model is trained using the first input set, the second input set and the third output result, where training of the search text recognition model is determined to be complete when training of the first sub-model is complete.
In one possible implementation, the first sub-model is a Wide & Deep model including a neural network part and a linear part;
inputting the first input set and the third output result to the neural network part, and training the neural network part; and inputting the second input set and the third output result into the linear part, and training the linear part.
In one possible implementation, a loss function is constructed from the first output result of the neural network part and the second output result of the linear part, and the loss function is used to assist in training the first sub-model;
wherein the conditions for the completion of the first sub-model training include: the loss function satisfies a preset convergence condition.
The method disclosed in the above embodiment of the present invention may be applied to, or implemented by, the processor 901. The processor 901 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 901 or by instructions in the form of software. The processor 901 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by such a processor. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software units may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 902, and the processor 901 reads the information in the memory 902 and performs the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field-Programmable Gate Arrays (FPGA), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The computer device provided in this embodiment may be the computer device shown in fig. 9, and may perform all steps of the method for identifying a search text shown in fig. 7, thereby achieving the technical effects of that method; for details, refer to the description of fig. 7, which is omitted here for brevity.
The embodiment of the present invention also provides a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may comprise volatile memory, such as random access memory; it may also comprise non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state disk; or it may comprise a combination of the above types of memory.
When the one or more programs in the storage medium are executed by one or more processors, the above-described method for identifying a search text, performed on the search text recognition device side, is implemented.
The processor is configured to execute a search text recognition program stored in the memory, so as to implement the following steps of the search text recognition method executed on the search text recognition device side:
determining a word segmentation set and a near-synonym set corresponding to a search text to be identified; extracting multi-dimensional features of the word segmentation set to obtain a first input set and a third input set, wherein the first input set comprises two preset feature vector sets, and the third input set at least comprises the first input set; taking the word segmentation set and the near-synonym set as a second input set; inputting the third input set into a second sub-model to obtain a third output result; and inputting the first input set, the second input set, and the third output result into the first sub-model, so that the first sub-model outputs the recognition result of the search text.
In one possible implementation, word segmentation processing is performed on the search text to obtain a word segmentation set containing a plurality of words and/or phrases; semantic matching is then performed on each word and/or phrase in the word segmentation set, and one or more near-synonyms whose semantic similarity to a word exceeds a set threshold, and/or one or more near-synonym phrases whose semantic similarity to a phrase exceeds the set threshold, are taken as the near-synonym set corresponding to the word segmentation set.
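As an illustrative aside, the semantic-matching step described above can be sketched with cosine similarity over pretrained word embeddings. The toy embedding table, tokens, and threshold value below are hypothetical stand-ins; a real system would load vectors from a pretrained model rather than hard-code them.

```python
# Illustrative near-synonym expansion: for each token, keep candidates whose
# cosine similarity exceeds a set threshold. The embedding table is a toy
# stand-in for a real pretrained model (e.g. word2vec).
import numpy as np

EMBED = {  # hypothetical pretrained vectors, normally loaded from disk
    "movie": np.array([0.9, 0.1, 0.0]),
    "film":  np.array([0.88, 0.15, 0.02]),
    "song":  np.array([0.1, 0.9, 0.3]),
    "track": np.array([0.12, 0.85, 0.35]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand(tokens, threshold=0.95):
    """Return the near-synonym set for a word segmentation set."""
    synonyms = set()
    for tok in tokens:
        if tok not in EMBED:
            continue
        for cand, vec in EMBED.items():
            if cand != tok and cos(EMBED[tok], vec) > threshold:
                synonyms.add(cand)
    return synonyms

print(expand(["movie", "song"]))  # e.g. {'film', 'track'}
```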
In one possible implementation, the first input set includes a first vector set and a second vector set, and the third input set includes at least two of: the first vector set, the second vector set, a third vector set, a fourth vector set, or a fifth vector set. The first vector set includes a vector representation of the word dimension corresponding to each word in the word segmentation set; the second vector set includes a vector representation of the character dimension corresponding to each word in the word segmentation set; the third vector set includes a vector representation of the text dimension corresponding to the search text; the fourth vector set includes a vector representation of the text distribution dimension corresponding to the search text; and the fifth vector set includes a vector representation of the word segmentation semantics corresponding to each word in the word segmentation set.
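For illustration, the following sketch builds simplified versions of the first three vector sets (word-dimension, character-dimension, and text-dimension representations) and assembles them into the first and third input sets. The hashing-style featurizers and dimensions are hypothetical placeholders for whatever encoders an implementation actually uses.

```python
# Illustrative featurizers only; real encoders (word2vec, character models,
# etc.) would replace the toy hashing below. Dimensions are arbitrary.
import numpy as np

def word_vectors(tokens, dim=8):
    """First vector set: one word-dimension vector per word in the set."""
    out = np.zeros((len(tokens), dim))
    for i, tok in enumerate(tokens):
        out[i, sum(map(ord, tok)) % dim] = 1.0
    return out

def char_vectors(tokens, dim=8):
    """Second vector set: character-dimension counts per word."""
    out = np.zeros((len(tokens), dim))
    for i, tok in enumerate(tokens):
        for ch in tok:
            out[i, ord(ch) % dim] += 1.0
    return out

def text_vector(text, dim=8):
    """Third vector set: a single text-dimension vector for the whole query."""
    v = np.zeros(dim)
    for ch in text:
        v[ord(ch) % dim] += 1.0
    return v / max(len(text), 1)

tokens = ["free", "movie", "download"]      # a hypothetical word segmentation set
first_input_set = [word_vectors(tokens), char_vectors(tokens)]
third_input_set = first_input_set + [text_vector(" ".join(tokens))]
```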
In one possible embodiment, the first sub-model is a Wide & Deep model comprising a neural network part and a linear part;
the first input set and the third output result are input to the neural network part, so that the neural network part outputs a first output result corresponding to the search text; the second input set and the third output result are input to the linear part, so that the linear part outputs a second output result corresponding to the search text; and the first output result and the second output result are taken together as the recognition result of the search text.
In one possible embodiment, the second sub-model is an XGBoost model, and the third output result output by the XGBoost model comprises a probability value corresponding to the search text and leaf node index features corresponding to the word segmentation set and the near-synonym set;
the first input set is input into the neural network part, so that the neural network part extracts local features of the word segmentation set from the first input set, fuses the local features to obtain a global feature of the word segmentation set, and obtains the first output result for the search text by combining the global feature with the probability value in the third output result.
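A minimal sketch of this deep path, assuming a one-dimensional convolution for local-feature extraction and max-pooling for fusion (both plausible but unstated choices), might look as follows:

```python
# Sketch of the described deep path: extract local features from the token
# vectors, fuse them into a global feature, then combine with the XGBoost
# probability. Layer sizes and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class DeepPath(nn.Module):
    def __init__(self, token_dim=16, channels=32):
        super().__init__()
        self.local = nn.Conv1d(token_dim, channels, kernel_size=3, padding=1)
        self.head = nn.Linear(channels + 1, 1)   # +1 for the probability value
    def forward(self, tokens, prob):
        # tokens: (batch, token_dim, seq_len); prob: (batch, 1)
        local = torch.relu(self.local(tokens))        # local features per position
        fused = local.max(dim=2).values               # fuse -> global feature
        return self.head(torch.cat([fused, prob], dim=1))  # first output result

x = torch.randn(4, 16, 10)       # a batch of segmented queries (toy data)
p = torch.rand(4, 1)             # probability value from the third output result
print(DeepPath()(x, p).shape)    # torch.Size([4, 1])
```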
In one possible implementation, the second input set is input to the linear part, so that the linear part obtains the second output result for the search text from the word segmentation set and the near-synonym set, in combination with the leaf node index features in the third output result.
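By way of illustration, XGBoost exposes exactly the two signals this wide path consumes: a per-query probability and, via its pred_leaf prediction option, the leaf index each sample reaches in every tree. The sketch below derives both and one-hot encodes the leaf indices into sparse wide features; the toy data and hyperparameters are hypothetical.

```python
# Illustrative only: a real system would featurize the word segmentation set
# and near-synonym set instead of using random numbers.
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
X = rng.random((100, 12))
y = rng.integers(0, 2, 100)

booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                    xgb.DMatrix(X, label=y), num_boost_round=20)
dmat = xgb.DMatrix(X)
prob = booster.predict(dmat)                      # probability value per query
leaves = booster.predict(dmat, pred_leaf=True)    # (100, 20) leaf index per tree

# One-hot encode the leaf indices into sparse features for the linear part.
wide_leaf_feats = OneHotEncoder().fit_transform(leaves)
print(prob.shape, leaves.shape, wide_leaf_feats.shape)
```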
In one possible implementation, for any search text in a training sample, word segmentation is performed on the search text to obtain a word segmentation set corresponding to the search text, and synonym-expansion processing is performed on each word in the word segmentation set to obtain a near-synonym set corresponding to the word segmentation set; a first input set of the first sub-model is determined according to the word segmentation set, a second input set of the first sub-model is determined according to the word segmentation set and the near-synonym set, and a third input set of the second sub-model is determined according to the word segmentation set; the third input set is input into the second sub-model so that the second sub-model outputs a corresponding third output result; and the first sub-model is trained using the first input set, the second input set, and the third output result, and training of the search text recognition model is determined to be complete when training of the first sub-model is complete.
In one possible embodiment, the first sub-model is a Wide & Deep model comprising a neural network part and a linear part;
the first input set and the third output result are input to the neural network part to train the neural network part; and the second input set and the third output result are input to the linear part to train the linear part.
In one possible implementation, a loss function is constructed from the first output result of the neural network part and the second output result of the linear part, and the loss function is used to assist in training the first sub-model;
wherein the condition for completion of training of the first sub-model includes: the loss function satisfying a preset convergence condition.
Those of skill in the art will further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the various illustrative units and steps have been described above generally in terms of their functions. Whether such functions are implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functions in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments illustrates the general principles of the invention and is not intended to limit the invention to the particular embodiments disclosed; any modifications, equivalents, improvements, and the like that fall within the spirit and principles of the invention are intended to be included within its scope.

Claims (12)

1. A method for identifying a search text, wherein a search text recognition model comprises a first sub-model and a second sub-model, the first sub-model being a Wide & Deep model comprising a neural network part and a linear part, and the second sub-model being an XGBoost model, wherein a third output result output by the XGBoost model comprises a probability value corresponding to the search text and leaf node index features corresponding to a word segmentation set and a near-synonym set, the method comprising:
determining a word segmentation set and a near-synonym set corresponding to a search text to be identified;
extracting multidimensional features of the word segmentation set to obtain a first input set and a third input set, wherein the first input set comprises two preset feature vector sets, and the third input set at least comprises the first input set;
taking the word segmentation set and the near-synonym set as a second input set;
inputting the third input set into the second sub-model to obtain a third output result;
and inputting the first input set, the second input set and the third output result into the first sub-model so that the first sub-model outputs the recognition result of the search text.
2. The method of claim 1, wherein determining the word segmentation set and the near-synonym set corresponding to the search text to be identified comprises:
performing word segmentation processing on the search text to obtain a word segmentation set containing a plurality of words and/or phrases;
and performing semantic matching on each word and/or phrase in the word segmentation set, and taking one or more near-synonyms whose semantic similarity to a word exceeds a set threshold and/or one or more near-synonym phrases whose semantic similarity to a phrase exceeds the set threshold as the near-synonym set corresponding to the word segmentation set.
3. The method of claim 1, wherein the first input set comprises: a first vector set and a second vector set;
the third input set comprises at least two of: the first vector set, the second vector set, a third vector set, a fourth vector set, or a fifth vector set;
the first vector set comprises: a vector representation of the word dimension corresponding to each word in the word segmentation set;
the second vector set comprises: a vector representation of the character dimension corresponding to each word in the word segmentation set;
the third vector set comprises: a vector representation of the text dimension corresponding to the search text;
the fourth vector set comprises: a vector representation of the text distribution dimension corresponding to the search text;
the fifth vector set comprises: a vector representation of the word segmentation semantics corresponding to each word in the word segmentation set.
4. The method of claim 1, wherein inputting the first input set, the second input set and the third output result into the first sub-model so that the first sub-model outputs the recognition result of the search text comprises:
inputting the first input set and the third output result to the neural network part so that the neural network part outputs a first output result corresponding to the search text;
inputting the second input set and the third output result to the linear part so that the linear part outputs a second output result corresponding to the search text;
and taking the first output result and the second output result as the recognition result of the search text.
5. The method of claim 4, wherein inputting the first input set and the third output result to the neural network part so that the neural network part outputs the first output result corresponding to the search text comprises:
inputting the first input set into the neural network part, so that the neural network part extracts local features of the word segmentation set from the first input set, fuses the local features to obtain a global feature of the word segmentation set, and obtains the first output result for the search text by combining the global feature with the probability value in the third output result.
6. The method of claim 5, wherein inputting the second input set and the third output result to the linear part so that the linear part outputs the second output result corresponding to the search text comprises:
and inputting the second input set into the linear part, so that the linear part obtains the second output result for the search text from the word segmentation set and the near-synonym set, in combination with the leaf node index features in the third output result.
7. The method of claim 1, wherein the search text recognition model is constructed by:
for any search text in a training sample, performing word segmentation on the search text to obtain a word segmentation set corresponding to the search text, and performing synonym-expansion processing on each word in the word segmentation set to obtain a near-synonym set corresponding to the word segmentation set;
determining a first input set of the first sub-model according to the word segmentation set, determining a second input set of the first sub-model according to the word segmentation set and the near-synonym set, and determining a third input set of the second sub-model according to the word segmentation set;
inputting the third input set into the second sub-model so that the second sub-model outputs a corresponding third output result;
and training the first sub-model by using the first input set, the second input set and the third output result, and determining that the training of the search text recognition model is completed when the training of the first sub-model is completed.
8. The method of claim 7, wherein the first sub-model is a Wide & Deep model comprising a neural network part and a linear part;
wherein training the first sub-model using the first input set, the second input set, and the third output result comprises:
inputting the first input set and the third output result to the neural network part, and training the neural network part;
and inputting the second input set and the third output result into the linear part, and training the linear part.
9. The method of claim 8, wherein the method further comprises:
constructing a loss function according to a first output result of the neural network part and a second output result of the linear part;
assisting training of the first sub-model by using the loss function;
wherein the condition for completion of training of the first sub-model comprises: the loss function satisfying a preset convergence condition.
10. A device for identifying a search text, wherein a search text recognition model comprises a first sub-model and a second sub-model, the first sub-model being a Wide & Deep model comprising a neural network part and a linear part, and the second sub-model being an XGBoost model, wherein a third output result output by the XGBoost model comprises a probability value corresponding to the search text and leaf node index features corresponding to a word segmentation set and a near-synonym set, the device comprising:
a set determining module, configured to determine a word segmentation set and a near-synonym set corresponding to a search text to be identified;
an extraction module, configured to extract multi-dimensional features of the word segmentation set to obtain a first input set and a third input set, wherein the first input set comprises two preset feature vector sets, and the third input set at least comprises the first input set;
wherein the set determining module is further configured to take the word segmentation set and the near-synonym set as a second input set;
an output determining module, configured to input the third input set into the second sub-model to obtain a third output result;
and a result determining module, configured to input the first input set, the second input set, and the third output result into the first sub-model, so that the first sub-model outputs the recognition result of the search text.
11. A computer device, comprising a processor and a memory, wherein the processor is configured to execute a program, stored in the memory, for constructing a search text recognition model, so as to implement the method for identifying a search text according to any one of claims 1-9.
12. A storage medium storing one or more programs, wherein the one or more programs are executable by one or more processors to implement the method for identifying a search text according to any one of claims 1-9.
CN202110605909.XA 2021-05-31 2021-05-31 Search text recognition method and device, computer equipment and storage medium Active CN113254587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110605909.XA CN113254587B (en) 2021-05-31 2021-05-31 Search text recognition method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113254587A CN113254587A (en) 2021-08-13
CN113254587B (en) 2023-10-13

Family

ID=77185570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110605909.XA Active CN113254587B (en) 2021-05-31 2021-05-31 Search text recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113254587B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969023A (en) * 2018-09-29 2020-04-07 北京国双科技有限公司 Text similarity determination method and device
WO2020238053A1 (en) * 2019-05-31 2020-12-03 平安科技(深圳)有限公司 Neural grid model-based text data category recognition method and apparatus, nonvolatile readable storage medium, and computer device
CN112035599A (en) * 2020-11-06 2020-12-04 苏宁金融科技(南京)有限公司 Query method and device based on vertical search, computer equipment and storage medium
CN112287072A (en) * 2020-11-20 2021-01-29 公安部第一研究所 Multi-dimensional Internet text risk data identification method
CN112347791A (en) * 2020-11-06 2021-02-09 北京奇艺世纪科技有限公司 Method and system for constructing text matching model, computer equipment and storage medium
CN112579767A (en) * 2019-09-29 2021-03-30 北京搜狗科技发展有限公司 Search processing method and device for search processing
CN112800758A (en) * 2021-04-08 2021-05-14 明品云(北京)数据科技有限公司 Method, system, equipment and medium for distinguishing similar meaning words in text

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7739221B2 (en) * 2006-06-28 2010-06-15 Microsoft Corporation Visual and multi-dimensional search
CN107491534B (en) * 2017-08-22 2020-11-20 北京百度网讯科技有限公司 Information processing method and device
CN110569846A (en) * 2019-09-16 2019-12-13 北京百度网讯科技有限公司 Image character recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113254587A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
US11238845B2 (en) Multi-dialect and multilingual speech recognition
US20210390271A1 (en) Neural machine translation systems
CN108711422B (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
JP6541673B2 (en) Real time voice evaluation system and method in mobile device
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN110795532A (en) Voice information processing method and device, intelligent terminal and storage medium
CN106528694B (en) semantic judgment processing method and device based on artificial intelligence
CN110853628A (en) Model training method and device, electronic equipment and storage medium
JP2020004382A (en) Method and device for voice interaction
CN111341293A (en) Text voice front-end conversion method, device, equipment and storage medium
CN111753524A (en) Text sentence break position identification method and system, electronic device and storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
CN115168541A (en) Chapter event extraction method and system based on frame semantic mapping and type perception
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN116304748A (en) Text similarity calculation method, system, equipment and medium
CN115132209A (en) Speech recognition method, apparatus, device and medium
TW202032534A (en) Voice recognition method and device, electronic device and storage medium
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN113254587B (en) Search text recognition method and device, computer equipment and storage medium
CN113626608B (en) Semantic-enhancement relationship extraction method and device, computer equipment and storage medium
CN111563381A (en) Text processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant