CN112287669B - Text processing method and device, computer equipment and storage medium

Info

Publication number: CN112287669B
Authority: CN (China)
Prior art keywords: text, word, word vector, entity, training
Legal status: Active
Application number: CN202011574328.6A
Other languages: Chinese (zh)
Other versions: CN112287669A
Inventors: 郑哲, 李松如, 张秋实, 刘云峰
Current and original assignee: Shenzhen Zhuiyi Technology Co Ltd
Application filed by Shenzhen Zhuiyi Technology Co Ltd; priority to CN202011574328.6A
Publication of CN112287669A, followed by grant and publication of CN112287669B

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text processing method and apparatus, a computer device, and a storage medium. The method includes: acquiring a first training text; if the length of the first training text is greater than a length threshold, removing semantically irrelevant words from the first training text to obtain a processed text; obtaining a word vector for each word segment in the processed text to form an initial word vector sequence; computing statistics over the initial word vector sequence to obtain a statistical word vector corresponding to each word removed from the first training text; inserting the statistical word vector into the initial word vector sequence at the position of the removed word in the first training text to obtain a target word vector sequence; and training a text recognition model according to the target word vector sequence to obtain a trained text recognition model. The method improves the recognition accuracy of the text recognition model.

Description

Text processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular to a text processing method and apparatus, a computer device, and a storage medium.
Background
With the development of science and technology, text recognition is required in many scenarios, for example recognizing the intention expressed by a text or translating a text.
In conventional approaches, a text recognition model is trained on training texts and their corresponding labels, and the trained model is then used to recognize new texts. However, the accuracy of the trained text recognition model is often low.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a text processing method, apparatus, computer device, and storage medium that address the above technical problem.
A text processing method, the method comprising: acquiring a first training text; if the length of the first training text is greater than a length threshold, removing semantically irrelevant words from the first training text to obtain a processed text; obtaining a word vector for each word segment in the processed text to form an initial word vector sequence; computing statistics over the initial word vector sequence to obtain a statistical word vector corresponding to each word removed from the first training text; inserting the statistical word vector into the initial word vector sequence at the position of the removed word in the first training text to obtain a target word vector sequence; and training a text recognition model according to the target word vector sequence to obtain a trained text recognition model.
In some embodiments, removing the semantically irrelevant words from the first training text to obtain the processed text when the length of the first training text is greater than the length threshold includes: if the length of the first training text is greater than the length threshold, acquiring a target entity in the first training text, where the target entity includes at least one of the head entity of the first training text or the tail entity of the first training text; treating the end words corresponding to the target entity in the first training text as semantically irrelevant words that do not satisfy the semantic requirement; and removing the semantically irrelevant words from the first training text to obtain the processed text.
In some embodiments, treating the end words corresponding to the target entity in the first training text as semantically irrelevant words includes at least one of the following steps: if the target entity includes the head entity of the first training text, treating the words before the head entity as semantically irrelevant words that do not satisfy the semantic requirement; and if the target entity includes the tail entity of the first training text, treating the words after the tail entity as semantically irrelevant words that do not satisfy the semantic requirement.
In some embodiments, computing statistics over the initial word vector sequence to obtain the statistical word vector corresponding to a removed word in the first training text includes: performing statistics on the vector values at the same position across the word vectors of the initial word vector sequence to obtain a statistic for each vector dimension; and determining the vector value of the corresponding dimension of the statistical word vector from the statistic of that dimension, thereby obtaining the statistical word vector.
In some embodiments, the statistics include a mean and a standard deviation, and determining the vector value of the corresponding dimension of the statistical word vector from the statistic of that dimension includes: multiplying the standard deviation by the target coefficient of the corresponding vector dimension to obtain a product; and subtracting the product from the mean to obtain the vector value of the corresponding dimension of the statistical word vector.
In some embodiments, obtaining the target coefficient includes: obtaining, for each vector dimension, the number of vector values of the initial word vector sequence that fall within each numerical range; and determining the target coefficient according to the distribution counts of the numerical ranges.
In some embodiments, the word vectors corresponding to the word segments in the processed text are obtained from a trained word vector model, and training the word vector model includes: acquiring a second training text and a proprietary entity in the second training text; using a preset word conditional probability as the word conditional probability between the proprietary entity and a corresponding neighboring entity, where a neighboring entity of the proprietary entity is an entity in the second training text whose distance from the proprietary entity is less than a distance threshold; and training the word vector model according to the word conditional probability and the second training text to obtain the trained word vector model.
A text processing apparatus, the apparatus comprising: a first training text acquisition module, configured to acquire a first training text; a processed text obtaining module, configured to remove semantically irrelevant words from the first training text to obtain a processed text if the length of the first training text is greater than a length threshold; an initial word vector sequence obtaining module, configured to obtain a word vector for each word segment in the processed text to form an initial word vector sequence; a statistical word vector obtaining module, configured to compute statistics over the initial word vector sequence to obtain a statistical word vector corresponding to each removed word in the first training text; a target word vector sequence obtaining module, configured to insert the statistical word vector into the initial word vector sequence at the position of the removed word in the first training text to obtain a target word vector sequence; and a training module, configured to train a text recognition model according to the target word vector sequence to obtain a trained text recognition model.
In some embodiments, the processed text obtaining module comprises: a target entity obtaining unit, configured to obtain a target entity in the first training text if the length of the first training text is greater than a length threshold, where the target entity includes at least one of the head entity of the first training text or the tail entity of the first training text; a semantically irrelevant word obtaining unit, configured to treat the end words corresponding to the target entity in the first training text as semantically irrelevant words that do not satisfy the semantic requirement; and a removing unit, configured to remove the semantically irrelevant words from the first training text to obtain the processed text.
In some embodiments, the semantically irrelevant word obtaining unit is configured to perform at least one of the following steps: if the target entity includes the head entity of the first training text, treating the words before the head entity as semantically irrelevant words that do not satisfy the semantic requirement; and if the target entity includes the tail entity of the first training text, treating the words after the tail entity as semantically irrelevant words that do not satisfy the semantic requirement.
In some embodiments, the statistical word vector obtaining module comprises: a statistic obtaining unit, configured to perform statistics on the vector values at the same position across the word vectors of the initial word vector sequence to obtain a statistic for each vector dimension; and a statistical word vector obtaining unit, configured to determine the vector value of the corresponding dimension of the statistical word vector from the statistic of that dimension, thereby obtaining the statistical word vector.
In some embodiments, the statistics include a mean and a standard deviation, and the statistical word vector obtaining unit is configured to: multiply the standard deviation by the target coefficient of the corresponding vector dimension to obtain a product; and subtract the product from the mean to obtain the vector value of the corresponding dimension of the statistical word vector.
In some embodiments, the module for obtaining the target coefficient is configured to: obtain, for each vector dimension, the number of vector values of the initial word vector sequence that fall within each numerical range; and determine the target coefficient according to the distribution counts of the numerical ranges.
In some embodiments, the word vectors corresponding to the word segments in the processed text are obtained from a trained word vector model, and the module for training the word vector model includes: a proprietary entity acquisition module, configured to acquire a second training text and a proprietary entity in the second training text; a word conditional probability obtaining module, configured to use a preset word conditional probability as the word conditional probability between the proprietary entity and a corresponding neighboring entity, where a neighboring entity of the proprietary entity is an entity in the second training text whose distance from the proprietary entity is less than a distance threshold; and a word vector model obtaining module, configured to train the word vector model according to the word conditional probability and the second training text to obtain the trained word vector model.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the text processing method described above. In some embodiments, the computer program further implements the steps of any of the foregoing method embodiments; these are identical to the method embodiments above and are not repeated here.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the text processing method described above. In some embodiments, the computer program further implements the steps of any of the foregoing method embodiments; these are likewise not repeated here.
According to the text processing method and apparatus, computer device, and storage medium, for a training text whose length exceeds the length threshold, the semantically irrelevant words in the training text are removed to obtain a processed text, reducing the number of irrelevant words. This avoids the situation in which sentences share similar irrelevant words but differ in the words that carry the substantive meaning, so that their word vector sequences appear similar and the text recognition model cannot accurately learn to distinguish the sentences. Statistical word vectors for the removed words are derived from the word vectors of the retained words, so that the target word vector sequence still conforms to the semantic information of the text. Training on the target word vector sequence therefore improves the accuracy of the trained text recognition model.
Drawings
FIG. 1 is a diagram of an application environment of a text processing method in some embodiments;
FIG. 2 is a flow diagram that illustrates a method for text processing in some embodiments;
FIG. 3 is a flowchart illustrating the step of obtaining a statistical word vector corresponding to a removed word in the first training text by computing statistics over the initial word vector sequence, in some embodiments;
FIG. 4 is a schematic flow chart of the training word vector model step in some embodiments;
FIG. 5 is a block diagram of a text processing apparatus in some embodiments;
FIG. 6 is a block diagram of the processed text obtaining module in some embodiments;
FIG. 7 is a block diagram of modules that train a word vector model in some embodiments;
FIG. 8 is a diagram of the internal structure of a computer device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text processing method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may send a model-training instruction to the server, and the server 104 executes the text processing method of the present application in response to the instruction to obtain the trained text recognition model and returns a prompt message to the terminal 102 indicating that model training is complete. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device, and the server 104 may be implemented by an independent server or a cluster of servers.
In some embodiments, as shown in FIG. 2, a text processing method is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:
Step S202: a first training text is obtained.
A training text is a text used for model training. The language of the first training text may be determined according to actual needs; for example, the first training text may be a Chinese sentence or a Japanese sentence. A first training text may include a plurality of word segments (at least two), which are obtained by segmenting the first training text. The segmentation may be dictionary-based or statistics-based. For example, if the first training text is 'today is a sunny day', the resulting word segmentation sequence may be 'today / is / sunny day'.
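For illustration, a minimal segmentation sketch follows, using the open-source jieba tokenizer as one possible dictionary-based segmenter; the method does not prescribe any particular tool.

    # Minimal sketch of word segmentation for a training text.
    import jieba

    first_training_text = "今天是晴天"  # "today is a sunny day"
    word_segments = jieba.lcut(first_training_text)
    print(word_segments)  # e.g. ['今天', '是', '晴天'] -> "today / is / sunny day"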
Step S204: if the length of the first training text is greater than the length threshold, remove the semantically irrelevant words from the first training text to obtain a processed text.
The length of the first training text can be measured by the number of word segments it contains. The length threshold may be set as needed, for example to 30. Semantically irrelevant words are words that have little or no effect on the semantics expressed by the first training text; for example, an auxiliary word may be a semantically irrelevant word. If the length of the first training text is not greater than the length threshold, the word vectors corresponding to the word segments of the first training text can be obtained directly to form the target word vector sequence.
In some embodiments, the processed text may be obtained by truncating the first training text. For example, the positions of the entities in the first training text are obtained, and the words before the first entity and/or the words after the last entity are treated as semantically irrelevant words and truncated.
In some embodiments, the word segments of the first training text may be compared against a lexicon of semantically irrelevant words, and the matching word segments are treated as semantically irrelevant words, as sketched below.
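A minimal sketch of this lexicon comparison, with an assumed lexicon of auxiliary words:

    # Sketch: drop any word segment that appears in a lexicon of semantically
    # irrelevant words (the lexicon contents are illustrative assumptions).
    SEMANTICALLY_IRRELEVANT = {"的", "了", "啊"}  # e.g. Chinese auxiliary words

    def remove_irrelevant(word_segments):
        return [w for w in word_segments if w not in SEMANTICALLY_IRRELEVANT]

    print(remove_irrelevant(["今天", "的", "天气"]))  # ['今天', '天气']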
In some embodiments, a word that does not satisfy a semantic requirement is treated as a semantically irrelevant word. The semantic requirement may be set as needed; for example, it may specify particular word types, so that words of those types, which carry no semantics, are treated as semantically irrelevant words.
In some embodiments, if the length of the first training text is greater than the length threshold, a target entity in the first training text is obtained, where the target entity includes at least one of the head entity of the first training text or the tail entity of the first training text; the end words corresponding to the target entity in the first training text are then treated as semantically irrelevant words that do not satisfy the semantic requirement.
An entity is a thing with a specific meaning and may include at least one of proper nouns such as person names, place names, or organization names. The head entity is the first entity in the training text, and the tail entity is the last entity in the training text. A training text may include one or more entities. For example, if the training text is 'apply for a credit card of the Bank of China', the entities may include 'Bank of China' and 'credit card'. End words are the words at the ends of the text, i.e., at its front or its back.
Specifically, at least one of the head entity or the tail entity of the first training text may be obtained; the words before the head entity and the words after the tail entity are treated as end words, and the end words are removed so that the middle part of the text remains as the processed text.
In some embodiments, treating the end words corresponding to the target entity in the first training text as semantically irrelevant words that do not satisfy the semantic requirement includes at least one of the following steps: if the target entity includes the head entity of the first training text, treating the words before the head entity as semantically irrelevant words; and if the target entity includes the tail entity of the first training text, treating the words after the tail entity as semantically irrelevant words.
Specifically, the server may perform at least one of removing the words before the head entity or removing the words after the tail entity. For example, if the first training text is 'ABCDEFG' with head entity B and tail entity G, the processed text is 'BCDEFG'. That is, for the head entity, the words before it are the end words; for the tail entity, the words after it are the end words.
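A minimal sketch of this truncation follows; the entity spans would come from any named-entity recognizer, which is assumed here rather than implemented.

    # Sketch of head/tail-entity truncation: keep tokens from the head entity
    # through the tail entity; words outside are end words and are removed.
    from typing import List, Tuple

    def truncate_by_entities(tokens: List[str],
                             entity_spans: List[Tuple[int, int]]) -> List[str]:
        if not entity_spans:
            return tokens
        head_start = min(start for start, _ in entity_spans)
        tail_end = max(end for _, end in entity_spans)
        return tokens[head_start:tail_end]

    # "ABCDEFG" with head entity B (index 1) and tail entity G (index 6):
    print("".join(truncate_by_entities(list("ABCDEFG"), [(1, 2), (6, 7)])))  # BCDEFG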
Step S206: obtain the word vector corresponding to each word segment in the processed text to obtain an initial word vector sequence.
A word segment is obtained by segmenting the text. A word vector may be a distributed representation or a one-hot encoding. Word vectors may be obtained in advance by training a word vector model; for example, the word vector model may be a word2vec model, word2vec being a tool for converting words into vectors.
Specifically, the server may segment the processed text into a word segmentation sequence, obtain the word vector representing each word segment, and arrange the word vectors into the initial word vector sequence according to their order in the processed text.
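As one concrete possibility, the word vectors can come from a gensim word2vec model (the text names word2vec as a possible word vector model); the toy corpus below is an illustrative assumption.

    # Sketch of step S206: look up a word vector for each segment of the
    # processed text and arrange them in order.
    from gensim.models import Word2Vec

    toy_corpus = [["apply", "for", "a", "credit", "card"]]  # illustrative corpus
    w2v = Word2Vec(toy_corpus, vector_size=8, window=2, min_count=1)

    processed_segments = ["apply", "credit", "card"]
    initial_word_vector_sequence = [w2v.wv[seg] for seg in processed_segments]
    print(len(initial_word_vector_sequence), initial_word_vector_sequence[0].shape)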
Step S208: compute statistics over the initial word vector sequence to obtain a statistical word vector corresponding to each removed word in the first training text.
Here, computing statistics means deriving values that characterize the initial word vector sequence, such as at least one of the mean, the standard deviation, or the median. A statistical word vector that satisfies the distribution of the initial word vector sequence can be obtained by analyzing that distribution; for example, the Gaussian distribution followed by the initial word vector sequence may be estimated, and a statistical word vector satisfying it derived, e.g., from the obtained mean and standard deviation.
Specifically, after obtaining the initial word vector sequence, the server computes statistics over it to obtain a statistical word vector that satisfies the statistical rule, and uses this vector as the word vector of a word removed from the first training text.
Step S210: add the statistical word vector into the initial word vector sequence according to the position of the removed word in the first training text to obtain a target word vector sequence.
Specifically, after the statistical word vectors are obtained, each is added at the position of the corresponding removed word in the first training text, yielding the target word vector sequence. As a practical example, suppose the first training text is 'ABCDE', where each letter represents one word segment, and A and E are removed, giving an initial word vector sequence for BCD composed of vector 2, vector 3, and vector 4. If the statistical word vector for A is vector 1 and the statistical word vector for E is vector 5, the target word vector sequence is vector 1, vector 2, vector 3, vector 4, and vector 5 arranged in order.
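The re-insertion can be sketched as follows, reusing the ABCDE example; the vector contents are placeholders.

    # Sketch of step S210: re-insert statistical word vectors at the positions
    # of the removed words to form the target word vector sequence.
    import numpy as np

    def rebuild_sequence(kept_vectors, removed_positions, stat_vectors,
                         total_length):
        target = [None] * total_length
        for pos, vec in zip(removed_positions, stat_vectors):
            target[pos] = vec                 # statistical vector at removed slot
        kept = iter(kept_vectors)
        for i in range(total_length):
            if target[i] is None:
                target[i] = next(kept)        # kept vectors stay in original order
        return target

    # "ABCDE" with A (position 0) and E (position 4) removed:
    kept = [np.array([2.0]), np.array([3.0]), np.array([4.0])]  # B, C, D
    stats = [np.array([1.0]), np.array([5.0])]                  # for A and E
    print(rebuild_sequence(kept, [0, 4], stats, 5))  # vectors 1..5 in order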
Step S212: train the text recognition model according to the target word vector sequence to obtain a trained text recognition model.
The text recognition model may include a pre-trained language model and a task model configured for the text recognition task. For example, the pre-trained language model may be a BERT (Bidirectional Encoder Representations from Transformers) model. Before training on the downstream task, the pre-trained BERT model can be fine-tuned, and the task model is configured according to specific needs, for example as a model for classifying the sentiment of a text or a model for translation.
Specifically, the target word vector sequence is input into the pre-trained language model for feature extraction, producing a semantic feature vector sequence, and the task model performs recognition based on this sequence to produce a predicted recognition result, such as a translation or a text classification. A model loss value is then computed from the difference between the standard recognition result and the predicted recognition result, the loss value being positively correlated with the difference. The model parameters are adjusted by gradient descent according to the loss value, i.e., in the direction that decreases the loss, until the model converges, yielding the trained text recognition model. Convergence may mean that the model loss value is smaller than a preset loss value. The standard recognition result can be regarded as the correct processing result and may be called a label.
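The training loop described above can be sketched as follows. This is a minimal, hedged PyTorch illustration only: the tiny Transformer encoder stands in for the pre-trained language model, the linear layer for the task model, and all names, shapes, and hyperparameters are illustrative assumptions rather than values fixed by the method.

    # Minimal PyTorch sketch of step S212 (illustrative assumptions only).
    import torch
    import torch.nn as nn

    encoder = nn.TransformerEncoder(                    # stand-in for the
        nn.TransformerEncoderLayer(d_model=8, nhead=2,  # pre-trained
                                   batch_first=True),   # language model
        num_layers=1)
    task_head = nn.Linear(8, 3)                         # e.g. 3-way classification
    optimizer = torch.optim.SGD(
        list(encoder.parameters()) + list(task_head.parameters()), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    target_word_vectors = torch.randn(1, 5, 8)  # (batch, sequence, dim) stand-in
    label = torch.tensor([1])                   # standard recognition result

    for step in range(100):                     # iterate until convergence
        optimizer.zero_grad()
        semantic_features = encoder(target_word_vectors)   # feature extraction
        logits = task_head(semantic_features.mean(dim=1))  # fuse, then recognize
        loss = loss_fn(logits, label)   # difference from the standard result
        loss.backward()                 # gradient of the model loss value
        optimizer.step()                # adjust parameters to decrease the loss
        if loss.item() < 0.01:          # preset loss value: treat as converged
            break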
In some embodiments, during feature extraction the pre-trained language model encodes the target word vector sequence into a semantic feature vector sequence that includes a semantic vector corresponding to each word vector; each semantic vector is obtained by fusing the semantic information of the word vectors in the sequence.
In the text processing method, for a training text whose length exceeds the length threshold, the semantically irrelevant words are removed to obtain a processed text, reducing the number of irrelevant words. This avoids the situation in which sentences share similar irrelevant words but differ in the words that carry the substantive meaning, so that their word vector sequences appear similar and the text recognition model cannot learn to distinguish the sentences. Statistical word vectors for the removed words are derived from the word vectors of the retained words, so that the target word vector sequence conforms to the characteristics of the original text. Training on the target word vector sequence therefore improves the accuracy of the trained text recognition model.
In some embodiments, as shown in FIG. 3, computing statistics over the initial word vector sequence to obtain a statistical word vector corresponding to a removed word in the first training text includes:
Step S302: perform statistics on the vector values at the same position in each word vector of the initial word vector sequence to obtain a statistic for each vector dimension.
Specifically, a word vector is multidimensional; it may be, for example, 8-dimensional. The word vectors in the word vector sequence all have the same dimensionality, so the values of each dimension are aggregated to obtain the statistic of that dimension, which may be, for example, at least one of a mean or a standard deviation. For the first dimension, for instance, the value of each word vector in the initial word vector sequence in that dimension can be obtained, and the values are summed and averaged to give the mean as the statistic of the first dimension.
Step S304: determine the vector value of the corresponding dimension of the statistical word vector for the removed word according to the statistic of that vector dimension, thereby obtaining the statistical word vector.
Specifically, after the statistic is obtained, it may be used directly as the vector value, or a further calculation may be performed on it to obtain the vector value. For example, if the statistic is a mean, the mean may be used as the vector value; if the statistics include a mean and a standard deviation, the vector value may be derived from both.
For example, assuming the statistic is the mean and the initial word vector sequence contains two vectors, the two vectors can be added and the sum divided by 2 to obtain the statistical word vector. Adding the vectors (1, 2, 3) and (2, 3, 4) gives (3, 5, 7), and dividing by 2 gives the statistical word vector (1.5, 2.5, 3.5).
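The same per-dimension computation can be written down directly; the following NumPy sketch reproduces the example above and also computes the per-dimension standard deviation used in the next embodiment.

    # NumPy sketch of steps S302/S304: per-dimension statistics over the
    # kept word vectors (each row is one word vector of the initial sequence).
    import numpy as np

    initial_word_vectors = np.array([[1.0, 2.0, 3.0],
                                     [2.0, 3.0, 4.0]])
    dim_mean = initial_word_vectors.mean(axis=0)  # per-dimension mean
    dim_std = initial_word_vectors.std(axis=0)    # per-dimension standard deviation
    print(dim_mean)  # [1.5 2.5 3.5], matching the worked example
    print(dim_std)   # [0.5 0.5 0.5]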
In some embodiments, the statistics include a mean and a standard deviation. The standard deviation is multiplied by the target coefficient of the corresponding vector dimension to obtain a product, and the product is subtracted from the mean to obtain the statistical value. The target coefficient may be preset, for example to 1.5, or derived from the distribution of the vector values of that dimension: the number of vector values of the initial word vector sequence falling within each numerical range is counted, and the target coefficient of the dimension is determined from these distribution counts.
Specifically, each vector dimension may have its own target coefficient. The mean and standard deviation of the vector values of the dimension are calculated, the number of values falling within each numerical range is counted, and the coefficient corresponding to the range with the smallest count is taken as the target coefficient. A range can be defined by its distance from the mean in units of the standard deviation: for mean u and standard deviation a, u-a to u is one range and u-2a to u-a is another. If the range u-2a to u-a has the smallest count, the target coefficient may be the average of the coefficients 1 and 2, i.e. 1.5, so the statistical value is u-1.5a. Because the statistical value obtained by subtracting the product of the standard deviation and the target coefficient from the mean conforms to the Gaussian distribution of the initial word vector sequence, the statistical word vector fits the semantics of the text more closely.
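A sketch of this rule follows. The two distance bands and the resulting coefficient of 1.5 mirror the u-1.5a example above; they are illustrative assumptions, not values fixed by the method.

    # Sketch: pick the target coefficient from the least-populated band of
    # standard deviations below the mean, then compute mean - coefficient * std.
    import numpy as np

    def pick_target_coefficient(values: np.ndarray) -> float:
        mean, std = values.mean(), values.std()
        bands = [(0.0, 1.0), (1.0, 2.0)]  # (u-a, u) and (u-2a, u-a)
        counts = [((values >= mean - hi * std) & (values < mean - lo * std)).sum()
                  for lo, hi in bands]
        lo, hi = bands[int(np.argmin(counts))]
        return (lo + hi) / 2.0  # e.g. (1 + 2) / 2 = 1.5 for the u-2a..u-a band

    def statistical_value(values: np.ndarray) -> float:
        coef = pick_target_coefficient(values)
        return values.mean() - coef * values.std()  # subtract product from mean

    print(statistical_value(np.array([0.2, 0.5, 0.9, 1.1, 1.4])))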
In some embodiments, a Gaussian density function may instead be fitted to the initial word vector sequence, and vector values satisfying it may be randomly generated to form the statistical word vector, so that the generated statistical word vector follows the Gaussian distribution of the initial word vector sequence. For example, if a dimension of the initial word vector sequence has mean 0 and variance 1, the randomly generated value may be 0.1, 0.001234, or 0.010334.
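A minimal sketch of this sampling variant, assuming the per-dimension Gaussian is parameterized by the sample mean and standard deviation:

    # Sketch: draw each dimension of the statistical word vector from the
    # Gaussian fitted to the kept word vectors.
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_statistical_vector(initial_word_vectors: np.ndarray) -> np.ndarray:
        mean = initial_word_vectors.mean(axis=0)
        std = initial_word_vectors.std(axis=0)
        return rng.normal(loc=mean, scale=std)  # one draw per dimension

    vectors = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
    print(sample_statistical_vector(vectors))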
In some embodiments, as shown in FIG. 4, the word vector corresponding to each word segment in the processed text is obtained from a word vector model, and training the word vector model includes:
step S402, a second training text is obtained, and a special entity in the second training text is obtained.
A proprietary entity is a preset specific entity, which may be a specific term from a particular professional field, for example 'platinum credit card' or 'black credit card'. The second training text may be the same as or different from the first training text.
Specifically, word vectors are obtained by training the word vector model. The word vector model may already have been preliminarily trained before training with the second training text; for example, it may have been trained on general-domain texts so that its word vectors fit the general domain. When the word vector model is to be used in a professional field, such as the financial field, training must continue on training texts from that field so that the model adapts to it; the second training text is therefore a text from the professional field, for example the financial field.
In some embodiments, the proprietary entity may be an entity in a professional domain; for example, in the financial domain, proprietary entities may include 'bank' and 'credit card'.
Step S404, obtaining a preset word conditional probability as a word conditional probability between the proprietary entity and a corresponding neighboring entity, where the neighboring entity corresponding to the proprietary entity is an entity in the second training text whose distance from the proprietary entity is less than a distance threshold.
A word conditional probability is the probability of one word occurring given another word in the same text: for words a and b, the probability of word a occurring given that word b has already occurred in the text is the conditional probability of a with respect to b. The distance threshold may be set as needed, for example to 2, i.e., an entity less than 2 words away from the proprietary entity is regarded as its neighboring entity. The preset word conditional probability may be configured in advance, and may be the same for a given proprietary entity across its neighboring entities: for example, the preset conditional probability between the proprietary entity 'platinum credit card' and the neighboring entity 'China Merchants Bank' is one fifth, and that between 'platinum credit card' and the neighboring entity 'Bank of China' is also one fifth.
Step S406: train the word vector model according to the word conditional probability and the second training text to obtain the trained word vector model.
The word vector model is trained based on conditional probabilities; it may be, for example, a word2vec model. Such a model is usually trained on the conditional probabilities between words counted from a corpus, so the conditional probabilities associated with a proprietary entity would differ from corpus to corpus. However, the semantics of a proprietary entity depend mainly on the types of its neighboring entities and only weakly on which specific neighboring entity occurs. Setting a preset word conditional probability, i.e., using the preset probability for proprietary entities instead of the corpus statistics, therefore makes the training of the word vector model more accurate. It will be appreciated that for non-proprietary entities, the word vector model can still be trained with the corpus conditional probabilities.
In some embodiments, the preset word conditional probability is used as the word conditional probability between the proprietary entity and a neighboring entity only if the entity type of the neighboring entity is a preset type; otherwise, the corpus conditional probability is used. For example, for the proprietary entity 'platinum credit card', if the corresponding neighboring entity is a bank-type entity such as 'China Merchants Bank', the preset word conditional probability is used; if the corresponding neighboring entity is, say, 'a shop', the corpus conditional probability is used. In this way, the word vector of a proprietary entity does not vary greatly across different entities of the same preset type, which improves the accuracy of the word vector.
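The selection logic can be sketched as a simple lookup. The sets and the one-fifth value mirror the examples above and are illustrative assumptions only.

    # Sketch: preset conditional probability for proprietary entities next to
    # preset-type neighbors; corpus statistics otherwise (all values illustrative).
    PRESET_WORD_COND_PROB = 0.2  # "one fifth" from the example above
    PROPRIETARY_ENTITIES = {"platinum credit card", "black credit card"}
    PRESET_NEIGHBOR_TYPES = {"bank"}

    def word_conditional_probability(entity: str, neighbor_type: str,
                                     corpus_prob: float) -> float:
        if entity in PROPRIETARY_ENTITIES and neighbor_type in PRESET_NEIGHBOR_TYPES:
            return PRESET_WORD_COND_PROB
        return corpus_prob

    # Bank-type neighbor -> preset probability; shop-type -> corpus probability.
    print(word_conditional_probability("platinum credit card", "bank", 0.07))  # 0.2
    print(word_conditional_probability("platinum credit card", "shop", 0.07))  # 0.07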
In some embodiments, after the trained text recognition model is obtained, it may be used to recognize texts. For example, a conversation message may be recognized, e.g., encoded, and a reply message to the conversation message generated based on the encoded vector.
In some embodiments, when generating the reply message, the conversation intention corresponding to the conversation message may also be acquired and an intention word vector representing the intention obtained; the intention word vector is input into the text recognition model for recognition of the conversation message to obtain an encoded vector, which is then decoded to generate the reply message of the conversation message.
In some embodiments, the conversation message is obtained by speech recognition of conversation speech. A set of behavior features of the speaker may be obtained from the speaker's posture information, and a set of speech features from the conversation speech; the features of the behavior feature set are combined with the features of the speech feature set to obtain combined features, and the conversation intention of the speaker is determined based on the combined features, for example by inputting them into an intention recognition model.
It should be understood that, although the steps in the above flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated otherwise, the steps may be performed in other orders. Moreover, at least some of the steps may comprise multiple sub-steps or stages, which are not necessarily performed at the same time or in sequence; they may be performed at different times, alternately, or interleaved with other steps or with sub-steps or stages of other steps.
In some embodiments, as shown in FIG. 5, a text processing apparatus is provided, the apparatus comprising:
a first training text obtaining module 502, configured to obtain a first training text;
a processed text obtaining module 504, configured to remove semantic irrelevant words in the first training text to obtain a processed text if the length of the first training text is greater than a length threshold;
an initial word vector sequence obtaining module 506, configured to obtain a word vector corresponding to each participle in the processed text, so as to obtain an initial word vector sequence;
a statistical word vector obtaining module 508, configured to perform statistics according to the initial word vector sequence to obtain a statistical word vector corresponding to a removed word in the first training text;
a target word vector sequence obtaining module 510, configured to add the statistical word vector to the initial word vector sequence according to the position of the removed word in the first training text, to obtain a target word vector sequence;
and the training module 512 is configured to train the text recognition model according to the target word vector sequence to obtain a trained text recognition model.
In some embodiments, as shown in FIG. 6, the processed text obtaining module comprises:
a target entity obtaining unit 602, configured to obtain a target entity in the first training text if the length of the first training text is greater than a length threshold, where the target entity includes at least one of the head entity of the first training text or the tail entity of the first training text;
a semantically irrelevant word obtaining unit 604, configured to treat the end words corresponding to the target entity in the first training text as semantically irrelevant words that do not satisfy the semantic requirement;
and a removing unit 606, configured to remove the semantically irrelevant words from the first training text to obtain the processed text.
In some embodiments, the semantically irrelevant word obtaining unit is configured to perform at least one of the following steps: if the target entity includes the head entity of the first training text, treating the words before the head entity as semantically irrelevant words that do not satisfy the semantic requirement; and if the target entity includes the tail entity of the first training text, treating the words after the tail entity as semantically irrelevant words that do not satisfy the semantic requirement.
In some embodiments, the statistical word vector obtaining module comprises: a statistic obtaining unit, configured to perform statistics on the vector values at the same position across the word vectors of the initial word vector sequence to obtain a statistic for each vector dimension; and a statistical word vector obtaining unit, configured to determine the vector value of the corresponding dimension of the statistical word vector from the statistic of that dimension, thereby obtaining the statistical word vector.
In some embodiments, the statistics include a mean and a standard deviation, and the statistical word vector obtaining unit is configured to: multiply the standard deviation by the target coefficient of the corresponding vector dimension to obtain a product; and subtract the product from the mean to obtain the vector value of the corresponding dimension of the statistical word vector.
In some embodiments, the module for obtaining the target coefficient is configured to: obtain, for each vector dimension, the number of vector values of the initial word vector sequence that fall within each numerical range; and determine the target coefficient according to the distribution counts of the numerical ranges.
In some embodiments, as shown in FIG. 7, the word vector corresponding to each word segment in the processed text is obtained from a trained word vector model, and the module for training the word vector model includes:
a proprietary entity acquisition module 702, configured to acquire a second training text and a proprietary entity in the second training text;
a word conditional probability obtaining module 704, configured to use a preset word conditional probability as the word conditional probability between the proprietary entity and a corresponding neighboring entity, where a neighboring entity of the proprietary entity is an entity in the second training text whose distance from the proprietary entity is less than a distance threshold;
and a word vector model obtaining module 706, configured to train the word vector model according to the word conditional probability and the second training text to obtain the trained word vector model.
For the specific definition of the text processing apparatus, refer to the definition of the text processing method above; details are not repeated here. The modules in the text processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware in, or independent of, the processor of the computer device, or stored as software in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In some embodiments, a computer device is provided, which may be a server whose internal structure may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing text processing data, such as training samples. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a text processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of part of the structure associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In some embodiments, there is provided a computer device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, performs the steps of: acquiring a first training text; if the length of the first training text is larger than the length threshold, removing semantic irrelevant words in the first training text to obtain a processed text; acquiring word vectors corresponding to all participles in the processed text to obtain an initial word vector sequence; counting according to the initial word vector sequence to obtain a statistical word vector corresponding to a removed word in the first training text; adding the statistical word vector into the initial word vector sequence according to the position of the removed word in the first training text to obtain a target word vector sequence; and training the text recognition model according to the target word vector sequence to obtain the trained text recognition model.
In some embodiments, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of: acquiring a first training text; if the length of the first training text is larger than the length threshold, removing semantic irrelevant words in the first training text to obtain a processed text; acquiring word vectors corresponding to all participles in the processed text to obtain an initial word vector sequence; counting according to the initial word vector sequence to obtain a statistical word vector corresponding to a removed word in the first training text; adding the statistical word vector into the initial word vector sequence according to the position of the removed word in the first training text to obtain a target word vector sequence; and training the text recognition model according to the target word vector sequence to obtain the trained text recognition model.
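Putting the claimed steps together, a compact sketch of preparing the target word vector sequence for one training sample might look as follows; word_vec stands in for the trained word vector model, and all names are illustrative rather than taken from the patent.

```python
import numpy as np

def build_target_sequence(processed_tokens, removed_positions, word_vec, target_coeffs):
    """Embed the processed text, compute the statistical word vector, and
    re-insert it at the original positions of the removed words to obtain
    the target word vector sequence used to train the text recognition model."""
    initial = np.stack([word_vec[t] for t in processed_tokens])  # initial word vector sequence
    stat_vec = initial.mean(axis=0) - target_coeffs * initial.std(axis=0)
    sequence = list(initial)
    for pos in sorted(removed_positions):   # ascending, so earlier inserts keep indices valid
        sequence.insert(pos, stat_vec)      # one statistical vector per removed word
    return np.stack(sequence)               # target word vector sequence
```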
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as the combined features do not contradict one another.
The above embodiments express only several implementations of the present application, and although they are described specifically and in detail, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of text processing, the method comprising:
acquiring a first training text;
if the length of the first training text is larger than a preset length threshold, removing semantic irrelevant words in the first training text to obtain a processed text;
obtaining word vectors corresponding to all the participles in the processed text to obtain an initial word vector sequence;
counting according to the initial word vector sequence to obtain a statistical word vector corresponding to a removed word in the first training text;
adding the statistical word vector into the initial word vector sequence according to the position of the removed word in the first training text to obtain a target word vector sequence;
and training the text recognition model according to the target word vector sequence to obtain the trained text recognition model.
2. The method according to claim 1, wherein if the length of the first training text is greater than a preset length threshold, removing semantic irrelevant words in the first training text to obtain a processed text comprises:
if the length of the first training text is larger than a preset length threshold, acquiring a target entity in the first training text, wherein the target entity comprises at least one of a head entity in the first training text or a tail entity in the first training text;
taking an end word corresponding to the target entity in the first training text as a semantic irrelevant word which does not meet semantic requirements, wherein the end word is a word positioned at the front end or the rear end of the first training text;
and removing the semantic irrelevant words in the first training text to obtain a processed text.
3. The method according to claim 2, wherein the taking of the end word corresponding to the target entity in the first training text as a semantic irrelevant word which does not meet semantic requirements comprises at least one of the following steps:
if the target entity comprises a head entity in the first training text, taking the words located before the head entity in the first training text as semantic irrelevant words which do not meet semantic requirements;
and if the target entity comprises a tail entity in the first training text, taking the words located behind the tail entity in the first training text as semantic irrelevant words which do not meet semantic requirements.
4. The method of claim 1, wherein the performing statistics according to the initial word vector sequence to obtain a statistical word vector corresponding to a removed word in the first training text comprises:
counting vector values at the same position in each word vector of the initial word vector sequence to obtain a statistical value corresponding to each vector dimension;
and determining the vector value of the corresponding dimension in the statistical word vector according to the statistical value corresponding to the vector dimension to obtain the statistical word vector.
5. The method of claim 4, wherein the statistical values comprise a mean value and a standard deviation, and the determining the vector value of the corresponding dimension in the statistical word vector according to the statistical value corresponding to the vector dimension comprises:
multiplying the standard deviation by a target coefficient of a corresponding vector dimension to obtain a product;
and subtracting the product from the mean value to obtain the vector value of the corresponding dimension in the statistical word vector.
6. The method of claim 5, wherein the step of obtaining the target coefficient comprises:
acquiring the distribution quantity of vector values corresponding to the vector dimensions in each numerical range in the initial word vector sequence;
and determining the target coefficient according to the distribution quantity of each numerical range.
7. The method of claim 1, wherein the word vector corresponding to each participle in the processed text is obtained based on a trained word vector model, and wherein the step of training the word vector model comprises:
acquiring a second training text, and acquiring a special entity in the second training text;
acquiring a preset word conditional probability as the word conditional probability between the proprietary entity and a corresponding neighboring entity, wherein the neighboring entity corresponding to the proprietary entity is an entity in the second training text whose distance from the proprietary entity is smaller than a distance threshold;
and training a word vector model according to the word conditional probability and the second training text to obtain a trained word vector model.
8. A text processing apparatus, characterized in that the apparatus comprises:
the first training text acquisition module is used for acquiring a first training text;
a processed text obtaining module, configured to remove semantic irrelevant words in the first training text to obtain a processed text if the length of the first training text is greater than a length threshold;
an initial word vector sequence obtaining module, configured to obtain a word vector corresponding to each participle in the processed text, and obtain an initial word vector sequence;
a statistical word vector obtaining module, configured to perform statistics according to the initial word vector sequence to obtain a statistical word vector corresponding to a removed word in the first training text;
a target word vector sequence obtaining module, configured to add the statistical word vector to the initial word vector sequence according to the position of the removed word in the first training text, so as to obtain a target word vector sequence;
and the training module is used for training the text recognition model according to the target word vector sequence to obtain the trained text recognition model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011574328.6A 2020-12-28 2020-12-28 Text processing method and device, computer equipment and storage medium Active CN112287669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011574328.6A CN112287669B (en) 2020-12-28 2020-12-28 Text processing method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112287669A CN112287669A (en) 2021-01-29
CN112287669B true CN112287669B (en) 2021-05-25

Family

ID=74426447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011574328.6A Active CN112287669B (en) 2020-12-28 2020-12-28 Text processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112287669B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399230A (en) * 2018-02-13 2018-08-14 上海大学 A kind of Chinese financial and economic news file classification method based on convolutional neural networks
CN108804521A (en) * 2018-04-27 2018-11-13 南京柯基数据科技有限公司 A kind of answering method and agricultural encyclopaedia question answering system of knowledge based collection of illustrative plates

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9953246B2 (en) * 2014-12-16 2018-04-24 The Regents Of The University Of California Feature-preserving noise removal
CN106055613A (en) * 2016-05-26 2016-10-26 华东理工大学 Cleaning method for data classification and training databases based on mixed norm
CN107609121B (en) * 2017-09-14 2021-03-30 暨南大学 News text classification method based on LDA and word2vec algorithm
CN108932318B (en) * 2018-06-26 2022-03-04 四川政资汇智能科技有限公司 Intelligent analysis and accurate pushing method based on policy resource big data
CN111563209B (en) * 2019-01-29 2023-06-30 株式会社理光 Method and device for identifying intention and computer readable storage medium
CN110321925B (en) * 2019-05-24 2022-11-18 中国工程物理研究院计算机应用研究所 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN110910864B (en) * 2019-10-24 2023-02-03 深圳追一科技有限公司 Training sample selection method and device, computer equipment and storage medium
CN111274412A (en) * 2020-01-22 2020-06-12 腾讯科技(深圳)有限公司 Information extraction method, information extraction model training device and storage medium

Also Published As

Publication number Publication date
CN112287669A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN110765763B (en) Error correction method and device for voice recognition text, computer equipment and storage medium
CN110598206B (en) Text semantic recognition method and device, computer equipment and storage medium
CN111108501B (en) Context-based multi-round dialogue method, device, equipment and storage medium
CN111444723B (en) Information extraction method, computer device, and storage medium
CN111695352A (en) Grading method and device based on semantic analysis, terminal equipment and storage medium
CN111222305B (en) Information structuring method and device
CN111191032B (en) Corpus expansion method, corpus expansion device, computer equipment and storage medium
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
CN112613308A (en) User intention identification method and device, terminal equipment and storage medium
CN112651238A (en) Training corpus expansion method and device and intention recognition model training method and device
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN110377733B (en) Text-based emotion recognition method, terminal equipment and medium
CN111091004B (en) Training method and training device for sentence entity annotation model and electronic equipment
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN115544240B (en) Text sensitive information identification method and device, electronic equipment and storage medium
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN113240510A (en) Abnormal user prediction method, device, equipment and storage medium
CN112232070A (en) Natural language processing model construction method, system, electronic device and storage medium
CN112560506A (en) Text semantic parsing method and device, terminal equipment and storage medium
CN115146068A (en) Method, device and equipment for extracting relation triples and storage medium
CN113449081A (en) Text feature extraction method and device, computer equipment and storage medium
CN112287669B (en) Text processing method and device, computer equipment and storage medium
WO2021217619A1 (en) Label smoothing-based speech recognition method, terminal, and medium
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
JP3692399B2 (en) Notation error detection processing apparatus using supervised machine learning method, its processing method, and its processing program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant