CN111046658A - Out-of-order text recognition method, device and equipment - Google Patents

Out-of-order text recognition method, device and equipment

Info

Publication number
CN111046658A
CN111046658A (application CN201911306126.0A)
Authority
CN
China
Prior art keywords
sequence
text
submodel
text information
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911306126.0A
Other languages
Chinese (zh)
Other versions
CN111046658B (en)
Inventor
孙建举 (Sun Jianju)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911306126.0A priority Critical patent/CN111046658B/en
Publication of CN111046658A publication Critical patent/CN111046658A/en
Application granted granted Critical
Publication of CN111046658B publication Critical patent/CN111046658B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The embodiments of this specification disclose a method, an apparatus, and a device for recognizing out-of-order text. The out-of-order text recognition scheme comprises the following steps: inputting the extracted feature vector of the text to be recognized into a trained sequence-to-sequence model to obtain the feature vector of the sorted text, where both the encoder submodel and the decoder submodel in the trained sequence-to-sequence model are recurrent neural network models; calculating a difference value between the feature vector of the sorted text and the feature vector of the text to be recognized; and when the difference value is greater than a preset value, determining that the text information to be recognized is out-of-order text information.

Description

Out-of-order text recognition method, device and equipment
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, and a device for out-of-order text recognition.
Background
At present, lawbreakers can use random information generators to produce large amounts of false information and use it to make a profit. Making profit from false information in this way has gradually formed a black-market industrial chain. To combat illegal profit-making activities that rely on false information, the false information needs to be identified. Since false information is typically randomly generated out-of-order text, a need arises to identify out-of-order text.
In summary, accurately identifying out-of-order text has become an urgent technical problem to be solved.
Disclosure of Invention
In view of this, one or more embodiments of the present specification provide a method, an apparatus, and a device for recognizing out-of-order text, which are used to accurately recognize out-of-order text.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
The out-of-order text recognition method provided by the embodiments of this specification comprises the following steps:
acquiring text information to be recognized;
performing feature extraction on the text information to be recognized to obtain a feature vector of the text to be recognized;
inputting the feature vector of the text to be recognized into a trained sequence-to-sequence model to obtain the feature vector of the sorted text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model with feature vectors of positive sequence text, the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models;
calculating a difference value between the feature vector of the sorted text and the feature vector of the text to be recognized;
and when the difference value is greater than a preset value, determining that the text information to be recognized is out-of-order text information.
The sequence-to-sequence model training method provided by the embodiments of this specification comprises the following steps:
acquiring a sample set, wherein samples in the sample set are positive sequence text information;
performing feature extraction on each sample in the sample set to obtain a sample feature vector set;
and training an initial sequence-to-sequence model with the sample feature vector set to obtain a trained sequence-to-sequence model, where the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models.
An out-of-order text recognition device provided by the embodiments of this specification includes:
the acquisition module is used for acquiring text information to be recognized;
the feature extraction module is used for performing feature extraction on the text information to be recognized to obtain the feature vector of the text to be recognized;
the sorting module is used for inputting the feature vector of the text to be recognized into the trained sequence-to-sequence model to obtain the feature vector of the sorted text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model with feature vectors of positive sequence text, the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models;
the difference value calculation module is used for calculating the difference value between the feature vector of the sorted text and the feature vector of the text to be recognized;
and the out-of-order text information determining module is used for determining that the text information to be recognized is out-of-order text information when the difference value is greater than a preset value.
An embodiment of this specification provides a training apparatus for a sequence-to-sequence model, including:
the acquisition module is used for acquiring a sample set, wherein samples in the sample set are positive sequence text information;
the feature extraction module is used for performing feature extraction on each sample in the sample set to obtain a sample feature vector set;
and the training module is used for training an initial sequence-to-sequence model with the sample feature vector set to obtain a trained sequence-to-sequence model, where the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models.
An out-of-order text recognition device provided by the embodiments of the present specification includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring text information to be recognized;
performing feature extraction on the text information to be recognized to obtain a feature vector of the text to be recognized;
inputting the feature vector of the text to be recognized into a trained sequence-to-sequence model to obtain the feature vector of the sorted text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model with feature vectors of positive sequence text, the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models;
calculating a difference value between the feature vector of the sorted text and the feature vector of the text to be recognized;
and when the difference value is greater than a preset value, determining that the text information to be recognized is out-of-order text information.
An embodiment of the present specification provides a training apparatus for a sequence-to-sequence model, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring a sample set, wherein samples in the sample set are positive sequence text information;
performing feature extraction on each sample in the sample set to obtain a sample feature vector set;
and training an initial sequence-to-sequence model with the sample feature vector set to obtain a trained sequence-to-sequence model, where the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models.
At least one embodiment of this specification can achieve the following beneficial effects: the extracted feature vector of the text to be recognized is input into the trained sequence-to-sequence model to obtain the feature vector of the sorted text. When the difference value between the feature vector of the sorted text and the feature vector of the text to be recognized is greater than a preset value, the text information to be recognized can be determined to be out-of-order text information. Because both the encoder submodel and the decoder submodel in the trained sequence-to-sequence model are recurrent neural network models, the sorted-text feature vectors produced by the trained sequence-to-sequence model are more accurate, which improves the accuracy of the out-of-order text recognition results obtained from them.
Drawings
The accompanying drawings, which are included to provide a further understanding of one or more embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the embodiments of the disclosure and not to limit the embodiments of the disclosure. In the drawings:
fig. 1 is a schematic flow chart of a method for recognizing out-of-order text according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a sequence-to-sequence model provided in an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a method for training a sequence-to-sequence model according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an out-of-order text recognition apparatus corresponding to fig. 1 provided in an embodiment of the present specification;
FIG. 5 is a schematic structural diagram of a training apparatus corresponding to the sequence-to-sequence model shown in FIG. 3 according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an out-of-order text recognition device corresponding to fig. 1 provided in an embodiment of this specification.
Detailed Description
To make the objects, technical solutions and advantages of one or more embodiments of the present disclosure more apparent, the technical solutions of one or more embodiments of the present disclosure will be described in detail and completely with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present specification, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without making any creative effort fall within the scope of protection of one or more embodiments of the present specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
In the prior art, lawbreakers collect promotional information from financial institutions and merchants and, by registering false accounts or fabricating user information, exchange low or even zero cost for high rewards in commercial marketing activities, which adversely affects the normal business operations of the enterprises involved. Because the user names, address information, mailbox information, and other information used by lawbreakers contain a large amount of out-of-order text, lawbreakers can be identified by recognizing the out-of-order text.
At present, out-of-order text is generally identified by computing statistics over a large amount of forward-order text to obtain the probability that any two characters are adjacent. When recognizing a text, the probability of each character appearing at each position can then be determined, from which the probability that the text is out-of-order is obtained. This recognition method considers only the association between adjacent characters, so the accuracy of its results is poor. Moreover, because statistical training requires a large number of text samples, the method is unsuitable for scenarios with a limited number of samples, giving it poor applicability.
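For ease of understanding, the prior-art baseline described above can be sketched as a simple bigram model. The following Python sketch is illustrative only; the smoothing-free scoring and the character-level tokenization are assumptions, not a reproduction of any particular prior-art system:

    from collections import Counter

    def train_bigram(forward_texts):
        # Count how often each character pair occurs adjacently in
        # forward-order text, together with single-character counts.
        pairs, singles = Counter(), Counter()
        for text in forward_texts:
            for a, b in zip(text, text[1:]):
                pairs[(a, b)] += 1
                singles[a] += 1
        return pairs, singles

    def order_score(text, pairs, singles):
        # Average adjacency probability; a low score suggests out-of-order text.
        if len(text) < 2:
            return 1.0
        probs = [pairs[(a, b)] / singles[a] if singles[a] else 0.0
                 for a, b in zip(text, text[1:])]
        return sum(probs) / len(probs)

As the sketch shows, the score depends only on adjacent character pairs, which is exactly the limitation the embodiments below address.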
In order to solve the defects in the prior art, the scheme provides the following embodiments:
fig. 1 is a schematic flow chart of a method for recognizing out-of-order text according to an embodiment of the present disclosure. From a program perspective, the execution subject of the flow may be a program installed on an application server.
As shown in fig. 1, the process may include the following steps:
step 102: and acquiring text information to be recognized.
In the embodiments of this specification, the type of the text information to be recognized differs according to the application scenario. For example, when applied to the scenario of identifying falsely registered accounts, the text information to be recognized may be the account registration information of a user account, such as the user name, mailbox account identifier, user address, and shop name entered when the user registers the account. When applied to the scenario of claiming resources with false information, the text information to be recognized may be the account information used by the user to obtain a specified resource, such as the user name, user account identifier, contact mailbox, and contact address entered when the user claims a merchant coupon or red packet. The text information to be recognized may therefore contain information in various formats, such as Chinese characters, digits, English words, English letters, and punctuation marks.
Step 104: and extracting the characteristics of the text information to be recognized to obtain the characteristic vector of the text to be recognized.
In the embodiments of this specification, a word vector (word embedding) is a vector of real numbers to which a word or phrase from the vocabulary is mapped. Word vectors are an important foundation of natural language processing and facilitate the analysis of text, sentiment, word sense, and the like. Therefore, the word vectors obtained by performing feature extraction on the text information to be recognized can be used as the feature vector of the text to be recognized.
Step 106: inputting the feature vector of the text to be recognized into a trained sequence to a sequence model to obtain the feature vector of the sequenced text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model by using a feature vector of a positive sequence text, the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models.
In the embodiments of this specification, a sequence-to-sequence model (Seq2Seq model) is a type of encoder-decoder model. Fig. 2 is a schematic structural diagram of a sequence-to-sequence model provided in an embodiment of this specification. As shown in fig. 2, the sequence-to-sequence model may include an encoder submodel 202 and a decoder submodel 204.
For ease of understanding, the operation of the sequence-to-sequence model is illustrated with reference to fig. 2. Suppose the text to be recognized contains three words or characters, whose corresponding feature vectors (i.e., word vectors), in order, are w_0, w_1, and w_2, and suppose the initial hidden state of the encoder submodel 202 is h_0. The hidden state output by the encoder submodel 202 at each time step is h_j = L(w_j, h_{j-1}), where L denotes the expression of the encoder submodel, w_j is the feature vector corresponding to the j-th word or character in the text to be recognized, and h_{j-1} is the hidden state output by the encoder submodel at the previous time step. Arranging the hidden states output by the encoder submodel 202 at each time step in order yields a vector v, which may be expressed as v = (h_1, h_2, h_3).
Suppose the initial output of the decoder submodel 204 is t_0. The input to the decoder submodel 204 at each time step is the vector v together with the output of the decoder submodel 204 at the previous time step. The output of the decoder submodel 204 at each time step is t_j = F(v, t_{j-1}), where F denotes the expression of the decoder submodel, v is the vector obtained by arranging the hidden states output by the encoder submodel at each time step in order, and t_{j-1} is the output of the decoder submodel at the previous time step. Arranging the outputs of the decoder submodel 204 at each time step in order yields an output vector, which may be expressed as T = (t_1, t_2, t_3). This output vector is the feature vector of the sorted text obtained by sorting the text to be recognized.
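For ease of understanding, the recurrences above can be written out directly in code. The following NumPy sketch assumes that L and F are plain tanh recurrent cells and that v concatenates the three hidden states; all weights, sizes, and inputs are illustrative assumptions, not the patent's implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8  # embedding / hidden size (illustrative)

    # Illustrative random weights; a trained model would learn these.
    W_in, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    W_v, W_t = rng.normal(size=(d, 3 * d)), rng.normal(size=(d, d))

    def L(w_j, h_prev):
        # Encoder step: h_j = L(w_j, h_{j-1})
        return np.tanh(W_in @ w_j + W_h @ h_prev)

    def F(v, t_prev):
        # Decoder step: t_j = F(v, t_{j-1})
        return np.tanh(W_v @ v + W_t @ t_prev)

    words = [rng.normal(size=d) for _ in range(3)]  # w_0, w_1, w_2
    h, hidden = np.zeros(d), []
    for w in words:                                 # encoder pass
        h = L(w, h)
        hidden.append(h)
    v = np.concatenate(hidden)                      # v = (h_1, h_2, h_3)

    t, T = np.zeros(d), []                          # t_0
    for _ in range(3):                              # decoder pass
        t = F(v, t)
        T.append(t)
    # T = (t_1, t_2, t_3): the feature vectors of the sorted text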
In the embodiments of this specification, the initial sequence-to-sequence model may be trained in advance with feature vectors of forward-order text, so that the trained sequence-to-sequence model learns the rules and features of forward-order text. After the feature vector of the text to be recognized is input into the trained sequence-to-sequence model, the model processes it according to the learned rules and features of forward-order text, thereby generating a sorted-text feature vector that conforms to those rules and features. That is, the feature vector of the sorted text obtained by inputting the feature vector of the text to be recognized into the trained sequence-to-sequence model can be regarded as the feature vector of the closest forward-order text that the words or characters contained in the text to be recognized could form.
In the embodiments of this specification, both the encoder submodel and the decoder submodel in the sequence-to-sequence model may be implemented with recurrent neural network models. Recurrent neural network models include various algorithm models, such as the recurrent neural network (RNN) and the long short-term memory network (LSTM). The encoder submodel and the decoder submodel in the sequence-to-sequence model in step 106 may be implemented with any recurrent neural network model.
In practical applications, a plain recurrent neural network is prone to exploding gradients because of the long-range dependence problem. The long short-term memory network model introduces a forget gate and converges easily, so it can alleviate the long-range dependence problem. Moreover, the long short-term memory network model can mine the relationships among all sequence elements and is suited to processing and predicting events separated by very long intervals and delays in a time series, which makes it suitable for scenarios in which a text sequence is reordered according to certain rules.
When both the encoder submodel and the decoder submodel in the trained sequence-to-sequence model adopt long short-term memory network models, experiments show that the F1 score of the trained sequence-to-sequence model can reach 0.91, where the F1 score is the harmonic mean of precision and recall. The accuracy and precision of the sorted-text feature vectors produced by the trained sequence-to-sequence model are therefore good, which further improves the effectiveness of the out-of-order text recognition method. Moreover, the trained sequence-to-sequence model can be obtained by training the initial model with only a small number of positive samples, without a large number of positive and negative training samples, so the out-of-order text recognition method is also applicable to scenarios with few training samples, which helps improve its universality.
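As a concrete but hedged illustration of such an architecture, the following PyTorch sketch builds an LSTM encoder submodel and an LSTM decoder submodel. The class name, layer sizes, and the simplification of feeding the input vectors to the decoder in one pass (rather than decoding autoregressively from its own previous outputs) are assumptions made for brevity, not the patent's implementation:

    import torch
    import torch.nn as nn

    class Seq2SeqSorter(nn.Module):
        # LSTM encoder-decoder mapping word vectors to sorted-text word vectors.
        def __init__(self, emb_dim=128, hid_dim=256):
            super().__init__()
            self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.proj = nn.Linear(hid_dim, emb_dim)  # back to embedding space

        def forward(self, x):
            # x: (batch, seq_len, emb_dim) word vectors of the text to recognize
            _, state = self.encoder(x)       # summarize the input sequence
            out, _ = self.decoder(x, state)  # decode conditioned on that summary
            return self.proj(out)            # feature vectors of the sorted text

    model = Seq2SeqSorter()
    vecs = torch.randn(1, 3, 128)  # feature vectors of a 3-token text
    sorted_vecs = model(vecs)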
Step 108: and calculating difference values between the feature vectors of the sorted texts and the feature vectors of the texts to be recognized.
Step 110: and when the difference value is larger than a preset value, determining that the text information to be identified is out-of-order text information.
In the embodiments of this specification, the feature vector of the sorted text can be regarded as the feature vector of the closest forward-order text that the words or characters contained in the text to be recognized could form. Therefore, when the difference value between the feature vector of the sorted text and the feature vector of the text to be recognized is less than or equal to the preset value, the text to be recognized can be considered to conform to the features of forward-order text, i.e., the text information to be recognized can be regarded as forward-order text information. When the difference value between the feature vector of the sorted text and the feature vector of the text to be recognized is greater than the preset value, the text to be recognized can be considered not to conform to the features of forward-order text, and the text information to be recognized can be regarded as out-of-order text information.
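The decision rule of steps 108 and 110 can be sketched directly. The RMSE-style difference and the threshold value below are illustrative assumptions (the specification later names root mean square error as one option for the distance):

    import numpy as np

    def is_out_of_order(orig_vecs, sorted_vecs, threshold=0.5):
        # Flag the text as out-of-order when the sorted feature vectors
        # differ from the original ones by more than the preset value.
        diff = np.sqrt(np.mean((np.asarray(orig_vecs) - np.asarray(sorted_vecs)) ** 2))
        return diff > threshold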
It should be understood that the order of some steps in the method described in one or more embodiments of the present disclosure may be interchanged according to actual needs, or some steps may be omitted or deleted.
In the method of fig. 1, the trained sequence-to-sequence model is used to process the feature vector of the text information to be recognized to obtain the feature vector of the sorted text. Whether the text information to be recognized is out-of-order text information is then identified by calculating the difference value between the feature vector of the sorted text and the feature vector of the text to be recognized and comparing that difference value with a preset value. Because the trained sequence-to-sequence model is obtained by training with feature vectors of forward-order text, and its encoder submodel and decoder submodel can be recurrent neural network models, the sorted-text feature vectors obtained by processing with the trained sequence-to-sequence model are more accurate, which improves the accuracy of the out-of-order text recognition results.
Based on the process of fig. 1, some specific embodiments of the process are also provided in the examples of this specification, which are described below.
In the present embodiment, step 104: the feature extraction of the text information to be recognized may specifically include:
and performing feature extraction on the text information to be recognized by adopting a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model by using positive sequence text information.
In the embodiments of this specification, a Word2vec model may be used to map each word in the text information to be recognized to a feature vector (i.e., a word embedding), which can represent the relationships between words. In practical applications, the initial Word2vec model can be trained in advance with forward-order text information so that the trained Word2vec model learns the relationships between words in forward-order text; the feature vectors of the text to be recognized extracted with the trained Word2vec model are then more accurate, further improving the accuracy of the out-of-order text recognition method in the embodiments of this specification.
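A minimal sketch of this step with the open-source gensim library follows; the character-level tokenization, the toy corpus of forward-order samples, and all parameter values are illustrative assumptions:

    from gensim.models import Word2Vec

    # Forward-order samples, tokenized into characters (illustrative data).
    corpus = [list("alice zhang"), list("bob li"), list("carol wang")]

    # Train the initial Word2vec model on forward-order text only.
    w2v = Word2Vec(sentences=corpus, vector_size=64, window=3, min_count=1)

    def text_to_vectors(text):
        # Map each character of a text to its word vector.
        return [w2v.wv[ch] for ch in text if ch in w2v.wv]

    features = text_to_vectors("bob li")  # feature vectors of the text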
In the embodiments of this specification, there are various types of text information to be recognized, for example names, addresses, and mailboxes. In practical applications, forward-order text samples of one type are used to train an initial Word2vec model, and the trained Word2vec model is used to perform feature extraction on text information to be recognized of the same type as those samples. For example, a trained Word2vec model obtained from forward-order address text samples can be used to extract features from address text information but not from name text information, which ensures the accuracy of the extracted feature vectors of the text to be recognized.
In the present specification embodiment, step 108: calculating a difference value between the feature vector of the sorted text and the feature vector of the text to be recognized, which may specifically include:
and calculating a distance value between the feature vector of the sequenced text and the feature vector of the text to be recognized.
In the embodiments of this specification, there are various ways to calculate the distance value between the feature vector of the sorted text and the feature vector of the text to be recognized, for example the Euclidean distance, the Manhattan distance, or the Minkowski distance. In practical applications, a root mean square error (RMSE) formula can be used to calculate the distance value between the feature vector of the sorted text and the feature vector of the text to be recognized, improving the accuracy of the calculated distance value.
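The distance options named above can be sketched as follows; reducing a pair of stacked feature-vector matrices to a single scalar in this way is an illustrative assumption:

    import numpy as np

    def distance(a, b, metric="rmse"):
        # Distance between two equally shaped stacks of feature vectors.
        d = np.asarray(a) - np.asarray(b)
        if metric == "euclidean":
            return np.sqrt(np.sum(d ** 2))
        if metric == "manhattan":
            return np.sum(np.abs(d))
        if metric == "minkowski":  # order 3, as one example
            return np.sum(np.abs(d) ** 3) ** (1 / 3)
        return np.sqrt(np.mean(d ** 2))  # root mean square error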
In the embodiments of this specification, the type of the text information to be recognized also differs across application scenarios. When applied to the scenario of identifying falsely registered accounts, the text information to be recognized may be account registration information. Step 110: after determining that the text information to be recognized is out-of-order text information, the method may further include:
and generating a use forbidding control instruction to forbid the user from using the account corresponding to the account registration information.
In the embodiments of this specification, when the text information to be recognized is account registration information and that account registration information is determined to be out-of-order text, it may be considered that false registration information was used when the user registered the account. An account registered with false registration information has a high probability of being a false account registered by a lawbreaker, so a use-prohibition control instruction can be generated to suspend the account corresponding to the account registration information. This reduces the probability that lawbreakers perform illegal actions with falsely registered accounts and further safeguards the normal business operations of financial institutions and merchants.
In practical applications, other information about the account corresponding to the account registration information may also be taken into account to finally determine whether that account is a falsely registered account, and the embodiments of this specification do not specifically limit this.
In the embodiments of this specification, when the method is applied to the scenario in which a user obtains a specified resource with a falsely registered account or false account information, the text information to be recognized may be the account information used by the user to obtain the specified resource. Step 110: after determining that the text information to be recognized is out-of-order text information, the method may further include:
and generating a use forbidding control instruction to forbid the user from using the specified resource in the account corresponding to the account information.
In the embodiments of this specification, for ease of understanding, the scenario in which a user obtains a specified resource with a falsely registered account or false account information is illustrated. For example, lawbreakers may purchase large numbers of subscriber identity module (SIM) cards to obtain the usage rights of many mobile phone numbers, and may generate large amounts of false user information with software such as name generators and identity card generators. Lawbreakers then use the mobile phone numbers and the false user information to register large numbers of false accounts on various platforms to obtain the benefits offered by financial institutions and merchants. Alternatively, lawbreakers may directly use fabricated false account information to obtain such offers. This behavior is colloquially known as "wool-pulling". The offers provided by the financial institutions and merchants mentioned above are the specified resources that the user obtains with a falsely registered account or false account information; the specified resources may include, but are not limited to, red packets, vouchers, gifts, mobile data, and the like.
In the embodiments of this specification, when the text information to be recognized is the account information used by the user to obtain the specified resource and that account information is determined to be out-of-order text, it may be considered that the user obtained the specified resource with a falsely registered account or false account information and does not comply with the distribution rules of the specified resource. Accordingly, a use-prohibition control instruction can be generated to prohibit use of the specified resource in the account corresponding to the account information, or issuance of the specified resource to that account can be refused. This reduces the probability that lawbreakers profit from "wool-pulling" with falsely registered accounts or false account information, and further safeguards the normal operation of financial institutions and merchants.
In the embodiments of this specification, after it is determined that the text information to be recognized is out-of-order text information, the method may further include: generating prompt information, where the prompt information is used to indicate that the text information to be recognized is out-of-order text information. This makes it convenient for the application platform to perform subsequent processing on the identified out-of-order text information, improving the practicability of the out-of-order text recognition method.
Fig. 3 is a flowchart illustrating a method for training a sequence-to-sequence model according to an embodiment of the present disclosure. From a program perspective, the execution subject of the flow may be a program installed on an application server.
As shown in fig. 3, the process may include the following steps:
step 302: obtaining a sample set, wherein samples in the sample set are positive sequence text information.
In the embodiments of this specification, the function of the trained sequence-to-sequence model to be obtained differs across application scenarios, and accordingly the type of the samples in the sample set used to train the initial sequence-to-sequence model also differs. For example, when name text needs out-of-order recognition, the trained sequence-to-sequence model is used to sort the name text to be recognized; correspondingly, the samples in the sample set used to train the initial sequence-to-sequence model should be forward-order name text. In the embodiments of this specification, the types of all samples (i.e., the forward-order text information) in one sample set are the same.
Step 304: and performing feature extraction on each sample in the sample set to obtain a sample feature vector set.
In this embodiment, step 304 may specifically include: and performing feature extraction on each sample in the sample set by adopting a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model by using positive sequence text information.
In the embodiments of this specification, the forward-order text information used to train the initial Word2vec model is of the same type as the samples in the sample set. For example, when the samples in the sample set are name texts, the trained Word2vec model used for feature extraction on those samples should be generated by training the initial Word2vec model with forward-order name texts. This ensures the accuracy of the sample feature vectors obtained by performing feature extraction on the samples in the sample set with the trained Word2vec model.
Step 306: training an initial sequence-to-sequence model with the sample feature vector set to obtain a trained sequence-to-sequence model, where the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models.
In the embodiments of this specification, because the feature vectors in the sample feature vector set are extracted from forward-order text information, training the initial sequence-to-sequence model with the sample feature vector set allows the trained sequence-to-sequence model to learn the rules and features of forward-order text. After the feature vector of the text to be recognized is input into the trained sequence-to-sequence model, the model processes it according to the learned rules and features of forward-order text, thereby generating a sorted-text feature vector that conforms to those rules and features. That is, the trained sequence-to-sequence model can be considered to sort the text information to be recognized, yielding the feature vector of the closest forward-order text that the words or characters contained in the text information to be recognized could form.
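A hedged sketch of such a training loop in PyTorch follows. Training the encoder-decoder to reproduce the forward-order feature vectors (a reconstruction-style MSE objective) is an assumption consistent with training on positive samples only; all sizes and hyperparameters are illustrative:

    import torch
    import torch.nn as nn

    enc = nn.LSTM(128, 256, batch_first=True)   # encoder submodel
    dec = nn.LSTM(128, 256, batch_first=True)   # decoder submodel
    proj = nn.Linear(256, 128)
    params = list(enc.parameters()) + list(dec.parameters()) + list(proj.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.MSELoss()

    # Stand-in for the sample feature vector set: (samples, seq_len, emb_dim).
    sample_vecs = torch.randn(256, 3, 128)

    for epoch in range(10):
        opt.zero_grad()
        _, state = enc(sample_vecs)         # encode forward-order samples
        out, _ = dec(sample_vecs, state)    # decode conditioned on the summary
        recon = proj(out)
        # Forward-order input should map (near-)unchanged onto itself.
        loss = loss_fn(recon, sample_vecs)
        loss.backward()
        opt.step()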
In the embodiments of this specification, because word vectors are used to train the initial sequence-to-sequence model, a trained sequence-to-sequence model with good accuracy and precision can be obtained with only a small number of feature vectors (i.e., word vectors) of forward-order text information. The sequence-to-sequence model training method in the embodiments of this specification therefore has good universality: it is suitable not only for scenarios rich in training samples but also for scenarios with few training samples.
In the embodiments of this specification, the encoder submodel and the decoder submodel in the initial sequence-to-sequence model may be implemented with any recurrent neural network model, such as the recurrent neural network (RNN) or the long short-term memory network (LSTM). When both the encoder submodel and the decoder submodel in the initial sequence-to-sequence model adopt long short-term memory network models, experiments show that the F1 score of the trained sequence-to-sequence model can reach 0.91, where the F1 score is the harmonic mean of precision and recall. The trained sequence-to-sequence model therefore has good accuracy and precision, and using it to process the feature vectors of text information to be recognized yields sorted-text feature vectors of good accuracy and precision, improving the effectiveness of the out-of-order text recognition method.
Based on the same idea, the embodiments of the present specification further provide an apparatus corresponding to the method in fig. 1. Fig. 4 is a schematic structural diagram of an out-of-order text recognition apparatus corresponding to fig. 1 according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus may include:
an obtaining module 402, configured to obtain text information to be recognized.
And the feature extraction module 404 is configured to perform feature extraction on the text information to be recognized to obtain a feature vector of the text to be recognized.
The sorting module 406 is configured to input the feature vector of the text to be recognized into a trained sequence-to-sequence model to obtain the feature vector of the sorted text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model with feature vectors of positive sequence text, the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models. Specifically, both the encoder submodel and the decoder submodel may be long short-term memory network models.
A difference value calculating module 408, configured to calculate a difference value between the feature vector of the sorted text and the feature vector of the text to be recognized.
An out-of-order text information determining module 410, configured to determine that the text information to be recognized is out-of-order text information when the difference value is greater than a preset value.
The examples of this specification also provide some specific embodiments of the process based on the apparatus of fig. 4, which are described below.
In the illustrated embodiment, the feature extraction module 404 may be specifically configured to:
and performing feature extraction on the text information to be recognized by adopting a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model by using positive sequence text information.
In an embodiment of the specification, the difference value calculating module 408 may be specifically configured to:
and calculating a distance value between the feature vector of the sequenced text and the feature vector of the text to be recognized.
In an embodiment of the specification, when the text information to be recognized is account registration information, the apparatus may further include:
and the use prohibition control instruction generation module is used for generating a use prohibition control instruction after the text information to be identified is determined to be the disordered text information so as to prohibit the user from using the account corresponding to the account registration information.
In an embodiment of the specification, when the text information to be recognized is account information used by a user to acquire a specified resource, the apparatus in fig. 4 may further include:
and the use prohibition control instruction generation module is used for generating a use prohibition control instruction after the text information to be identified is determined to be the disordered text information so as to prohibit the user from using the specified resource in the account corresponding to the account information.
In an embodiment of the specification, the apparatus in fig. 4 may further include a prompt information generation module. The prompt information generation module may be configured to generate prompt information after determining that the text information to be recognized is out-of-order text information, where the prompt information is used to prompt that the text information to be recognized is out-of-order text information.
Based on the same idea, the embodiment of the present specification further provides an apparatus corresponding to the method in fig. 3. Fig. 5 is a schematic structural diagram of a training apparatus of a sequence-to-sequence model corresponding to fig. 3 provided in an embodiment of the present disclosure. As shown in fig. 5, the apparatus may include:
the obtaining module 502 is configured to obtain a sample set, where a sample in the sample set is positive sequence text information.
A feature extraction module 504, configured to perform feature extraction on each sample in the sample set to obtain a sample feature vector set.
A training module 506, configured to train an initial sequence-to-sequence model using the sample feature vector set to obtain a trained sequence-to-sequence model, where the initial sequence-to-sequence model includes an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models. Specifically, both the encoder submodel and the decoder submodel may be long short-term memory network models.
In this embodiment of the present specification, the feature extraction module 504 may be specifically configured to:
and performing feature extraction on each sample in the sample set by adopting a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model by using positive sequence text information.
Based on the same idea, the embodiment of the present specification further provides a device corresponding to the method in fig. 1.
Fig. 6 is a schematic structural diagram of an out-of-order text recognition device corresponding to fig. 1 provided in an embodiment of this specification. As shown in fig. 6, the apparatus 600 may include:
at least one processor 610; and
a memory 630 communicatively coupled to the at least one processor; wherein
the memory 630 stores instructions 620 executable by the at least one processor 610 to enable the at least one processor 610 to:
and acquiring text information to be recognized.
And performing feature extraction on the text information to be recognized to obtain the feature vector of the text to be recognized.
Inputting the feature vector of the text to be recognized into a trained sequence-to-sequence model to obtain the feature vector of the sorted text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model with feature vectors of positive sequence text, the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models.
And calculating a difference value between the feature vector of the sorted text and the feature vector of the text to be recognized.
And when the difference value is greater than a preset value, determining that the text information to be recognized is out-of-order text information.
Based on the same idea, the embodiment of the present specification further provides a sequence-to-sequence model training device corresponding to the method in fig. 3. The apparatus comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
obtaining a sample set, wherein samples in the sample set are positive sequence text information.
And performing feature extraction on each sample in the sample set to obtain a sample feature vector set.
And training an initial sequence-to-sequence model with the sample feature vector set to obtain a trained sequence-to-sequence model, where the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and both the encoder submodel and the decoder submodel are recurrent neural network models.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In the 1990s, an improvement in a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement in a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement in a method flow). However, as technology has advanced, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement in a method flow cannot be realized with hardware entity modules. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system onto a single PLD themselves, without needing a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is now mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development, and the original code to be compiled must be written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), with VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog currently being the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained merely by lightly programming the method flow into an integrated circuit with one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the same functionality can be implemented entirely by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included within it for performing the various functions may also be regarded as structures within the hardware component. Indeed, means for performing the various functions may even be regarded as both software modules for implementing the method and structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the various elements may be implemented in the same one or more software and/or hardware implementations in implementing one or more embodiments of the present description.
One skilled in the art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
One or more embodiments of the present description are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to one or more embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
One or more embodiments of the present description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner; identical or similar parts may be understood by cross-reference between embodiments, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively brief, and reference may be made to the corresponding parts of the method embodiment for relevant details.
The above description is merely exemplary of the present disclosure and is not intended to limit one or more embodiments of the present disclosure. Various modifications and alterations to one or more embodiments described herein will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of one or more embodiments of the present specification should be included in the scope of claims of one or more embodiments of the present specification.

Claims (20)

1. A method of out-of-order text recognition, comprising:
acquiring text information to be recognized;
performing feature extraction on the text information to be recognized to obtain a feature vector of the text to be recognized;
inputting the feature vector of the text to be recognized into a trained sequence-to-sequence model to obtain a feature vector of sorted text, wherein the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model with feature vectors of in-order text, the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and the encoder submodel and the decoder submodel are both recurrent neural network models;
calculating a difference value between the feature vector of the sorted text and the feature vector of the text to be recognized; and
when the difference value is greater than a preset value, determining that the text information to be recognized is out-of-order text information.
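For illustration only (the patent publishes no code), a minimal Python sketch of the claim-1 flow follows. The helper names extract_features and seq2seq, the Euclidean distance, and the preset value are all assumptions, not the patent's reference implementation:

```python
import numpy as np

def is_out_of_order(text, extract_features, seq2seq, preset_value=1.0):
    """Sketch of claim 1: flag text whose sorted reconstruction
    differs too much from its original feature vector."""
    # Feature vector of the text to be recognized
    # (e.g., from a trained Word2vec model).
    original_vec = extract_features(text)
    # Feature vector of the sorted text, produced by the trained
    # sequence-to-sequence model (encoder + decoder, both RNNs).
    sorted_vec = seq2seq(original_vec)
    # Difference value: Euclidean distance is one possible choice.
    difference = np.linalg.norm(sorted_vec - original_vec)
    # Greater than the preset value -> out-of-order text information.
    return difference > preset_value
```

Intuitively, a model trained only on in-order text reconstructs in-order input with a small difference value, while randomly generated scrambled text reconstructs poorly, so the threshold separates the two.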
2. The method according to claim 1, wherein performing feature extraction on the text information to be recognized specifically comprises:
performing feature extraction on the text information to be recognized using a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model with in-order text information.
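As an illustrative sketch of claim 2 (not the patent's code), feature extraction with a Word2vec model trained on in-order text might look as follows; the use of gensim, the toy corpus, the averaging of word vectors, and all parameter values are assumptions:

```python
import numpy as np
from gensim.models import Word2Vec

# Train an initial Word2vec model on in-order (well-formed) text only.
in_order_corpus = [["please", "reset", "my", "password"],
                   ["the", "order", "shipped", "yesterday"]]
w2v = Word2Vec(sentences=in_order_corpus, vector_size=64, min_count=1)

def text_to_vector(tokens, model):
    """One common choice: average the word vectors of known tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)
```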
3. The method according to claim 1, wherein calculating the difference value between the feature vector of the sorted text and the feature vector of the text to be recognized specifically comprises:
calculating a distance value between the feature vector of the sorted text and the feature vector of the text to be recognized.
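A minimal sketch of the distance value in claim 3; Euclidean and cosine distances are two plausible choices, and neither is mandated by the claim:

```python
import numpy as np

def distance_value(sorted_vec, original_vec, metric="euclidean"):
    if metric == "euclidean":
        return float(np.linalg.norm(sorted_vec - original_vec))
    # Cosine distance: 1 minus cosine similarity.
    cos = np.dot(sorted_vec, original_vec) / (
        np.linalg.norm(sorted_vec) * np.linalg.norm(original_vec))
    return float(1.0 - cos)
```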
4. The method of claim 1, wherein the text information to be recognized is account registration information, and after determining that the text information to be recognized is out-of-order text information, the method further comprises:
generating a use-prohibition control instruction to prohibit a user from using the account corresponding to the account registration information.
5. The method of claim 1, wherein the text information to be recognized is account information used by a user to obtain a specified resource, and after determining that the text information to be recognized is out-of-order text information, the method further comprises:
generating a use-prohibition control instruction to prohibit the user from using the specified resource in the account corresponding to the account information.
6. The method of claim 1, further comprising, after determining that the text information to be recognized is out-of-order text information:
generating prompt information, wherein the prompt information is used to indicate that the text information to be recognized is out-of-order text information.
7. The method of claim 1, wherein the encoder submodel and the decoder submodel are both long short-term memory (LSTM) network models.
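Purely as an illustration of claim 7, an LSTM encoder submodel and decoder submodel could be wired as a sequence-to-sequence reconstruction model in PyTorch as follows; the layer sizes and the autoencoder-style setup are assumptions:

```python
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    """Encoder and decoder submodels, both LSTMs (claim 7)."""
    def __init__(self, feat_dim=64, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        # x: (batch, seq_len, feat_dim) word feature vectors.
        _, state = self.encoder(x)           # summarize the input sequence
        dec_out, _ = self.decoder(x, state)  # decode conditioned on it
        return self.out(dec_out)             # feature vectors of sorted text
```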
8. A method of sequence-to-sequence model training, comprising:
acquiring a sample set, wherein the samples in the sample set are in-order text information;
performing feature extraction on each sample in the sample set to obtain a set of sample feature vectors; and
training an initial sequence-to-sequence model with the set of sample feature vectors to obtain a trained sequence-to-sequence model, wherein the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and the encoder submodel and the decoder submodel are both recurrent neural network models.
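A hedged sketch of the training method of claim 8, reusing the Seq2SeqAutoencoder sketched above; the MSE reconstruction loss and optimizer settings are assumptions, since the claim only requires training on feature vectors extracted from in-order text:

```python
import torch
import torch.nn as nn

def train_seq2seq(model, sample_feature_vectors, epochs=10, lr=1e-3):
    """sample_feature_vectors: tensor of shape (num_samples, seq_len,
    feat_dim), extracted from in-order text samples."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        reconstruction = model(sample_feature_vectors)
        # The model learns to reproduce in-order feature vectors, so
        # out-of-order input later reconstructs poorly.
        loss = loss_fn(reconstruction, sample_feature_vectors)
        loss.backward()
        optimizer.step()
    return model
```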
9. The method according to claim 8, wherein performing feature extraction on each sample in the sample set specifically comprises:
performing feature extraction on each sample in the sample set using a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model with in-order text information.
10. The method of claim 8, wherein the encoder submodel and the decoder submodel are both long short-term memory (LSTM) network models.
11. An out-of-order text recognition apparatus comprising:
an acquisition module configured to acquire text information to be recognized;
a feature extraction module configured to perform feature extraction on the text information to be recognized to obtain a feature vector of the text to be recognized;
a sorting module configured to input the feature vector of the text to be recognized into a trained sequence-to-sequence model to obtain a feature vector of sorted text, wherein the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model with feature vectors of in-order text, the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and the encoder submodel and the decoder submodel are both recurrent neural network models;
a difference calculation module configured to calculate a difference value between the feature vector of the sorted text and the feature vector of the text to be recognized; and
an out-of-order text determination module configured to determine, when the difference value is greater than a preset value, that the text information to be recognized is out-of-order text information.
12. The apparatus of claim 11, wherein the feature extraction module is specifically configured to:
perform feature extraction on the text information to be recognized using a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model with in-order text information.
13. The apparatus of claim 11, wherein the text information to be recognized is account registration information, the apparatus further comprising:
a use-prohibition control instruction generation module configured to generate, after the text information to be recognized is determined to be out-of-order text information, a use-prohibition control instruction to prohibit a user from using the account corresponding to the account registration information.
14. The apparatus of claim 11, wherein the text information to be recognized is account information used by a user to obtain a specified resource, the apparatus further comprising:
a use-prohibition control instruction generation module configured to generate, after the text information to be recognized is determined to be out-of-order text information, a use-prohibition control instruction to prohibit the user from using the specified resource in the account corresponding to the account information.
15. The apparatus of claim 11, wherein the encoder submodel and the decoder submodel are both long short-term memory (LSTM) network models.
16. A sequence-to-sequence model training apparatus, comprising:
an acquisition module configured to acquire a sample set, wherein the samples in the sample set are in-order text information;
a feature extraction module configured to perform feature extraction on each sample in the sample set to obtain a set of sample feature vectors; and
a training module configured to train an initial sequence-to-sequence model with the set of sample feature vectors to obtain a trained sequence-to-sequence model, wherein the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and the encoder submodel and the decoder submodel are both recurrent neural network models.
17. The apparatus of claim 16, wherein the feature extraction module is specifically configured to:
perform feature extraction on each sample in the sample set using a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model with in-order text information.
18. The apparatus of claim 16, wherein the encoder submodel and the decoder submodel are both long short-term memory (LSTM) network models.
19. An out-of-order text recognition device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquire text information to be recognized;
perform feature extraction on the text information to be recognized to obtain a feature vector of the text to be recognized;
input the feature vector of the text to be recognized into a trained sequence-to-sequence model to obtain a feature vector of sorted text, wherein the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model with feature vectors of in-order text, the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and the encoder submodel and the decoder submodel are both recurrent neural network models;
calculate a difference value between the feature vector of the sorted text and the feature vector of the text to be recognized; and
when the difference value is greater than a preset value, determine that the text information to be recognized is out-of-order text information.
20. A sequence-to-sequence model training apparatus comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquire a sample set, wherein the samples in the sample set are in-order text information;
perform feature extraction on each sample in the sample set to obtain a set of sample feature vectors; and
train an initial sequence-to-sequence model with the set of sample feature vectors to obtain a trained sequence-to-sequence model, wherein the initial sequence-to-sequence model comprises an encoder submodel and a decoder submodel, and the encoder submodel and the decoder submodel are both recurrent neural network models.
CN201911306126.0A 2019-12-18 2019-12-18 Method, device and equipment for recognizing disorder text Active CN111046658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911306126.0A CN111046658B (en) 2019-12-18 2019-12-18 Method, device and equipment for recognizing disorder text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911306126.0A CN111046658B (en) 2019-12-18 2019-12-18 Method, device and equipment for recognizing disorder text

Publications (2)

Publication Number Publication Date
CN111046658A true CN111046658A (en) 2020-04-21
CN111046658B CN111046658B (en) 2023-05-09

Family

ID=70237113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911306126.0A Active CN111046658B (en) 2019-12-18 2019-12-18 Method, device and equipment for recognizing disorder text

Country Status (1)

Country Link
CN (1) CN111046658B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766019A (en) * 2014-01-08 2015-07-08 中兴通讯股份有限公司 Webpage text information protection method, device and system
CN110162767A (en) * 2018-02-12 2019-08-23 北京京东尚科信息技术有限公司 The method and apparatus of text error correction
CN108573048A (en) * 2018-04-19 2018-09-25 中译语通科技股份有限公司 A kind of multidimensional data cut-in method and system, big data access system
CN110134940A (en) * 2019-02-27 2019-08-16 中国科学院电工研究所 A kind of training text identification model, the method and device of Text Coherence

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618863A (en) * 2022-12-20 2023-01-17 中国科学院自动化研究所 Text event sequence generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111046658B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN110413877B (en) Resource recommendation method and device and electronic equipment
CN107808098B (en) Model safety detection method and device and electronic equipment
JP6894058B2 (en) Hazardous address identification methods, computer-readable storage media, and electronic devices
CN109447469B (en) Text detection method, device and equipment
CN112199506B (en) Information detection method, device and equipment for application program
CN110956275B (en) Risk prediction and risk prediction model training method and device and electronic equipment
CN110263157B (en) Data risk prediction method, device and equipment
CN109190007B (en) Data analysis method and device
CN107391545B (en) Method for classifying users, input method and device
CN111353488A (en) Method, device and equipment for identifying risks in code image
CN110033092B (en) Data label generation method, data label training device, event recognition method and event recognition device
CN109492401B (en) Content carrier risk detection method, device, equipment and medium
CN113887206A (en) Model training and keyword extraction method and device
CN111046658A (en) Out-of-order text recognition method, device and equipment
CN110008252B (en) Data checking method and device
CN115640810A (en) Method, system and storage medium for identifying communication sensitive information of power system
CN115759122A (en) Intention identification method, device, equipment and readable storage medium
CN114511376A (en) Credit data processing method and device based on multiple models
CN114707180A (en) Log desensitization method and device
CN113435950A (en) Bill processing method and device
CN112287130A (en) Searching method, device and equipment for graphic questions
CN112001662B (en) Method, device and equipment for risk detection of merchant image
CN115423485B (en) Data processing method, device and equipment
CN113935748A (en) Screening method, device, equipment and medium for sampling inspection object
CN112101308B (en) Method and device for combining text boxes based on language model and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant