Embodiment
This specification proposes a text correction method in which text is corrected using a text correction model obtained through machine learning. The text correction model may be a seq2seq (Sequence-to-Sequence) model. Text that the seq2seq model can correct includes, but is not limited to, the names of various objects (such as places, people, and companies) and query terms used for searching. Each standard text may correspond to multiple non-standard texts. A standard text may be a predefined standard expression, and a non-standard text may be a variation in which some characters of the standard expression have been changed. For example, if a standard text is "Luck did better than Huan", the non-standard texts corresponding to it may be "Luck did better then Huan" or "Luck do better than Huan". In practical text recognition scenarios, it is desirable to recognize non-standard texts produced by rewriting, misspelling, and the like as the corresponding standard texts, so as to achieve a higher text recognition rate.
Fig. 1 shows the structure of a text correction model according to an exemplary embodiment. As shown in Fig. 1, the text correction model includes an encoding (Encoder) network and a decoding (Decoder) network. The encoding network and the decoding network may each be a recurrent neural network (RNN), for example a long short-term memory (LSTM) network. The input of the encoding network may be a feature vector (x1, x2, ..., xn) corresponding to the input text (the text to be recognized), where x1, x2, ..., xn may each represent one character of the input text. The encoding network may be used to encode the input text into a fixed-length vector that serves as the input of the decoding network. The decoding network may be used to decode the output of the encoding network and output a vector (y1, y2, ..., ym), from which the standard text can finally be determined, where y1, y2, ..., ym may each represent one character of the standard text. An RNN generally includes several nodes, each of which computes an output from its input, and the output of each node depends on the output of the preceding node (the output of the preceding node serves as an input of the following node). In practice, an RNN can process text of arbitrary length (that is, the values of n and m need not be fixed).
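The recurrence described above can be illustrated with a minimal sketch. It is not the model of Fig. 1: the function name `rnn_encode`, the hidden size, and the fixed weights are all hypothetical, and a trained network would learn its weights. The sketch only shows that each node reuses the previous node's output and that inputs of any length n are folded into a vector of fixed size.

```python
import math

def rnn_encode(xs, hidden_size=4):
    """Fold a variable-length sequence of numbers into a fixed-size state."""
    # Hypothetical fixed weights; a trained network would learn these.
    w_in = [0.01 * (j + 1) for j in range(hidden_size)]
    w_rec = [[0.1 if j == k else 0.05 for k in range(hidden_size)]
             for j in range(hidden_size)]
    h = [0.0] * hidden_size
    for x in xs:  # one "node" per input element
        # The new state depends on the current input and the previous state.
        h = [math.tanh(w_in[j] * x / 128.0
                       + sum(w_rec[j][k] * h[k] for k in range(hidden_size)))
             for j in range(hidden_size)]
    return h

short_state = rnn_encode([98, 97])
long_state = rnn_encode([98, 97, 105, 100, 111])
print(len(short_state), len(long_state))  # both equal hidden_size
```

Both calls return a list of `hidden_size` values even though the inputs differ in length, which is the property the encoding network relies on.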
The distinguishing feature of an LSTM is that it adds to the algorithm a "processor" that judges whether information is useful; this processor is called a cell. A cell generally contains three gates: an input gate, a forget gate, and an output gate. When a piece of information enters the LSTM network, rules determine whether it is useful. Only information that passes this check is retained; information that does not is discarded through the forget gate. Although the working principle amounts to nothing more than one input and two outputs, applying it repeatedly solves a long-standing major problem in neural networks. LSTM has been shown to be an effective technique for the long-sequence dependency problem, and it is highly general, so it admits many variations. Researchers have proposed their own LSTM variants, which allow LSTM to handle a wide variety of domain-specific problems. LSTM is well known to those skilled in the art and is not described in further detail here.
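As an illustration of the three gates, the following is a minimal sketch of one step of a scalar LSTM cell in its commonly used formulation. The function `lstm_step`, the parameter dictionary, and all weight values are hypothetical and chosen only for demonstration; the specification does not prescribe a particular LSTM variant.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p holds hypothetical weights."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old state, admit new information
    h = o * math.tanh(c)     # expose a gated view of the cell state
    return h, c

names = ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")
params = {name: 0.5 for name in names}

h, c = 0.0, 0.0
for x in (0.2, -0.4, 0.9):
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is driven toward zero (for example, by a strongly negative `bf`), the previous cell state no longer influences the new one, which corresponds to information being discarded through the forget gate.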
The process of obtaining the above text correction model through machine learning is introduced first. In one embodiment, the process of training the text correction model may include the following steps 10 to 30:
Step 10: Obtain a sample set that includes several sample pairs, where each sample pair includes one non-standard text and one standard text.
For example, the obtained sample set is as follows (where X denotes a non-standard text and Y denotes a standard text):
Step 20: For each sample pair, use an encoding rule to convert the non-standard text into a first encoding vector and the standard text into a second encoding vector.
In one embodiment, an encoding rule corresponding to the character type of the text may be selected to determine the encoding vector. In one embodiment, when the character type of a sample is detected to be Chinese, a Chinese character encoding rule may be used to encode each Chinese character in the text, and the resulting code sequence is then vectorized; if the character type of a sample is detected to be non-Chinese (for example, English), the ASCII encoding rule is used to encode the non-Chinese characters in the text, and the resulting code sequence is then vectorized. Here, Chinese character encoding refers to the character codes that represent Chinese characters in a computer. Examples of Chinese character encoding rules include the one-hot encoding rule, the Chinese internal code encoding rule, the Chinese national standard code encoding rule, and the zone-position code encoding rule. Encoding Chinese characters with a Chinese character encoding rule and non-Chinese characters with the ASCII encoding rule makes it possible to correct certain unconventional words. An unconventional word may be defined as a word that is not included in a dictionary, such as "daueoeo". Some text correction scenarios require correcting personal names, place names, and the like, which are often user-defined words (that is, unconventional words); this limits the effectiveness of text correction. The encoding rules above can effectively improve the correction result.
When a code sequence is vectorized, a vectorization rule corresponding to the character type may be used for each character type. Specifically, every N bits (N ≥ 1) of the code string may be converted, in order, into one value of the feature vector.
Step 30: Train the encoding network and the decoding network using the first encoding vector and the second encoding vector to obtain a text correction model that includes the encoding network and the decoding network.
Specifically, the sample set used to train the model includes several sample pairs (X, Y). Let Xi denote a non-standard text (the input of the encoding network) and Yi denote a standard text (the output of the decoding network), where 1 ≤ i ≤ N and N is the number of sample pairs. The value of P(Yi | Xi) can be obtained through the encoding network and the decoding network, and the conditional likelihood over the sample set can then be maximized with respect to θ by the expectation-maximization (EM) algorithm, where θ denotes the parameters to be trained in the seq2seq model.
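For illustration, one common way to write such a conditional-likelihood objective is shown below. It assumes that the sample pairs are treated as independent and that the log-likelihood is summed over the sample set; the symbol θ* for the resulting parameters is introduced here only for this sketch.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log P\left(Y_i \mid X_i ; \theta\right)
```

Here Xi, Yi, N, and θ have the meanings given above.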
In an optional embodiment, the seq2seq model may be trained by a gradient descent (steepest descent) algorithm.
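The idea of training by gradient descent can be sketched on a deliberately simplified stand-in. The code below does not train a seq2seq model: it fits a per-character softmax model P(y | x) on hypothetical aligned character pairs, using gradient descent on the negative log-likelihood. The names `train` and `softmax`, the learning rate, the number of epochs, and the data are all assumptions made for illustration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def train(pairs, vocab, lr=0.5, epochs=200):
    """Gradient descent on the negative log-likelihood of P(y | x)."""
    idx = {ch: k for k, ch in enumerate(vocab)}
    theta = [[0.0] * len(vocab) for _ in vocab]  # theta[x][y] are logits
    for _ in range(epochs):
        for x, y in pairs:
            probs = softmax(theta[idx[x]])
            for k in range(len(vocab)):
                # d(-log p_y) / d(logit_k) = p_k - 1[k == y]
                grad = probs[k] - (1.0 if k == idx[y] else 0.0)
                theta[idx[x]][k] -= lr * grad
    return theta, idx

# Hypothetical aligned pairs (non-standard character, standard character),
# loosely modelled on "then" versus "than".
pairs = [("e", "a"), ("a", "a"), ("n", "n"), ("t", "t")]
theta, idx = train(pairs, vocab="aent")
p_e_to_a = softmax(theta[idx["e"]])[idx["a"]]
print(round(p_e_to_a, 3))
```

Each update moves the parameters against the gradient of the loss, so the probability assigned to the observed standard character rises over the epochs. A real seq2seq model would apply the same principle to the parameters of the encoding and decoding networks.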
Fig. 2 shows the flow of a text correction method according to an exemplary embodiment. The method can be applied in various kinds of electronic devices (such as user equipment or a server) and is implemented using the seq2seq model obtained through machine learning as described above. The method may include:
Step 101: Obtain text to be corrected.
Ways of obtaining the text to be corrected include, but are not limited to: receiving text to be corrected that is input by a user; extracting a specific text fragment from text input by a user as the text to be corrected; or a server using the account information logged in on a client device as the text to be corrected.
Step 103: Determine a feature vector corresponding to the text to be corrected using an encoding rule.
In one embodiment, step 103 may include:
Step 131: Using the encoding rule, determine the code corresponding to each character in the text to be corrected one by one, to obtain a code sequence corresponding to the text to be corrected. For example, for the text to be corrected "baido", the code sequence is "0110001001100001011010010110010001101111".
Step 132: Determine the feature vector corresponding to the text to be corrected from the code sequence. For example, for the code sequence 0110001001100001011010010110010001101111, the determined feature vector is (98, 97, 105, 100, 111).
Step 105: Input the feature vector into the text correction model (that is, the seq2seq model), which outputs the standard text corresponding to the text to be corrected.
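Steps 131 and 132 can be reproduced for the "baido" example with a short sketch. It assumes the ASCII encoding rule with 8 bits per character; the function names `text_to_bits` and `bits_to_features` and the `width` parameter are hypothetical.

```python
def text_to_bits(text):
    """Step 131: concatenate the 8-bit ASCII code of each character."""
    return "".join(format(b, "08b") for b in text.encode("ascii"))

def bits_to_features(bits, width=8):
    """Step 132: convert every `width` bits, in order, into one integer."""
    return tuple(int(bits[k:k + width], 2) for k in range(0, len(bits), width))

bits = text_to_bits("baido")
features = bits_to_features(bits)
print(bits)      # 0110001001100001011010010110010001101111
print(features)  # (98, 97, 105, 100, 111)
```

The `width` parameter plays the role of N in the vectorization rule: a different character type could use a different number of bits per feature value.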
In another embodiment, the text correction method may include the following steps:
Step 101: Obtain text to be corrected.
Step 102: According to the character type of the characters in the text to be corrected, select the encoding rule corresponding to that character type from multiple candidate encoding rules.
In an optional embodiment, if the character type is Chinese, a Chinese character encoding rule is selected; otherwise, the ASCII encoding rule is selected.
Step 103: Determine a feature vector corresponding to the text to be corrected using the selected encoding rule.
Step 105: Input the feature vector into the text correction model (that is, the seq2seq model), which outputs the standard text corresponding to the text to be corrected.
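The rule selection of step 102 might look like the following sketch. Two assumptions are made that the specification does not state: a character counts as Chinese if it lies in the CJK Unified Ideographs range U+4E00 to U+9FFF, and Python's `gb2312` codec stands in for a Chinese character encoding rule. The function names are hypothetical.

```python
def choose_rule(text):
    """Pick an encoding rule name from the character types in the text."""
    # Assumption: "Chinese" means a CJK Unified Ideograph (U+4E00..U+9FFF).
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "gb2312"  # stand-in for a Chinese character encoding rule
    return "ascii"

def encode(text):
    """Encode text under the selected rule; returns (rule, byte values)."""
    rule = choose_rule(text)
    return rule, list(text.encode(rule))

print(encode("baido"))
print(encode("百度")[0])
```

With these assumptions, ASCII text yields one byte value per character, while each Chinese character yields two byte values that can then be grouped during vectorization.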
With a text correction model that includes an encoding network and a decoding network and is obtained through machine learning, once the text to be corrected has been obtained, its feature vector can be input into the model to output the standard text. This realizes the text correction function and, in text recognition scenarios, can improve the text recognition rate. A seq2seq-based text correction model largely avoids human intervention (such as manually formulated rule algorithms), which can make the text recognition process more intelligent and more accurate.
Several application scenarios of the above text correction method are listed below:
1. Recognizing miswritten or rewritten text information such as personal names, place names, and company names, and matching it to the standard spelling.
2. In blacklist-user identification scenarios, identifying blacklisted users whose information has been miswritten or rewritten.
3. In information search scenarios, recognizing miswritten or rewritten queries to improve search efficiency.
Corresponding to the above method, one or more embodiments of this specification also provide a text correction device, which can be applied in various kinds of electronic devices.
As shown in Fig. 3, in one embodiment, a text correction device 300 may include a text acquisition module 301, a conversion module 302, and a text correction module 303, where:
The text acquisition module 301 is configured to obtain text to be corrected;
The conversion module 302 is configured to determine a feature vector corresponding to the text to be corrected using an encoding rule;
The text correction module 303 is configured to input the feature vector into a text correction model and output the standard text corresponding to the text to be corrected, where the text correction model includes an encoding network and a decoding network, and the encoding network and the decoding network are recurrent neural networks (RNNs).
As shown in Fig. 4, in another embodiment, the device 300 described with reference to Fig. 3 may further include a rule selection module 304, which is configured to select, according to the character type of the characters in the text to be corrected, the encoding rule corresponding to that character type from multiple candidate encoding rules.
In the embodiment shown in Fig. 4, the conversion module 302 may be configured to determine the feature vector corresponding to the text to be corrected using the selected encoding rule.
In an optional embodiment, the conversion module 302 may specifically include:
A code sequence determination module, which uses the selected encoding rule to determine, one by one, the code corresponding to each character in the text to be corrected, to obtain a code sequence corresponding to the text to be corrected;
A vectorization module, which determines the feature vector corresponding to the text to be corrected from the code sequence.
In an optional embodiment, the rule selection module 304 may be configured to select a Chinese character encoding rule if the character type is Chinese, and otherwise to select the ASCII encoding rule.
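The cooperation of modules 301 to 304 can be sketched as a single class. This is only an illustration under stated assumptions: the class and method names are hypothetical, the `gb2312` codec stands in for a Chinese character encoding rule, and a lookup table stands in for the trained seq2seq model (the mapping from "baido" to "baidu" is an assumed example, not taken from the specification).

```python
class TextCorrectingDevice:
    """Sketch of device 300; the model is injected as a callable."""

    def __init__(self, model):
        self.model = model  # hypothetical: feature tuple -> standard text

    def acquire(self, raw):          # text acquisition module 301
        return raw.strip()

    def select_rule(self, text):     # rule selection module 304
        is_chinese = any("\u4e00" <= ch <= "\u9fff" for ch in text)
        return "gb2312" if is_chinese else "ascii"

    def convert(self, text, rule):   # conversion module 302
        return tuple(text.encode(rule))

    def correct(self, raw):          # text correction module 303
        text = self.acquire(raw)
        features = self.convert(text, self.select_rule(text))
        return self.model(features)

# Stand-in for a trained model: known feature vectors map to a standard
# text, anything else is decoded back unchanged.
TABLE = {tuple(b"baido"): "baidu"}

def lookup_model(features):
    return TABLE.get(features, bytes(features).decode("ascii", "replace"))

device = TextCorrectingDevice(lookup_model)
print(device.correct(" baido "))
```

Injecting the model as a callable keeps the acquisition, rule selection, and conversion modules independent of how the correction model itself is implemented.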
In one embodiment, the device may further include:
A sample acquisition module, which obtains a sample set that includes several sample pairs, each sample pair including one non-standard text and one standard text;
An encoding vector determination module, which, for each sample pair, uses an encoding rule to convert the non-standard text into a first encoding vector and the standard text into a second encoding vector;
A model training module, which trains the encoding network and the decoding network using the first encoding vector and the second encoding vector to obtain the text correction model.
One or more embodiments of this specification provide an electronic device (for example, user equipment, a server, or another computing device), which may include a processor, an internal bus, a network interface, and memory (including main memory and non-volatile storage), and may of course also include hardware required by other services. The processor may be one or more instances of a central processing unit (CPU), processing unit, processing circuit, processor, application-specific integrated circuit (ASIC), microprocessor, or other processing logic capable of executing instructions. The processor reads the corresponding computer program from the non-volatile storage into main memory and then runs it. Of course, besides a software implementation, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the executing entity of the following processing flow is not limited to logic units and may also be hardware or a logic device.
In one embodiment, the processor may be configured to:
Obtain text to be corrected;
Determine a feature vector corresponding to the text to be corrected using an encoding rule;
Input the feature vector into a text correction model and output the standard text corresponding to the text to be corrected, where the text correction model includes an encoding network and a decoding network, and the encoding network and the decoding network are recurrent neural networks (RNNs).
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another; each embodiment focuses on its differences from the others. In particular, the equipment and device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, see the corresponding description of the method embodiments.
The systems, devices, modules, or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementing device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail sending and receiving device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described in terms of units divided by function. Of course, when one or more embodiments of this specification are implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-permanent storage in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
Those skilled in the art will understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
One or more embodiments of this specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. One or more embodiments of this specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The foregoing describes only embodiments of one or more embodiments of this specification and is not intended to limit them. For those skilled in the art, various modifications and changes may be made to one or more embodiments of this specification. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of the claims of one or more embodiments of this specification.