Summary of the invention
The purpose of the embodiments of the present application is to provide a text restoration method, apparatus, and electronic device that can restore variant text expressed by character splitting to normal text.
To achieve the above purpose, the embodiments of the present application are implemented as follows:
In a first aspect, a text restoration method is provided, comprising:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text of the target text, the segmented text including characters that cannot form a word;
matching, based on a character-splitting sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence level of each matched segmented text;
selecting, based on the confidence levels, a restored text of the target text from the at least one matched segmented text.
In a second aspect, a text restoration apparatus is provided, comprising:
an obtaining module, which obtains a target text;
a word segmentation module, which performs word segmentation on the target text to obtain a segmented text of the target text, the segmented text including characters that cannot form a word;
a matching module, which matches, based on a character-splitting sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
an evaluation module, which inputs the at least one matched segmented text into a preset language model to obtain a confidence level of each matched segmented text;
a selection module, which selects, based on the confidence levels, a restored text of the target text from the at least one matched segmented text.
In a third aspect, an electronic device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, performs the following:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text of the target text, the segmented text including characters that cannot form a word;
matching, based on a character-splitting sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence level of each matched segmented text;
selecting, based on the confidence levels, a restored text of the target text from the at least one matched segmented text.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text of the target text, the segmented text including characters that cannot form a word;
matching, based on a character-splitting sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence level of each matched segmented text;
selecting, based on the confidence levels, a restored text of the target text from the at least one matched segmented text.
As can be seen from the technical solutions provided above, the embodiments of the present application first perform word segmentation on the target text to determine the characters that cannot form a word, and then match and restore these characters as the objects of character-splitting matching, obtaining at least one matched segmented text. A preset language model then evaluates the confidence level of each matched segmented text, and the best matched segmented text is selected on the basis of confidence as the restored text of the target text. The solution can effectively restore variant text expressed by character splitting to normal text, which improves a network platform's ability to identify spam.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present application.
As mentioned above, gray and black market operators currently send spam expressed by character splitting in order to evade supervision by network platforms. In view of this, the present application aims to provide a technical solution that restores variant text expressed by character splitting to normal text, thereby improving a network platform's ability to identify spam.
Fig. 1 is a flowchart of a text restoration method according to one embodiment of the present application. The method of Fig. 1 may be executed by a text restoration apparatus. The method includes:
Step S102: obtain a target text.
For step S102:
The embodiments of the present application place no specific limitation on the source of the target text.
As an example, the target text may be text information sent by a user and obtained from a social networking platform.
For example, reviews, chat messages, and the like sent by users may be obtained from an online shopping platform.
It should be understood that any information that a network platform needs to supervise may serve as the target text.
Step S104: perform word segmentation on the target text to obtain a segmented text of the target text, where the segmented text includes characters that cannot form a word.
For step S104:
This embodiment may use any existing word segmentation method to segment the target text and thereby determine the characters in the target text that cannot form a word.
As an example, the characters determined as unable to form a word may include any of: a Chinese character, a side component of a Chinese character, and a radical of a Chinese character. Such characters are highly likely to be part of a character-splitting expression and are the key objects of the subsequent character-splitting identification.
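As a rough illustration of step S104, the sketch below segments a string by forward maximum matching against a tiny lexicon and flags every character that does not belong to a lexicon word. The application allows any segmentation method, so the algorithm, the `LEXICON` contents, the function names, and the Chinese sample string are all assumptions made for illustration.

```python
# Hypothetical mini-lexicon; a real system would rely on a full word segmenter.
LEXICON = {"需要", "手机号", "我"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Forward maximum matching. Returns a list of (token, is_word) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible lexicon word starting at position i.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in LEXICON:
                tokens.append((piece, True))
                i += size
                break
        else:
            # No lexicon word starts here: keep the single character and flag it.
            tokens.append((text[i], False))
            i += 1
    return tokens


def loose_characters(tokens):
    """Characters that could not form a word (punctuation is ignored)."""
    return [tok for tok, is_word in tokens if not is_word and tok.isalnum()]
```

Calling `loose_characters(segment(text))` yields the characters that the later matching step would examine; everything that was recognized as a word is left alone.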
Step S106: based on a character-splitting sample set, match the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text.
For step S106:
The character-splitting sample set contains preset character-splitting expressions. For example, "花口贝" corresponding to "花呗", "借口贝" corresponding to "借呗", "亻昔款" corresponding to "借款" (loan), and "亻昔钱" corresponding to "借钱" (borrow money) are character-splitting expressions of particular words, while "亻昔" corresponding to "借" (borrow) and "口贝" corresponding to "呗" are character-splitting expressions of a single Chinese character.
In this step, the character-splitting sample set is used to match the characters in the segmented text that cannot form a word and to restore them to their normally expressed form.
Specifically, characters that cannot form a word and are adjacent in the row direction of the segmented text may be matched.
For example, suppose the segmented text is "六合 采 彡 月 贝 兼 百万", and the character-splitting sample set records that "采彡" corresponds to "彩" and "贝兼" corresponds to "赚". The characters "采", "彡", "月", "贝", and "兼" are not identified as words in the segmented text, so these adjacent characters are matched against the character-splitting sample set, and the resulting matched segmented text is "六合彩月赚百万" (roughly, "Mark Six lottery, earn a million a month").
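The row-direction matching described above can be sketched as follows. This is a minimal illustration, not the application's implementation: it merges greedily from left to right and therefore produces a single candidate, whereas the method may produce several. The `SPLIT_TABLE` contents, the function name, and the Chinese rendering of the lottery example are assumptions.

```python
# Hypothetical character-splitting sample set: component pair -> restored character.
SPLIT_TABLE = {"采彡": "彩", "贝兼": "赚"}


def merge_row(tokens):
    """Greedily merge adjacent loose characters that appear in SPLIT_TABLE.

    `tokens` is a list of (token, is_word) pairs in reading order.
    Tokens flagged as words are never merged.
    """
    out = []
    i = 0
    while i < len(tokens):
        tok, is_word = tokens[i]
        has_loose_neighbor = i + 1 < len(tokens) and not tokens[i + 1][1]
        if not is_word and has_loose_neighbor and tok + tokens[i + 1][0] in SPLIT_TABLE:
            out.append(SPLIT_TABLE[tok + tokens[i + 1][0]])
            i += 2  # both components were consumed
        else:
            out.append(tok)
            i += 1
    return "".join(out)
```

With the tokens of the lottery example, "采" and "彡" merge, "月" has no partner and is kept, and "贝" and "兼" merge.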
Similarly, it can also be matched to the character that can not form participle that column direction is adjacent in text is segmented;
For example, participle text are as follows: " add cell-phone number xx, it can be low from arbitrage
The heart ";
It can then be matched based on dividing by means of characters sample set, " certainly " adjacent to column direction, " heart ", be divided after the matching determined
Ziwen sheet are as follows: " add cell-phone number xx, can low interest arbitrage ".
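A column-direction match can be sketched in the same spirit. The snippet assumes the text arrives as a list of lines in which vertically stacked components share the same column index, and for brevity it does not check whether a character was flagged as unable to form a word. `VERTICAL_TABLE` and `merge_columns` are hypothetical names, and the Chinese strings are an assumed rendering of the example above.

```python
# Hypothetical vertical sample set: (upper component, lower component) -> character.
VERTICAL_TABLE = {("自", "心"): "息"}


def merge_columns(lines):
    """Merge characters stacked in the same column of two consecutive lines."""
    rows = [list(line) for line in lines]
    for r in range(len(rows) - 1):
        upper, lower = rows[r], rows[r + 1]
        for c in range(min(len(upper), len(lower))):
            merged = VERTICAL_TABLE.get((upper[c], lower[c]))
            if merged is not None:
                upper[c] = merged
                lower[c] = " "  # blank out the consumed lower component
    joined = ("".join(row).rstrip() for row in rows)
    # Lines that are empty after merging are dropped.
    return [line for line in joined if line]
```

A production version would also need to account for display width, since full-width and half-width characters do not line up by index alone.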
Step S108: input the at least one matched segmented text into a preset language model to obtain a confidence level of each matched segmented text.
For step S108:
It should be understood that a matched segmented text determined from the character-splitting sample set is not necessarily the correct restored text, so a preset language model is needed to evaluate the confidence level of each matched segmented text. The confidence level of a matched segmented text reflects how accurately that text restores the target text.
It should be understood that the preset language model can be set flexibly according to the actual application scenario; the embodiments of the present application place no specific limitation on it.
As an example, assume the solution of the embodiments of the present application is used to restore spam expressed by character splitting on a network. The preset language model may be trained on a spam sample set. After the at least one matched segmented text is input into the preset language model, the model scores the confidence level of each matched segmented text against the evaluation criteria for spam. The higher the confidence score of a matched segmented text, the more likely it is to be spam, and the higher the corresponding restoration accuracy.
Alternatively, the preset language model of the embodiments of the present application may use the way correct sentences are expressed as its evaluation criterion when scoring the confidence level of each matched segmented text, for example the correct sentence structure "subject, predicate, object". The higher the confidence score of a matched segmented text, the higher the corresponding restoration accuracy.
Because the preset language model can be implemented in many ways, further examples are not given here.
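To make the idea of a confidence score concrete, the sketch below uses a character bigram model with add-one smoothing and takes the average log-probability of a candidate as its confidence. This is a stand-in chosen for brevity: the application does not specify the model, and the class name, the tiny training corpus, and the Chinese candidate strings are assumptions.

```python
import math
from collections import Counter


class BigramModel:
    """Character bigram language model with add-one smoothing."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            chars = ["<s>"] + list(sentence)
            self.unigrams.update(chars)
            self.bigrams.update(zip(chars, chars[1:]))
        self.vocab_size = len(self.unigrams) + 1  # +1 slot for unseen characters

    def confidence(self, text):
        """Average log-probability per character; higher means more plausible."""
        if not text:
            return float("-inf")
        chars = ["<s>"] + list(text)
        log_prob = 0.0
        for prev, cur in zip(chars, chars[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob / (len(chars) - 1)


# Hypothetical training sentences standing in for a real sample set.
model = BigramModel(["需要借钱,加我手机号", "借钱请加我微信", "加我手机号"])
```

A candidate whose character sequences were seen during training, such as "加我", scores higher than one containing unseen sequences, such as "力哦".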
Step S110: based on the confidence level of each matched segmented text, select the restored text of the target text from the at least one matched segmented text.
For step S110:
In this step, the matched segmented text with the highest confidence level may be chosen from the at least one matched segmented text as the restored text of the target text.
In the embodiments of the present application, word segmentation is first performed on the target text to determine the characters that cannot form a word, and these characters are matched and restored as the objects of character-splitting matching, obtaining at least one matched segmented text. A preset language model then evaluates the confidence level of each matched segmented text, and the best matched segmented text is selected on the basis of confidence as the restored text of the target text. The solution can effectively restore variant text expressed by character splitting to normal text, which improves a network platform's ability to identify spam.
The flow of the text restoration method of the embodiments of the present application in a practical application is described in detail below.
The main flow of the text restoration method includes:
Step 1: obtain the target text.
In this step, a target text sent by a user may be obtained from a social networking platform (such as communication software or online shopping software).
As an example, assume the content of the target text is "需要亻昔钱,力口我手机号", a character-split form of "need to borrow money, add my phone number". This target text is evidently spam expressed by character splitting.
Step 2: determine the segmented text.
In this step, word segmentation is performed on "需要亻昔钱,力口我手机号". For ease of understanding, the segments are separated by spaces, and the resulting segmented text is "需要 亻 昔 钱, 力 口 我 手机号".
It should be understood that "需要" (need), "我" (I, my), and "手机号" (phone number) in the target text are identified as words, while "亻", "昔", "钱", "力", and "口" are characters that cannot form a word.
Step 3: character-splitting matching.
In this step, a character-splitting table resource is used to match the segmented text: "亻昔" can be matched to "借" (borrow), "力口" can be matched to "加" (add), and "口我" can be matched to "哦". Based on the character-splitting table resource, the following two matched segmented texts are finally obtained:
The first is "需要借钱,加我手机号" ("need to borrow money, add my phone number");
The second is "需要借钱,力哦手机号", in which "力" is left unmatched.
Step 4: confidence evaluation.
In this step, the two matched segmented texts from step 3 are input into the preset language model to compute the confidence level P1 of the first matched segmented text ("need to borrow money, add my phone number") and the confidence level P2 of the second.
The preset language model may be a classification model trained on samples of illegal-lending spam.
For example, features common to illegal-lending messages may be used as the input vector of the preset language model, and the model is trained on spam samples so that the weights applied to the input vector are continually optimized.
After both matched segmented texts are input into the trained preset language model, the first evidently has the "add my phone number" feature common to illegal-lending messages, and therefore receives the higher confidence level from the classification model.
It should be noted that the embodiments of the present application place no specific limitation on the function used by the preset language model; any function suitable for classification may be used.
Step 5: probability comparison.
In this step, the confidence level of the first matched segmented text is compared with that of the second (here, P1 > P2). The candidate with the higher confidence level is evidently more likely to be the correct restored text.
Step 6: output the restored text.
In this step, based on the comparison result of step 5 (P1 > P2), the restored text finally output is the first matched segmented text, "need to borrow money, add my phone number".
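Steps 3 to 6 can be strung together in a short sketch. Several things here are assumptions rather than the application's design: the Chinese strings are an assumed rendering of the running example, any single-character token is allowed to take part in a match (which is how the walk-through reaches its second candidate), candidates that still contain a splittable pair are discarded, and the confidence function is a hand-weighted logistic scorer standing in for a trained classifier.

```python
import math

# Hypothetical character-splitting table and classifier weights.
SPLIT_TABLE = {"亻昔": "借", "力口": "加", "口我": "哦"}
FEATURE_WEIGHTS = {"借钱": 1.0, "加我": 1.5, "手机号": 0.5}
BIAS = -2.0


def candidates(tokens):
    """Enumerate restorations in which no table entry is left unmerged."""
    def walk(i):
        if i >= len(tokens):
            yield ""
            return
        for rest in walk(i + 1):  # option 1: keep token i as it is
            yield tokens[i] + rest
        if i + 1 < len(tokens):
            pair = tokens[i] + tokens[i + 1]
            # option 2: merge two single-character tokens found in the table
            if len(pair) == 2 and pair in SPLIT_TABLE:
                for rest in walk(i + 2):
                    yield SPLIT_TABLE[pair] + rest

    return sorted({text for text in walk(0)
                   if not any(key in text for key in SPLIT_TABLE)})


def confidence(text):
    """Stand-in classifier: logistic score over hand-picked spam features."""
    z = BIAS + sum(w for feat, w in FEATURE_WEIGHTS.items() if feat in text)
    return 1.0 / (1.0 + math.exp(-z))


def restore(tokens):
    """Return the highest-confidence candidate and all candidate scores."""
    scored = {text: confidence(text) for text in candidates(tokens)}
    if not scored:
        return "".join(tokens), scored
    return max(scored, key=scored.get), scored
```

For the running example the two surviving candidates are the "加我" and "力哦" readings, and the former wins because it carries the "加我" feature.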
In summary, the text restoration method of the embodiments of the present application can identify the characters of a target text that are expressed by character splitting and restore them by matching. In a specific implementation, word segmentation is first performed on the target text so that only the characters that cannot form a word are taken as the objects of character-splitting matching, which effectively reduces the number of matching operations and improves matching accuracy. A language model is then used to select the best matched segmented text as the restored text of the target text. The whole scheme is computationally simple and occupies relatively few processing resources, so it is particularly suitable for a network platform to identify spam expressed by character splitting.
Fig. 3 is a schematic structural diagram of an electronic device according to one embodiment of the present application. Referring to Fig. 3, at the hardware level the electronic device includes a processor and optionally also includes an internal bus, a network interface, and a memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage. The electronic device may of course also include hardware required by other services.
The processor, the network interface, and the memory may be connected to one another through the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one double-headed arrow is shown in Fig. 3, but this does not mean that there is only one bus or one type of bus.
The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into internal memory and then runs it, forming the text restoration apparatus at the logical level. The processor executes the program stored in the memory, and is specifically configured to perform the following operations:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text of the target text, the segmented text including characters that cannot form a word;
matching, based on a character-splitting sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence level of each matched segmented text;
selecting, based on the confidence levels, a restored text of the target text from the at least one matched segmented text.
The text restoration method disclosed in the embodiment shown in Fig. 1 may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be executed and completed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium well established in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device can also perform the method shown in Fig. 1 and implement the functions of the text restoration apparatus in the embodiments shown in Fig. 1 and Fig. 2, which are not repeated here.
Of course, in addition to a software implementation, the electronic device of the present application does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logical units and may also be hardware or logic devices.
The embodiments of the present application also provide a computer-readable storage medium that stores one or more programs. The one or more programs include instructions which, when executed by a portable electronic device that includes multiple application programs, cause the portable electronic device to perform the method of the embodiment shown in Fig. 1, and specifically to perform the following:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text of the target text, the segmented text including characters that cannot form a word;
matching, based on a character-splitting sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence level of each matched segmented text;
selecting, based on the confidence levels, a restored text of the target text from the at least one matched segmented text.
It should be understood that, when executed by a processor, the program stored in the computer-readable storage medium of the present application can implement the functions of the text restoration apparatus in the embodiments shown in Fig. 1 and Fig. 2, which are not repeated here.
Fig. 4 is a schematic structural diagram of a text restoration apparatus 400 according to one embodiment of the present application, comprising:
an obtaining module 410, which obtains a target text;
a word segmentation module 420, which performs word segmentation on the target text to obtain a segmented text of the target text, the segmented text including characters that cannot form a word;
a matching module 430, which matches, based on a character-splitting sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
an evaluation module 440, which inputs the at least one matched segmented text into a preset language model to obtain a confidence level of each matched segmented text;
a selection module 450, which selects, based on the confidence levels, a restored text of the target text from the at least one matched segmented text.
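The module structure of apparatus 400 can be mirrored in code as a small class whose fields correspond to modules 410 to 440 and whose `select` method plays the role of module 450. This is only an architectural sketch: the class name, field names, and callable signatures are assumptions, and the stub functions in the usage note carry no real segmentation or scoring logic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextRestorationApparatus:
    acquire: Callable[[], str]                # obtaining module 410
    segment: Callable[[str], List[str]]       # word segmentation module 420
    match: Callable[[List[str]], List[str]]   # matching module 430
    evaluate: Callable[[str], float]          # evaluation module 440

    def select(self, candidates: List[str], scores: List[float]) -> str:
        """Selection module 450: pick the candidate with the highest confidence."""
        return max(zip(candidates, scores), key=lambda pair: pair[1])[0]

    def run(self) -> str:
        text = self.acquire()
        tokens = self.segment(text)
        candidates = self.match(tokens)
        scores = [self.evaluate(candidate) for candidate in candidates]
        return self.select(candidates, scores)


# Usage with placeholder stubs in place of real modules.
apparatus = TextRestorationApparatus(
    acquire=lambda: "a b",
    segment=lambda text: text.split(),
    match=lambda tokens: ["x", "yy"],
    evaluate=lambda candidate: float(len(candidate)),
)
```

Injecting the modules as callables keeps each one replaceable, which matches the application's statement that the segmentation method and language model are not limited to a particular choice.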
The embodiments of the present application first perform word segmentation on the target text to determine the characters that cannot form a word, and then match and restore these characters as the objects of character-splitting matching, obtaining at least one matched segmented text. A preset language model then evaluates the confidence level of each matched segmented text, and the best matched segmented text is selected on the basis of confidence as the restored text of the target text. The solution can effectively restore variant text expressed by character splitting to normal text, which improves a network platform's ability to identify spam.
Optionally, as one embodiment, the matching module 430 is specifically configured to:
match, based on a character-splitting sample resource, the characters that cannot form a word and are adjacent in the row direction of the segmented text.
Optionally, as one embodiment, the matching module 430 is specifically configured to:
match, based on a character-splitting sample resource, the characters that cannot form a word and are adjacent in the column direction of the segmented text.
Optionally, as one embodiment, the selection module 450 is specifically configured to:
choose the matched segmented text with the highest confidence level from the at least one matched segmented text as the restored text of the target text.
Optionally, as one embodiment, the characters in the segmented text that cannot form a word include any of: a Chinese character, a side component of a Chinese character, and a radical of a Chinese character.
Optionally, as one embodiment, the preset language model is obtained by training on a spam sample set.
Optionally, as one embodiment, the obtaining module 410 is specifically configured to:
obtain, from a social networking platform, the target text sent by a user.
It should be understood that the text restoration apparatus of the embodiments of the present application can perform the method of Fig. 1 and implement the functions of that method in the embodiments shown in Fig. 1 and Fig. 2, which are not repeated here.
Those skilled in the art will understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired result. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The above are only embodiments of this specification and are not intended to limit it. Those skilled in the art may make various modifications and changes to this specification. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this specification shall fall within the scope of its claims.