CN108628822A - Semantic-free text recognition method and device - Google Patents

Semantic-free text recognition method and device

Info

Publication number
CN108628822A
Authority
CN
China
Prior art keywords
text
identified
training sample
probability
word
Prior art date
Legal status
Granted
Application number
CN201710182218.7A
Other languages
Chinese (zh)
Other versions
CN108628822B (en)
Inventor
江南
祝慧佳
Current Assignee
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710182218.7A
Publication of CN108628822A
Application granted
Publication of CN108628822B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the field of computer technology, and in particular to a method and device for recognizing semantic-free text. In one method for recognizing semantic-free text, a text to be recognized is obtained and preprocessed. Each word sequence of the preprocessed text to be recognized is determined, and the probability score of each word sequence is determined according to an N-gram language model. Based on the probability scores of the word sequences and the number of word sequences, an average probability score and/or a probability-score standard deviation of the text to be recognized is determined. From the average probability score and/or the probability-score standard deviation, a composite score of the text to be recognized is determined. When the composite score satisfies a preset condition, the text to be recognized is recognized as semantic-free text. In this way, both the accuracy and the comprehensiveness of semantic-free text recognition can be improved.

Description

Semantic-free text recognition method and device
Technical field
The present application relates to the field of computer technology, and in particular to a method and device for recognizing semantic-free text.
Background art
In traditional techniques, semantic-free text is mainly recognized by the following two methods:
The first method is supervised machine learning. Semantic features of semantic-free text, such as mutated words and special characters, are collected manually in advance, and the samples in a corpus are labeled as semantic-free text or not; a recognition model is then trained on the corpus and the semantic features, and the model is finally used to judge whether an input text is semantic-free text. However, with this method, if the input text contains a semantic feature that was not collected, or contains a deformed variant of a collected semantic feature, the input text cannot be recognized as semantic-free text, which affects the accuracy of semantic-free text recognition. In addition, this method usually requires a large amount of manual effort to label the samples in the corpus, which affects the efficiency of semantic-free text recognition.
The second method computes similarity. Content texts obtained from user reports or through other channels are archived to build a sample database; afterwards, the similarity between an input text and the content texts in the sample database is computed to judge whether the input text is semantic-free text. However, this method can usually only recognize content texts that have already appeared and cannot recognize novel content texts. In today's era of ever-expanding information, it is impossible to enumerate all content texts manually, so this method cannot recognize every input text; in other words, the second method is not comprehensive in recognizing semantic-free text.
Summary of the invention
The present application describes a method and device for recognizing semantic-free text, which can improve the accuracy and comprehensiveness of semantic-free text recognition.
In a first aspect, a method for recognizing semantic-free text is provided, including:
obtaining a text to be recognized;
preprocessing the text to be recognized;
determining each word sequence of the preprocessed text to be recognized;
determining the probability score of each word sequence according to an N-gram language model;
determining the average probability score and/or probability-score standard deviation of the text to be recognized according to the probability scores of the word sequences and the number of word sequences;
determining the composite score of the text to be recognized according to the average probability score and/or the probability-score standard deviation; and
recognizing the text to be recognized as semantic-free text when the composite score satisfies a preset condition.
In a second aspect, a device for recognizing semantic-free text is provided, including:
an acquiring unit, configured to obtain a text to be recognized;
a preprocessing unit, configured to preprocess the text to be recognized obtained by the acquiring unit;
a determining unit, configured to determine each word sequence of the text to be recognized preprocessed by the preprocessing unit;
the determining unit being further configured to determine the probability score of each word sequence according to an N-gram language model;
the determining unit being further configured to determine the average probability score and/or probability-score standard deviation of the text to be recognized according to the probability scores of the word sequences and the number of word sequences;
the determining unit being further configured to determine the composite score of the text to be recognized according to the average probability score and/or the probability-score standard deviation; and
a recognizing unit, configured to recognize the text to be recognized as semantic-free text when the composite score determined by the determining unit satisfies a preset condition.
With the method and device for recognizing semantic-free text provided by the present application, a text to be recognized is obtained and preprocessed. Each word sequence of the preprocessed text is determined, and the probability score of each word sequence is determined according to the N-gram language model. The average probability score and/or probability-score standard deviation of the text to be recognized is determined from the probability scores of the word sequences and the number of word sequences, and the composite score of the text to be recognized is determined from the average probability score and/or the probability-score standard deviation. When the composite score satisfies a preset condition, the text to be recognized is recognized as semantic-free text. In this way, the accuracy and comprehensiveness of semantic-free text recognition can be improved.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. Apparently, the drawings in the following description are merely some embodiments of the present invention, and persons of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for recognizing semantic-free text provided by an embodiment of the present application;
Fig. 2 is a flowchart of the model training method provided by the present application;
Fig. 3 is a flowchart of the method for determining the normalization formulas provided by the present application;
Fig. 4 is a schematic diagram of the method for recognizing semantic-free text provided by the present application;
Fig. 5 is a schematic diagram of the device for recognizing semantic-free text provided by another embodiment of the present application.
Detailed description of embodiments
The embodiments of the present application are described below with reference to the accompanying drawings.
The method for recognizing semantic-free text provided by the embodiments of the present application is applicable to scenarios in which low-quality text (also called semantic-free text or junk text) is recognized. Low-quality text here includes, but is not limited to, the following: text generated by random taps on an input device such as a keyboard; text containing pornographic, violent or terrorist, or politically sensitive content; text in non-target languages (e.g., Japanese, Korean, Russian); and the like.
Fig. 1 is a flowchart of the method for recognizing semantic-free text provided by an embodiment of the present application. The method may be executed by a device with processing capability: a server, a system, or an apparatus. As shown in Fig. 1, the method may specifically include:
Step 110: obtain a text to be recognized.
The text to be recognized here may be Chinese text or English text; of course, it may also be text in another target language (Korean, Japanese, etc.).
Step 120: preprocess the text to be recognized.
Here, preprocessing the text to be recognized includes any one or more of the following steps:
(1) Removing interference elements from the text to be recognized. Interference elements may include emoticons (e.g., emoji stored as Unicode code points, or sequences such as "^_^") and uniform resource locator (URL) addresses. In one implementation, regular expressions can be used to remove the interference elements from the text.
(2) Converting traditional Chinese characters in the text into simplified characters, e.g., according to a traditional-to-simplified character mapping table, thereby reducing the parameter space of the language model.
(3) Converting numeric characters in the text into a predetermined format. Numbers here may be telephone numbers, monetary amounts, and so on. In one implementation, regular expressions can be used to extract digit strings from the text and convert each into the form "Num{length of the digit string}", achieving dimensionality reduction for digits, where "Num" indicates that the string is a digit string. For example, the digit string "123456" can be converted into Num6, and the digit string 1111111111 into Num10.
(4) Splitting the text into clauses, e.g., at punctuation marks (commas, full stops, question marks, etc.). In one example, clause splitting can be done with built-in functions of the Java language. Note that splitting the text into clauses also enables distributed processing across computers.
Note that the above four preprocessing steps can be combined flexibly according to the actual situation. For example, when the text to be recognized is English, step (2) can be skipped, since English text contains no traditional Chinese characters, and only steps (1), (3), and (4) are performed. Each individual step can also be adapted to the actual situation; for example, if URLs in the input text are to be recognized, the URL-removal step can be omitted.
It should also be noted that preprocessing in the present application is not limited to the above four steps and may also include other steps, such as word segmentation; the present application does not limit this.
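As an illustration, the following is a minimal Java sketch of steps (1), (3), and (4); the regular expressions, the clause delimiters, and the class and method names are simplified assumptions made for this example, not the exact patterns used in the embodiments.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Preprocessor {
    // Simplified patterns; real deployments would need broader emoji/URL coverage.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern EMOTICON = Pattern.compile("\\^_\\^|[\\x{1F300}-\\x{1F6FF}]");
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    public static List<String> preprocess(String text) {
        // (1) Remove interference elements such as URLs and emoticons.
        text = URL.matcher(text).replaceAll("");
        text = EMOTICON.matcher(text).replaceAll("");

        // (3) Digit dimensionality reduction: "123456" -> "Num6".
        Matcher m = DIGITS.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, "Num" + m.group().length());
        }
        m.appendTail(sb);

        // (4) Split into clauses at common punctuation marks.
        return Arrays.asList(sb.toString().split("[,，.。?？!！;；]+"));
    }
}
```

Step (2) would additionally require a traditional-to-simplified character mapping table and is omitted from this sketch.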
Step 130: determine each word sequence of the preprocessed text to be recognized.
Specifically, each word combination in the text to be recognized can first be determined according to the N-gram language model, and the word combinations containing n words are then selected from all word combinations. When the N-gram language model is a unigram, n = 1; when it is a bi-gram, n = 2; when it is a tri-gram, n = 3; and so on. The selected word combinations containing n words serve as the word sequences of the text to be recognized. The process of determining the word combinations of a text is illustrated below in the description of training the N-gram language model.
Step 140: determine the probability score of each word sequence according to the N-gram language model.
An N-gram language model is a statistical language model that can be generated from large-scale training samples. An n-gram language model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and is unrelated to any other word. Therefore, the probability score of the n-th word can be computed from the number of times the preceding n-1 words co-occur in the corpus, and the probability score of a whole sentence is the product of the probability scores of the words it contains. For the n-gram language model, when n = 1 the model is called a unigram; when n = 2, a bi-gram; when n = 3, a tri-gram; and so on.
The above can be expressed with the following mathematical formula:

p(w_i | w_{i-n+1}, ..., w_{i-1}) = C(w_{i-n+1}, ..., w_i) / C(w_{i-n+1}, ..., w_{i-1})    (formula 1)

where w_i is the i-th word, w_{i-n+1}, ..., w_{i-1} are the n-1 words preceding the i-th word, p(w_i | w_{i-n+1}, ..., w_{i-1}) is the probability score of the i-th word occurring, C(w_{i-n+1}, ..., w_i) is the number of times the n words ending with the i-th word co-occur in the corpus, and C(w_{i-n+1}, ..., w_{i-1}) is the number of times the n-1 words preceding the i-th word co-occur in the corpus.
For a sentence T composed of the words w_1, w_2, w_3, ..., w_m, the probability score of T occurring under the n-gram language model is:

p(T) = ∏_{i=1..m} p(w_i | w_{i-n+1}, ..., w_{i-1})    (formula 2)
Because the parameter space of an n-gram language model is very large (in theory it grows exponentially with the vocabulary size), in practice low-frequency words are usually filtered out according to empirical rules, or a dictionary is preconfigured to limit which words are counted when word occurrences are tallied. Moreover, when the probability score of a whole sentence is computed, smoothing such as Laplace smoothing, Kneser-Ney, or Stupid Backoff can be applied to ensure that a single unseen word (i.e., one with probability 0) does not force the probability score of the entire sentence to 0. Finally, because the probability of a whole sentence occurring can be extremely small, with many digits after the decimal point, a mathematical tool such as the logarithm function can be used to rescale the values.
In one example, the problem of computing the probability score of a sentence can be converted into accumulating the probability scores of its word sequences, where the resulting score is less than 0. The formula is as follows:

Score(T) = Σ_{i=1..m} log p(w_i | w_{i-n+1}, ..., w_{i-1})    (formula 3)

where T is the whole sentence, Score(T) is the probability score of the sentence T occurring, and w_1, w_2, ..., w_m are all m words in the sentence T.
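For illustration, the following Java sketch scores a clause with a bi-gram model stored as count tables, applying add-one (Laplace) smoothing; the table layout, the smoothing choice, and the class and method names are assumptions made for this example.

```java
import java.util.List;
import java.util.Map;

public class NgramScorer {
    private final Map<String, Long> unigramCounts; // "w1" -> count
    private final Map<String, Long> bigramCounts;  // "w1 w2" -> count
    private final long vocabularySize;

    public NgramScorer(Map<String, Long> uni, Map<String, Long> bi, long v) {
        this.unigramCounts = uni;
        this.bigramCounts = bi;
        this.vocabularySize = v;
    }

    // Formula 1 with add-one smoothing: p(w2|w1) = (C(w1 w2) + 1) / (C(w1) + V),
    // so an unseen bi-gram never forces the whole sentence score to negative infinity.
    private double prob(String w1, String w2) {
        long joint = bigramCounts.getOrDefault(w1 + " " + w2, 0L);
        long prior = unigramCounts.getOrDefault(w1, 0L);
        return (joint + 1.0) / (prior + vocabularySize);
    }

    // Formula 3: sum of base-10 log probabilities over all bi-gram word sequences; always < 0.
    public double score(List<String> words) {
        double score = 0.0;
        for (int i = 1; i < words.size(); i++) {
            score += Math.log10(prob(words.get(i - 1), words.get(i)));
        }
        return score;
    }
}
```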
Before step 140 is executed, the N-gram language model can first be trained; the specific training process is described later.
In one example, the N-gram language model can be stated in the following format:
"word combination <space> number of occurrences of the word combination in the corpus"
For example, it can be expressed as:
unigram:
w1 10
w2 10
w3 10
……
wn 10
bi-gram:
w1w2 5
w1w3 5
w2w3 5
……
wn-1wn 10
tri-gram:
w1w2w3 2
w1w3w4 3
w2w3wn 5
……
wn-2wn-1wn 8
As can be seen from the above representation, each language model (unigram, bi-gram, and tri-gram) is stated in this example.
Returning to step 140: specifically, the probability score of each word sequence is determined according to formula 1. Taking the determination of the probability score of the word sequence w1w2 as an example, the probability score of this word sequence is:

p(w2 | w1) = C(w1w2) / C(w1)

where p(w2 | w1) is the probability score of the word sequence w1w2, C(w1w2) is the number of occurrences of the word combination w1w2, and C(w1) is the number of occurrences of the word combination w1. Because the above N-gram language model already contains the occurrence counts of w1w2 and w1, the probability score of the word sequence w1w2 can be determined directly. Following the same method, the probability scores of all word sequences can be determined.
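As a worked instance using the illustrative counts stated above (the combination w1w2 occurring 5 times and w1 occurring 10 times): p(w2 | w1) = 5 / 10 = 0.5, which contributes log10(0.5) ≈ -0.301 to Score(T) under formula 3.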
Step 150: determine the average probability score and/or probability-score standard deviation of the text to be recognized according to the probability scores of the word sequences and the number of word sequences.
In one example, the average probability score can be computed according to formula 4:

AvgScore(T) = Score(T) / Count(T)    (formula 4)

where T is the text to be recognized, AvgScore(T) is the average probability score of the text T, Score(T) is the probability score of the text T, and Count(T) is the number of word sequences contained in the text T. Score(T) can be computed according to formula 3; that is, Score(T) is obtained by taking the logarithm of the probability score of each word sequence of T and summing.
Note that, in the present application, the open-source mathematical library Commons Math of the Apache Software Foundation may be used to compute the probability-score standard deviation SdScore of the text to be recognized.
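A minimal sketch of this step with Commons Math 3, assuming the per-word-sequence log scores are already available as a double array (class and method names are illustrative):

```java
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class TextStatistics {
    // Computes AvgScore (formula 4) and SdScore from the per-word-sequence log scores.
    public static double[] avgAndStdDev(double[] wordSequenceScores) {
        DescriptiveStatistics stats = new DescriptiveStatistics(wordSequenceScores);
        double avgScore = stats.getSum() / stats.getN(); // Score(T) / Count(T)
        double sdScore = stats.getStandardDeviation();   // sample standard deviation
        return new double[] {avgScore, sdScore};
    }
}
```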
Step 160: determine the composite score of the text to be recognized according to the average probability score and/or the probability-score standard deviation.
Note that step 160 covers three cases. In the first case, the composite score of the text to be recognized is determined from the average probability score: the average probability score can be determined directly as the composite score; or the average probability score can first be normalized according to a preset first function formula to obtain a first processing result, which is determined as the composite score; or, after the first processing result is obtained, it can be amplified (e.g., by a factor of 100) and the amplified first processing result determined as the composite score. In the second case, the composite score is determined from the probability-score standard deviation: the probability-score standard deviation can be determined directly as the composite score; or it can first be normalized according to a preset second function formula to obtain a second processing result, which is determined as the composite score; or, after the second processing result is obtained, it can be amplified (e.g., by a factor of 100) and the amplified second processing result determined as the composite score. In the third case, the composite score is determined from both the average probability score and the probability-score standard deviation. In one specific implementation, the third case can be carried out through the following steps:
Step A: normalize the average probability score to obtain a first processing result.
Here, the average probability score can be normalized according to the preset first function formula.
Step B: normalize the probability-score standard deviation to obtain a second processing result.
Likewise, the probability-score standard deviation can be normalized according to the preset second function formula.
Note that the way the first function formula and the second function formula are set is explained later.
Step C: compare the first processing result with the second processing result.
Step D: when the first processing result is larger, determine the composite score of the text to be recognized according to the first processing result.
In one specific implementation, the first processing result can be determined directly as the composite score of the text to be recognized. In another specific implementation, the first processing result can first be amplified (e.g., by a factor of 100), and the amplified first processing result determined as the composite score of the text to be recognized.
Step E: when the second processing result is larger, determine the composite score of the text to be recognized according to the second processing result.
In one specific implementation, the second processing result can be determined directly as the composite score of the text to be recognized. In another specific implementation, the second processing result can first be amplified (e.g., by a factor of 100), and the amplified second processing result determined as the composite score of the text to be recognized.
In one example, when the first or second processing result is amplified by a factor of 100, the above steps C to E can be expressed as formula 5:

GloScore(S) = 100 * max(F1(AvgScore(S)), F2(SdScore(S)))    (formula 5)

where S is the text to be recognized, GloScore(S) is the composite score of the text to be recognized, F1(AvgScore(S)) is the first processing result, and F2(SdScore(S)) is the second processing result.
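Formula 5 in code form, as a sketch with the two normalizers passed in as functions (the normalizers F1 and F2 themselves are defined later; the interface choice here is an assumption):

```java
import java.util.function.DoubleUnaryOperator;

public class CompositeScore {
    // Formula 5: GloScore(S) = 100 * max(F1(AvgScore(S)), F2(SdScore(S))).
    public static double gloScore(double avgScore, double sdScore,
                                  DoubleUnaryOperator f1, DoubleUnaryOperator f2) {
        return 100.0 * Math.max(f1.applyAsDouble(avgScore), f2.applyAsDouble(sdScore));
    }
}
```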
Step 170: when the composite score satisfies a preset condition, recognize the text to be recognized as semantic-free text.
In one implementation, a threshold can be preset. Specifically, when the composite score is less than the threshold, it is determined that the composite score satisfies the preset condition, and the text to be recognized can thus be recognized as semantic-free text.
It should be noted that the present application recognizes semantic-free text based on two text statistics, the average probability score and/or the probability-score standard deviation, and further proposes the idea of normalizing the two values and amplifying the data. This solves the problem in traditional techniques that a long text contains many word sequences, so its computed probability score is very small and such text cannot be recognized accurately. It also addresses the problem that potentially risky users can break through a recognition model by splitting characters, mutating words, and the like, and thereby publish semantic-free text; even if the recognition model is updated afterwards, the update lags, the semantic-free text cannot be intercepted in time, and adverse social impact results.
The training process of the N-gram language model is described below with reference to Fig. 2. The process may include the following steps:
Step 210: obtain a training sample set.
The training sample set here may also be called a corpus and may include at least one training sample. The training samples may be Chinese texts, English texts, and/or texts in other target languages collected in advance, manually or by a server, from web pages or servers; the texts may include news content, blog content, forum content, and/or chat content. Note that, to cover as many as possible of the universal semantic features that natural persons use for communication, the larger the corpus, the better; for example, in a specific implementation the corpus can reach a scale of 20 billion samples. In addition, so that texts in non-target languages can be recognized, or in other words, to avoid recognizing target-language text (e.g., a normal English sentence) as low-quality text, target-language training samples of a certain scale can be added to the corpus to ensure that the word sequences of that target language are numerous enough and occur often enough. For example, in a specific implementation, 1 billion training samples whose target language is English can be added to the corpus.
Step 220: preprocess each training sample in the training sample set.
Here, preprocessing a training sample is similar to the preprocessing of the text to be recognized described above; it can be any combination of the four steps described in step 120 and is not repeated here.
Step 230: for each preprocessed training sample, determine each word combination in the training sample.
A word combination here may include at least one word.
In one implementation, the word combinations in a training sample can be determined according to the n-gram language model; specifically, they can be determined according to formula 1. In formula 1, computing the probability score of the i-th word requires counting both C(w_{i-n+1}, ..., w_i) and C(w_{i-n+1}, ..., w_{i-1}). The word combinations in a training sample can therefore be determined as w_{i-n+1}, ..., w_i and w_{i-n+1}, ..., w_{i-1}, i.e., the n words ending with the i-th word and the n-1 words preceding the i-th word, where i ranges over [1, m] and m is the number of words contained in the training sample. The value of n depends on the language model used: when a unigram model is used, n = 1; for a bi-gram model, n = 2; for a tri-gram model, n = 3; and so on.
Of course, in practical applications, only w_{i-n+1}, ..., w_i may be determined as the word combinations in the training sample, or the single words in the training sample may also be treated as word combinations, and so on; the present application does not limit this.
Taking the bi-gram language model as an example, with the word combinations being w_{i-n+1}, ..., w_i and w_{i-n+1}, ..., w_{i-1}: suppose a training sample contains m words; then the word combinations in this training sample are w0w1, w1w2, w2w3, ..., w_{m-1}w_m and w0, w1, w2, ..., w_{m-1}. Note that because w0 is the 0-th word, i.e., a word that does not exist, the occurrence counts of w0 and w0w1 can be preset.
Note that, after the word combinations of each training sample are determined, they can also be filtered against a preset word set (also called a bag of words). Specifically, it can be judged whether each word combination is included in the preset word set, and if a word combination is not included, it is deleted. Introducing the bag of words here makes the n-gram language model more targeted while reducing the parameter space of the language model.
In a specific implementation, the word set may include the 8,105 Chinese characters of the Table of General Standard Chinese Characters issued by the State Council, the 10,000 most common English words, and all digit strings in the corpus after digit dimensionality reduction.
Step 240: count the number of occurrences of each word combination in each training sample.
It can be understood that, if word combinations have been deleted, this step counts only the occurrences of the word combinations remaining in each training sample after deletion.
Specifically, the number of occurrences of each word combination can be counted over all training samples in the corpus. In one implementation, a MapReduce framework may be used for distributed counting; on a corpus at the scale of tens of billions of samples, counting can be done with the ngramCount component of Alibaba Cloud's PAI platform.
Once each word combination and its count have been obtained, the training of the N-gram language model is complete.
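As a single-machine illustration of steps 230 and 240 for a bi-gram model (the embodiments use MapReduce or the PAI platform for this at scale; the sentence-start marker and the behavior of the bag-of-words filter here are simplifying assumptions):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NgramCounter {
    // Counts unigram and bi-gram word combinations, filtered by a bag of words.
    public static void count(List<List<String>> samples, Set<String> bagOfWords,
                             Map<String, Long> unigramCounts, Map<String, Long> bigramCounts) {
        for (List<String> words : samples) {
            String prev = "<s>"; // stands in for the non-existent word w0
            for (String w : words) {
                if (!bagOfWords.contains(w)) { // drop combinations outside the word set
                    prev = "<s>";
                    continue;
                }
                unigramCounts.merge(w, 1L, Long::sum);
                bigramCounts.merge(prev + " " + w, 1L, Long::sum);
                prev = w;
            }
        }
    }
}
```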
The setting process of the first function formula and the second function formula is described below with reference to Fig. 3. The process may include the following steps:
Step 310: select at least one sample word sequence of each training sample from the word combinations in the training sample.
A sample word sequence here is a word combination containing n words among the word combinations, where n is as described above. Taking the bi-gram language model as an example, with the word combinations in a training sample being w0w1, w1w2, w2w3, ..., w_{m-1}w_m and w0, w1, w2, ..., w_{m-1}: since n = 2, the selected sample word sequences of this training sample are w0w1, w1w2, w2w3, ..., w_{m-1}w_m.
Step 320: determine the probability score of each sample word sequence of each training sample according to the N-gram language model.
It can be understood that determining the probability score of a sample word sequence is similar to determining the probability score of a word sequence described above and is not repeated here.
Step 330: determine the average probability score and probability-score standard deviation of each training sample according to the probability scores of its sample word sequences and the number of sample word sequences contained in the training sample.
The average probability score of a training text is determined in a way similar to that of the text to be recognized, as is the probability-score standard deviation; neither is repeated here.
Step 340: sort the training samples according to their average probability scores and, respectively, their probability-score standard deviations.
That is, the training samples are sorted by their AvgScore values and by their SdScore values respectively. Specifically, the training samples can be sorted in ascending order of AvgScore: from formulas 3 and 4, AvgScore is negative, so the smaller a training sample's AvgScore, the more meaningless the sample can be considered. The training samples can be sorted in descending order of SdScore, because the probability-score standard deviation represents how much the sample word sequences in a training sample fluctuate; the larger the fluctuation, i.e., the larger the probability-score standard deviation, the more likely the sample is abnormal text.
Step 350: according to the sorting results, determine the first function formula for normalizing the average probability score, and determine the second function formula for normalizing the probability-score standard deviation.
For example, for AvgScore, whose theoretical maximum magnitude is 24, AvgScore can be divided by 24 and the absolute value taken, converting it into a positive number between 0 and 1; observation then shows that the cumulative distribution function of a Beta distribution with parameter 2 can be chosen as the first function formula for normalizing AvgScore, e.g., as shown in formula 6.
For SdScore, whose theoretical maximum is 7, SdScore can be divided by 7 to convert it into a positive number between 0 and 1; observation then shows that the cumulative distribution function of a Gamma distribution with parameters 5 and 1 can be chosen as the second function formula for normalizing SdScore, e.g., as shown in formula 7.
Note that the cumulative distribution functions of the Beta and Gamma distributions mentioned above can be computed with the open-source mathematical library Commons Math of the Apache Software Foundation.
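A sketch of the two normalizers with Commons Math 3; the Beta distribution is written here as Beta(2, 2) because the text fixes only one parameter at 2, so the second shape parameter is an assumption, as are the class and method names:

```java
import org.apache.commons.math3.distribution.BetaDistribution;
import org.apache.commons.math3.distribution.GammaDistribution;

public class Normalizers {
    // Assumed shapes Beta(2, 2); the text only specifies "parameter 2".
    private static final BetaDistribution BETA = new BetaDistribution(2.0, 2.0);
    // Gamma with shape 5 and scale 1, as selected by observation in the text.
    private static final GammaDistribution GAMMA = new GammaDistribution(5.0, 1.0);

    // Formula 6: F1 maps AvgScore into [0, 1] via |AvgScore / 24| and the Beta CDF.
    public static double f1(double avgScore) {
        double x = Math.min(1.0, Math.abs(avgScore / 24.0));
        return BETA.cumulativeProbability(x);
    }

    // Formula 7: F2 maps SdScore into [0, 1] via SdScore / 7 and the Gamma CDF.
    public static double f2(double sdScore) {
        double x = Math.min(1.0, sdScore / 7.0);
        return GAMMA.cumulativeProbability(x);
    }
}
```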
Of course, the above is only one way to determine the first function formula and the second function formula. In practical applications, after the average probability score and probability-score standard deviation of each training sample are determined, the two formulas can also be determined by simple linear normalization. That is, after steps 310-330 are performed, steps 340 and 350 can be skipped, and the first function formula can be determined directly, e.g., as min-max scaling:

F1(AvgScore_j) = (AvgScore_j - min_i AvgScore_i) / (max_i AvgScore_i - min_i AvgScore_i)    (formula 8)

where AvgScore_j is the average probability score currently being normalized, AvgScore_i is the average probability score of the i-th training sample, and n is the number of training samples in the corpus, with i ranging over 1, ..., n.
The second function formula can be determined directly in the same way:

F2(SdScore_j) = (SdScore_j - min_i SdScore_i) / (max_i SdScore_i - min_i SdScore_i)    (formula 9)

where SdScore_j is the probability-score standard deviation currently being normalized, SdScore_i is the probability-score standard deviation of the i-th training sample, and n is the number of training samples in the corpus.
Note that, to improve the efficiency of semantic-free text recognition, the model-training steps and the normalization-formula-determination steps of the present application can be executed in parallel. That is, after the training samples are preprocessed, the corpus can be divided into two sets, a training set and a test set; steps 230-240 are executed on the training samples in the training set, and steps 310-350 on the training samples in the test set. Of course, in practical applications, training samples containing random keyboard taps, split characters, or deformed content (e.g., sensitive words mutated with digit substitutions such as "fa伦g0ng") can also be added to the test set manually.
In summary, the three parts of the present application, namely model training, normalization-formula determination, and recognition of semantic-free text, can be implemented separately or in combination. When the three parts are combined, the technical solution of the present application can be as shown in Fig. 4. In Fig. 4, training samples are first collected into a corpus; they may be Chinese texts, English texts, or texts in other target languages. The training samples are then preprocessed, including removal of interference elements, traditional-to-simplified character conversion, digit dimensionality reduction, clause splitting, and similar operations. After preprocessing, the corpus can be divided into a training set and a test set; the samples in the training set are used to train the n-gram language model, and once the model is trained, the normalization formulas can be determined with its help. At this point the model-training and normalization-formula parts are complete, and texts can be recognized: a text to be recognized is first preprocessed, then scored with the trained language model and the determined normalization formulas, and finally judged to be semantic-free text or not according to the scoring result (i.e., the composite score).
In addition, the present application has the following advantages:
1) No manual labeling of the corpus samples is required, and abnormal content such as split characters and mutations can be recognized accurately; the solution is fast and can quickly recognize massive amounts of text to be recognized, improving both the efficiency and the accuracy of semantic-free text recognition.
2) The target-language training samples in the corpus can be organized, and the language model trained for the target languages; in this way, all text in non-target languages can be recognized as semantic-free text.
3) Not only text containing semantic elements can be recognized, but also low-quality text produced by random taps on input devices such as keyboards.
Corresponding to the above method for recognizing semantic-free text, an embodiment of the present application further provides a device for recognizing semantic-free text. As shown in Fig. 5, the device includes:
An acquiring unit 501, configured to obtain a text to be recognized.
A preprocessing unit 502, configured to preprocess the text to be recognized obtained by the acquiring unit 501.
A determining unit 503, configured to determine each word sequence of the text to be recognized preprocessed by the preprocessing unit 502.
The determining unit 503 is further configured to determine the probability score of each word sequence according to the N-gram language model.
The determining unit 503 is further configured to determine the average probability score and/or probability-score standard deviation of the text to be recognized according to the probability scores of the word sequences and the number of word sequences.
The determining unit 503 is further configured to determine the composite score of the text to be recognized according to the average probability score and/or the probability-score standard deviation.
A recognizing unit 504, configured to recognize the text to be recognized as semantic-free text when the composite score determined by the determining unit 503 satisfies a preset condition.
Optionally, the preprocessing unit 502 may specifically be configured to:
remove interference elements from the text to be recognized;
convert traditional Chinese characters in the text to be recognized into simplified characters;
convert digit strings in the text to be recognized into a predetermined format; and
split the text to be recognized into clauses.
Optionally, the device may further include:
A training unit 505, configured to obtain a training sample set including at least one training sample.
A training sample may include Chinese text, English text, and/or text in another target language; the text may include news content, blog content, forum content, and/or chat content.
The training unit 505 is further configured to preprocess each training sample in the training sample set;
determine, for each preprocessed training sample, each word combination in the training sample; and
count the number of occurrences of each word combination in each training sample.
The word combinations and their occurrence counts constitute the N-gram language model.
Optionally, the device may further include:
A judging unit 506, configured to judge, for each word combination, whether the word combination is included in a preset word set.
A deleting unit 507, configured to delete the word combination if the judging unit 506 judges that it is not included in the preset word set.
Optionally, counting the number of occurrences of each word combination in each training sample includes:
counting the number of occurrences of each word combination remaining in each training sample after deletion.
Optionally, the determining unit 503 may further be specifically configured to:
normalize the average probability score to obtain a first processing result;
normalize the probability-score standard deviation to obtain a second processing result;
compare the first processing result with the second processing result;
when the first processing result is larger, determine the composite score of the text to be recognized according to the first processing result; and
when the second processing result is larger, determine the composite score of the text to be recognized according to the second processing result.
Determining the composite score of the text to be recognized according to the first processing result includes:
amplifying the first processing result, and determining the amplified first processing result as the composite score of the text to be recognized.
Determining the composite score of the text to be recognized according to the second processing result includes:
amplifying the second processing result, and determining the amplified second processing result as the composite score of the text to be recognized.
Optionally, the device may further include:
A selecting unit 508, configured to select at least one sample word sequence of each training sample from the word combinations in the training sample.
The determining unit 503 is further configured to determine the probability score of each sample word sequence of each training sample according to the N-gram language model.
The determining unit 503 is further configured to determine the average probability score and probability-score standard deviation of each training sample according to the probability scores of its sample word sequences and the number of sample word sequences contained in the training sample.
A sorting unit 509, configured to sort the training samples according to the average probability scores and, respectively, the probability-score standard deviations determined by the determining unit 503.
The determining unit 503 is further configured to determine, according to the sorting results of the sorting unit 509, the first function formula for normalizing the average probability score and the second function formula for normalizing the probability-score standard deviation.
Optionally, normalizing the average probability score includes:
normalizing the average probability score according to the first function formula.
Normalizing the probability-score standard deviation includes:
normalizing the probability-score standard deviation according to the second function formula.
The functions of the functional modules of the device in this embodiment of the present application can be implemented through the steps of the above method embodiment; therefore, the specific working process of the device provided by the present application is not repeated here.
With the method and device for recognizing semantic-free text provided by the present application, the acquiring unit 501 obtains a text to be recognized, and the preprocessing unit 502 preprocesses it. The determining unit 503 determines each word sequence of the preprocessed text, determines the probability score of each word sequence according to the N-gram language model, determines the average probability score and/or probability-score standard deviation of the text from the probability scores of the word sequences and the number of word sequences, and determines the composite score of the text from the average probability score and/or the probability-score standard deviation. When the composite score satisfies a preset condition, the recognizing unit 504 recognizes the text as semantic-free text. In this way, the accuracy and comprehensiveness of semantic-free text recognition can be improved.
Persons skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific implementations described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific implementations of the present invention and are not intended to limit the protection scope of the present invention; any modification, equivalent replacement, improvement, and the like made on the basis of the technical solutions of the present invention shall fall within the protection scope of the present invention.

Claims (16)

1. A method for recognizing semantic-free text, comprising:
obtaining a text to be recognized;
preprocessing the text to be recognized;
determining each word sequence of the preprocessed text to be recognized;
determining a probability score of each word sequence according to an N-gram language model;
determining an average probability score and/or a probability-score standard deviation of the text to be recognized according to the probability scores of the word sequences and the number of the word sequences;
determining a composite score of the text to be recognized according to the average probability score and/or the probability-score standard deviation; and
recognizing the text to be recognized as semantic-free text when the composite score satisfies a preset condition.
2. The method according to claim 1, wherein preprocessing the text to be recognized comprises any one or more of the following steps:
removing interference elements from the text to be recognized;
converting traditional Chinese characters in the text to be recognized into simplified characters;
converting digit strings in the text to be recognized into a predetermined format; and
splitting the text to be recognized into clauses.
3. The method according to claim 1 or 2, further comprising training the N-gram language model, which comprises:
obtaining a training sample set, the training sample set comprising at least one training sample;
preprocessing each training sample in the training sample set;
determining, for each preprocessed training sample, each word combination in the training sample; and
counting the number of occurrences of each word combination in each training sample;
wherein the word combinations and the numbers of occurrences constitute the N-gram language model.
4. The method according to claim 3, wherein the training samples comprise:
Chinese text, English text, and/or text in another target language; the text comprising news content, blog content, forum content, and/or chat content.
5. The method according to claim 3, further comprising, after determining each word combination in the training sample:
judging, for each word combination, whether the word combination is included in a preset word set, and deleting the word combination if it is not included in the preset word set;
wherein counting the number of occurrences of each word combination in each training sample comprises:
counting the number of occurrences of each word combination remaining in each training sample after deletion.
6. The method according to claim 3, wherein determining the composite score of the text to be recognized according to the average probability score and the probability-score standard deviation comprises:
normalizing the average probability score to obtain a first processing result;
normalizing the probability-score standard deviation to obtain a second processing result;
comparing the first processing result with the second processing result;
when the first processing result is larger, determining the composite score of the text to be recognized according to the first processing result; and
when the second processing result is larger, determining the composite score of the text to be recognized according to the second processing result.
7. The method according to claim 6, wherein:
determining the composite score of the text to be recognized according to the first processing result comprises:
amplifying the first processing result, and determining the amplified first processing result as the composite score of the text to be recognized; and
determining the composite score of the text to be recognized according to the second processing result comprises:
amplifying the second processing result, and determining the amplified second processing result as the composite score of the text to be recognized.
8. The method according to claim 6 or 7, further comprising:
selecting at least one sample word sequence of each training sample from the word combinations in the training sample;
determining a probability score of each sample word sequence of each training sample according to the N-gram language model;
determining an average probability score and a probability-score standard deviation of each training sample according to the probability scores of its sample word sequences and the number of sample word sequences contained in the training sample;
sorting the training samples according to their average probability scores and, respectively, their probability-score standard deviations; and
determining, according to the sorting results, a first function formula for normalizing the average probability score, and a second function formula for normalizing the probability-score standard deviation;
wherein normalizing the average probability score comprises:
normalizing the average probability score according to the first function formula; and
normalizing the probability-score standard deviation comprises:
normalizing the probability-score standard deviation according to the second function formula.
9. A device for recognizing semantic-free text, comprising:
an acquiring unit, configured to obtain a text to be recognized;
a preprocessing unit, configured to preprocess the text to be recognized obtained by the acquiring unit;
a determining unit, configured to determine each word sequence of the text to be recognized preprocessed by the preprocessing unit;
the determining unit being further configured to determine a probability score of each word sequence according to an N-gram language model;
the determining unit being further configured to determine an average probability score and/or a probability-score standard deviation of the text to be recognized according to the probability scores of the word sequences and the number of the word sequences;
the determining unit being further configured to determine a composite score of the text to be recognized according to the average probability score and/or the probability-score standard deviation; and
a recognizing unit, configured to recognize the text to be recognized as semantic-free text when the composite score determined by the determining unit satisfies a preset condition.
10. The device according to claim 9, wherein the preprocessing unit is specifically configured to:
remove interference elements from the text to be recognized;
convert traditional Chinese characters in the text to be recognized into simplified characters;
convert digit strings in the text to be recognized into a predetermined format; and
split the text to be recognized into clauses.
11. The device according to claim 9 or 10, further comprising:
a training unit, configured to obtain a training sample set comprising at least one training sample;
preprocess each training sample in the training sample set;
determine, for each preprocessed training sample, each word combination in the training sample; and
count the number of occurrences of each word combination in each training sample;
wherein the word combinations and the numbers of occurrences constitute the N-gram language model.
12. The device according to claim 11, wherein the training samples comprise:
Chinese text, English text, and/or text in another target language; the text comprising news content, blog content, forum content, and/or chat content.
13. The device according to claim 11, further comprising:
a judging unit, configured to judge, for each word combination, whether the word combination is included in a preset word set; and
a deleting unit, configured to delete the word combination if the judging unit judges that the word combination is not included in the preset word set;
wherein counting the number of occurrences of each word combination in each training sample comprises:
counting the number of occurrences of each word combination remaining in each training sample after deletion.
14. The device according to claim 11, wherein the determining unit is specifically configured to:
normalize the average probability score to obtain a first processing result;
normalize the probability-score standard deviation to obtain a second processing result;
compare the first processing result with the second processing result;
when the first processing result is larger, determine the composite score of the text to be recognized according to the first processing result; and
when the second processing result is larger, determine the composite score of the text to be recognized according to the second processing result.
15. The device according to claim 14, characterized in that
determining the composite score value of the text to be identified according to the first processing result comprises:
amplifying the first processing result, and taking the amplified first processing result as the composite score value of the text to be identified; and
determining the composite score value of the text to be identified according to the second processing result comprises:
amplifying the second processing result, and taking the amplified second processing result as the composite score value of the text to be identified.
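Claim 15 leaves the amplification unspecified; multiplying the winning processing result by a constant factor is one conceivable reading, shown here purely as a placeholder:

    def amplify(result, factor=10.0):
        # Hypothetical amplification: scale the selected processing result.
        return result * factor

    # composite = amplify(composite_score(avg, std, norm_avg, norm_std))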
16. The device according to claim 14 or 15, characterized in that it further comprises:
a selection unit, configured to select, from the word combinations in each training sample, at least one sample word sequence for that training sample;
the determination unit is further configured to determine the probability score value of each sample word sequence of each training sample according to the N-gram language model;
the determination unit is further configured to determine the average probability score value and the probability score standard deviation of each training sample according to the probability score values of its sample word sequences and the number of sample word sequences it contains;
a sorting unit, configured to sort the training samples according to the average probability score values determined by the determination unit and, respectively, according to the probability score standard deviations; and
the determination unit is further configured to determine, according to the sorting results of the sorting unit, a first function formula for normalizing the average probability score value, and a second function formula for normalizing the probability score standard deviation;
wherein normalizing the average probability score value comprises:
normalizing the average probability score value according to the first function formula; and
normalizing the probability score standard deviation comprises:
normalizing the probability score standard deviation according to the second function formula.
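A guess at how the first and second function formulas could be fitted: score every training sample, sort the resulting statistics, and map their observed range onto [0, 1]. Min-max scaling is only one of many mappings the claim would cover, and the names below are illustrative:

    def fit_normalizer(values):
        # values: the average probability score values (or the standard
        # deviations) of all training samples, one per sample.
        ordered = sorted(values)
        lo, hi = ordered[0], ordered[-1]
        span = (hi - lo) or 1.0  # guard against a degenerate range
        return lambda x: (x - lo) / span

    # norm_avg = fit_normalizer(per_sample_avg_scores)   # first function formula
    # norm_std = fit_normalizer(per_sample_std_devs)     # second function formula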
CN201710182218.7A 2017-03-24 2017-03-24 Semantic-free text recognition method and device Active CN108628822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710182218.7A CN108628822B (en) 2017-03-24 2017-03-24 Semantic-free text recognition method and device

Publications (2)

Publication Number Publication Date
CN108628822A true CN108628822A (en) 2018-10-09
CN108628822B CN108628822B (en) 2021-12-07

Family

ID=63707631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710182218.7A Active CN108628822B (en) 2017-03-24 2017-03-24 Semantic-free text recognition method and device

Country Status (1)

Country Link
CN (1) CN108628822B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090119235A1 (en) * 2006-05-31 2009-05-07 International Business Machines Corporation System and method for extracting entities of interest from text using n-gram models
CN103605690A (en) * 2013-11-04 2014-02-26 北京奇虎科技有限公司 Device and method for recognizing advertising messages in instant messaging
CN103942191A (en) * 2014-04-25 2014-07-23 中国科学院自动化研究所 Horrific text recognizing method based on content

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681670A (en) * 2019-02-25 2020-09-18 北京嘀嘀无限科技发展有限公司 Information identification method and device, electronic equipment and storage medium
CN111681670B (en) * 2019-02-25 2023-05-12 北京嘀嘀无限科技发展有限公司 Information identification method, device, electronic equipment and storage medium
CN110493088A (en) * 2019-09-24 2019-11-22 国家计算机网络与信息安全管理中心 A kind of mobile Internet traffic classification method based on URL
CN110493088B (en) * 2019-09-24 2021-06-01 国家计算机网络与信息安全管理中心 Mobile internet traffic classification method based on URL
CN112633009A (en) * 2020-12-29 2021-04-09 扬州大学 Identification method for random combination uploading field
CN113094543A (en) * 2021-04-27 2021-07-09 杭州网易云音乐科技有限公司 Music authentication method, device, equipment and medium
CN113378541A (en) * 2021-05-21 2021-09-10 标贝(北京)科技有限公司 Text punctuation prediction method, device, system and storage medium
CN113378541B (en) * 2021-05-21 2023-07-07 标贝(北京)科技有限公司 Text punctuation prediction method, device, system and storage medium
CN113657118A (en) * 2021-08-16 2021-11-16 北京好欣晴移动医疗科技有限公司 Semantic analysis method, device and system based on call text
CN113657118B (en) * 2021-08-16 2024-05-14 好心情健康产业集团有限公司 Semantic analysis method, device and system based on call text
CN115374779A (en) * 2022-10-25 2022-11-22 北京海天瑞声科技股份有限公司 Text language identification method, device, equipment and medium
CN115374779B (en) * 2022-10-25 2023-01-10 北京海天瑞声科技股份有限公司 Text language identification method, device, equipment and medium

Also Published As

Publication number Publication date
CN108628822B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN108628822A (en) Recognition methods without semantic text and device
US10643182B2 (en) Resume extraction based on a resume type
JP5379138B2 (en) Creating an area dictionary
CN110287309B (en) Method for quickly extracting text abstract
CN107506389B (en) Method and device for extracting job skill requirements
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN110019742B (en) Method and device for processing information
CN103869998B (en) Method and device for ranking candidate items generated by an input method
CN107688630B (en) Semantics-based weakly supervised microblog multi-emotion dictionary expansion method
US10474747B2 (en) Adjusting time dependent terminology in a question and answer system
CN111930792B (en) Labeling method and device for data resources, storage medium and electronic equipment
CN109508441B (en) Method and device for realizing data statistical analysis through natural language and electronic equipment
CN106708940A (en) Method and device used for processing pictures
CN113220835A (en) Text information processing method and device, electronic equipment and storage medium
CN112328735A (en) Hot topic determination method and device and terminal equipment
CN113743090A (en) Keyword extraction method and device
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
CN115860009B (en) Sentence embedding method and system for contrast learning by introducing auxiliary sample
Criscuolo et al. Discriminating between similar languages with word-level convolutional neural networks
WO2019192122A1 (en) Document topic parameter extraction method, product recommendation method and device, and storage medium
CN113627722B (en) Simple answer scoring method based on keyword segmentation, terminal and readable storage medium
CN114842982A (en) Knowledge expression method, device and system for medical information system
CN112784599B (en) Method and device for generating poem, electronic equipment and storage medium
CN114357996A (en) Time sequence text feature extraction method and device, electronic equipment and storage medium
CN112817996A (en) Illegal keyword library updating method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201016

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201016

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant
GR01 Patent grant