CN108628822B - Semantic-free text recognition method and device - Google Patents

Semantic-free text recognition method and device

Info

Publication number
CN108628822B
Authority
CN
China
Prior art keywords
text
word
probability score
recognized
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710182218.7A
Other languages
Chinese (zh)
Other versions
CN108628822A (en)
Inventor
江南
祝慧佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201710182218.7A priority Critical patent/CN108628822B/en
Publication of CN108628822A publication Critical patent/CN108628822A/en
Application granted granted Critical
Publication of CN108628822B publication Critical patent/CN108628822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The application relates to the technical field of computers, and in particular to a method and a device for recognizing semantic-free text. A text to be recognized is acquired and preprocessed. Each word sequence of the preprocessed text to be recognized is determined, and the probability score value of each word sequence is determined according to an N-gram language model. An average probability score value and/or a probability score standard deviation value of the text to be recognized is determined according to the probability score values of the word sequences and the number of word sequences. A comprehensive score value of the text to be recognized is determined according to the average probability score value and/or the probability score standard deviation value. When the comprehensive score value meets a preset condition, the text to be recognized is identified as semantic-free text. In this way, the accuracy and comprehensiveness of semantic-free text recognition can be improved.

Description

Semantic-free text recognition method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for recognizing a semantic-free text.
Background
In the conventional technology, the semantic-free text is mainly recognized by the following two methods:
The first method is supervised machine learning: some semantic features of semantic-free text, such as variant words and special symbols, are collected manually in advance, samples in a corpus are labeled as semantic-free or not, a recognition model is then trained using the corpus and the semantic features, and finally the recognition model is used to determine whether an input text is semantic-free. However, with this method, if the input text contains semantic features that were not collected, or contains deformed semantic features, it cannot be recognized as semantic-free text, which affects the accuracy of semantic-free text recognition. In addition, this method usually requires a great deal of manpower to label the samples in the corpus, which affects the efficiency of semantic-free text recognition.
The second method is similarity calculation: content texts reported by users or obtained through other channels are archived to generate a sample library, and the similarity between the input text and the content texts in the sample library is then calculated to determine whether the input text is semantic-free. However, this method can only identify content texts that have already appeared and cannot identify new types of content text. In today's era of rapidly expanding information it is impossible to enumerate all content texts manually, so this method cannot cover all input texts; that is, the second method cannot recognize semantic-free text comprehensively.
Disclosure of Invention
The application describes a method and a device for recognizing a semantic-free text, which can improve the accuracy and comprehensiveness of recognizing the semantic-free text.
In a first aspect, a method for identifying a semantic-free text is provided, which includes:
acquiring a text to be identified;
preprocessing the text to be recognized;
determining each word sequence of the preprocessed text to be recognized;
determining the probability score value of each word sequence according to the N-gram language model;
determining an average probability score value and/or a probability score standard difference value of the text to be recognized according to the probability score value of each word sequence and the number of the word sequences;
determining a comprehensive score value of the text to be recognized according to the average probability score value and/or the probability score standard deviation value;
and when the comprehensive score value meets a preset condition, identifying the text to be identified as a semantic-free text.
In a second aspect, an apparatus for recognizing semantic-free text is provided, which includes:
the acquiring unit is used for acquiring a text to be recognized;
the preprocessing unit is used for preprocessing the text to be recognized acquired by the acquiring unit;
the determining unit is used for determining each word sequence of the text to be recognized after the preprocessing unit preprocesses the text;
the determining unit is further configured to determine a probability score value of each word sequence according to an N-gram language model;
the determining unit is further configured to determine an average probability score value and/or a standard deviation of the probability scores of the texts to be recognized according to the probability score values of the word sequences and the number of the word sequences;
the determining unit is further configured to determine a comprehensive score value of the text to be recognized according to the average probability score value and/or the standard deviation of the probability score;
and the identification unit is used for identifying the text to be identified as the semanteme-free text when the comprehensive score value determined by the determination unit meets a preset condition.
With the semantic-free text recognition method and device provided by the present application, a text to be recognized is acquired and preprocessed. Each word sequence of the preprocessed text to be recognized is determined, and the probability score value of each word sequence is determined according to an N-gram language model. An average probability score value and/or a probability score standard deviation value of the text to be recognized is determined according to the probability score values of the word sequences and the number of word sequences. A comprehensive score value of the text to be recognized is determined according to the average probability score value and/or the probability score standard deviation value. When the comprehensive score value meets a preset condition, the text to be recognized is identified as semantic-free text. In this way, the accuracy and comprehensiveness of semantic-free text recognition can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a method for recognizing semantic-free text according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of model training provided herein;
FIG. 3 is a flow chart of a method for determining a normalization formula provided herein;
FIG. 4 is a schematic diagram of a semantic-free text recognition method provided in the present application;
fig. 5 is a schematic diagram of a semantic-free text recognition apparatus according to another embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
The semantic-free text recognition method of the present application is suitable for scenarios in which low-quality text (also called semantic-free text or junk text) needs to be recognized. Low-quality text here includes, but is not limited to, the following: text generated by randomly striking an input device such as a keyboard; text containing prohibited content such as pornographic, violent, or politically sensitive material; and text in non-target languages (e.g., Japanese, Korean, Russian, etc.).
Fig. 1 is a flowchart of a semantic-free text recognition method according to an embodiment of the present application. The execution subject of the method may be a device with processing capabilities. As shown in fig. 1, the method may specifically include:
and step 110, acquiring a text to be recognized.
The text to be recognized can be a Chinese text or an English text. Of course, the text may also be in another target language (Korean, Japanese, etc.).
And step 120, preprocessing the text to be recognized.
Here, preprocessing the text to be recognized includes any one or more of the following steps. (1) Removing interference elements from the text to be recognized. The interference elements may include emoticons (e.g., emoji stored as unicode, character emoticons, etc.), Uniform Resource Locator (url) addresses, and the like. In one implementation, regular expressions may be used to remove the interference elements from the text to be recognized. (2) Converting traditional Chinese characters in the text to be recognized into simplified characters. For example, the traditional characters may be converted into simplified characters according to a traditional-simplified correspondence lexicon, which reduces the size of the parameter space of the language model. (3) Converting numeric character strings in the text to be recognized into a preset format. A number here may be, for example, a telephone number or an amount of money in the text to be recognized. In one implementation, numeric dimension reduction can be achieved by extracting each numeric character string with a regular expression and converting it into the form "Num + {length of the numeric character string}", where "Num" indicates that the current character string is numeric. For example, the numeric string "123456" may be converted to Num6, and the numeric string "1111111111" may be converted to Num10. (4) Splitting the text to be recognized into clauses. For example, punctuation (e.g., commas, periods, question marks, etc.) may be used to split the text to be recognized into clauses; in one example, the clause splitting can be done with built-in functions of the Java language. It should be noted that splitting the text into clauses makes distributed processing by a computer possible.
It should be noted that the four preprocessing steps above can be combined flexibly according to the actual situation. For example, when the text to be recognized is an English text, since English text contains no traditional Chinese characters, only steps (1), (3), and (4) may be performed. In addition, each step can also be adapted flexibly to the actual situation; for example, if urls in the input text need to be recognized as well, the step of removing url addresses may be skipped.
It should be further noted that the preprocessing in the present application is not limited to the above four steps, and may also include other steps, such as word segmentation, and the like, which is not limited in the present application.
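As a rough illustration of the preprocessing described above, the following Python sketch removes url addresses and character emoticons with regular expressions, performs numeric dimension reduction into the "Num + {length}" format, and splits the text into clauses on punctuation. The specific regular expressions and the function name preprocess are assumptions for illustration only, and the traditional-to-simplified conversion step is omitted because it requires a correspondence lexicon.

```python
import re

def preprocess(text):
    # (1) remove interference elements such as url addresses and character emoticons
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"\^_*\^", "", text)          # illustrative emoticon pattern
    # (2) traditional-to-simplified conversion would need a correspondence lexicon; omitted here
    # (3) numeric dimension reduction: replace each digit string with "Num" + its length
    text = re.sub(r"\d+", lambda m: "Num" + str(len(m.group(0))), text)
    # (4) split into clauses on common Chinese/English punctuation
    return [c.strip() for c in re.split(r"[,.?!;，。？！；]", text) if c.strip()]

# preprocess("Call 123456 now! See http://x.example ^_^")  ->  ["Call Num6 now", "See"]
```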
And step 130, determining each word sequence of the preprocessed text to be recognized.
Specifically, the word combinations in the text to be recognized may be determined first according to the N-gram language model, and then the word combinations containing n words are selected from them, where n is 1 when the N-gram language model is a unigram, n is 2 when it is a bi-gram, n is 3 when it is a tri-gram, and so on. The word combinations containing n words are taken as the word sequences of the text to be recognized. The process of determining the word combinations in the text to be recognized is described below in the description of the process of training the N-gram language model.
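A minimal sketch of how the word sequences of step 130 might be extracted from an already tokenized clause is given below; the function name and the tuple representation are illustrative assumptions, not part of the patent.

```python
def word_sequences(words, n=2):
    # sliding window of n consecutive words: n=1 unigram, n=2 bi-gram, n=3 tri-gram
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# word_sequences(["w1", "w2", "w3"], n=2) -> [("w1", "w2"), ("w2", "w3")]
```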
And step 140, determining probability score values of the word sequences according to the N-gram language model.
The n-gram language model is a statistical language model that can be generated based on large-scale training samples. It is based on the assumption that the occurrence of the nth word depends only on the preceding n-1 words and not on any other words. Therefore, the probability score value of the occurrence of the nth word can be calculated from the number of times the preceding n-1 words occur together in the corpus. The probability score value of the occurrence of a sentence is the product of the probability score values of the occurrences of its words. For the n-gram language model, when n = 1, the model may be referred to as a unigram; when n = 2, a bi-gram; when n = 3, a tri-gram; and so on.
The above can be expressed with the following mathematical formula:
$$p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})} \qquad \text{(formula 1)}$$
where $w_i$ is the ith word; $w_{i-n+1}, \ldots, w_{i-1}$ are the n-1 words preceding the ith word; $p(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ is the probability score value of the occurrence of the ith word; $C(w_{i-n+1}, \ldots, w_i)$ is the number of times the n words ending with the ith word appear in the corpus simultaneously; and $C(w_{i-n+1}, \ldots, w_{i-1})$ is the number of times the n-1 words preceding the ith word appear in the corpus simultaneously.
For a sentence T formed by the words $w_1, w_2, w_3, \ldots, w_m$, the probability score value of T occurring under the n-gram language model is:
$$p(T) = \prod_{i=1}^{m} p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \qquad \text{(formula 2)}$$
Because the parameter space of the n-gram language model is very large (theoretically exponential in the number of words), in practice low-frequency words are usually filtered out according to the 80/20 rule, or a dictionary is configured in advance to limit which words are counted when word occurrence frequencies are collected. Meanwhile, when calculating the probability score value of an entire sentence, smoothing algorithms such as Laplace (add-one) smoothing, Kneser-Ney, or Stupid Backoff may be adopted to ensure that the probability score of the whole sentence does not become 0 simply because one word has never been observed (i.e., has probability 0). Finally, because the probability value of an entire sentence is extremely small (reaching many places after the decimal point), a mathematical transformation such as a logarithmic function is applied to amplify the numerical values.
In one example, the problem of calculating the probability score value of a sentence is converted into accumulating the logarithmic probability score values of the individual word sequences, each of which is less than 0 (since the probabilities are less than 1, their logarithms are negative). The formula is as follows:
$$\mathrm{Score}(T) = \sum_{i=1}^{m} \log p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \qquad \text{(formula 3)}$$
where T is the whole sentence, Score(T) is the probability score value of the occurrence of sentence T, and $w_1, w_2, \ldots, w_m$ are the m words in sentence T.
The N-gram language model may be trained prior to performing step 140. The specific training process is described subsequently.
In one example, the N-gram language model may be expressed in the following format:
"number of times a word combination appears in a corpus by word combination spacing"
For example, it can be expressed as:
unigram:
w1 10
w2 10
w3 10
……
wn 10
bi-gram:
w1w2 5
w1w3 5
w2w3 5
……
wn-1wn 10
tri-gram:
w1w2w3 2
w1w3w4 3
w2w3wn 5
……
wn-2wn-1wn 8
as can be seen from the above description, each language model is described in this example.
Returning to step 140, step 140 may specifically be: determining the probability score values of the respective word sequences according to formula 1. Taking the determination of the probability score of the word sequence $w_1 w_2$ as an example, the probability score of this word sequence is:
$$p(w_2 \mid w_1) = \frac{C(w_1 w_2)}{C(w_1)}$$
where $p(w_2 \mid w_1)$ is the probability score value of the word sequence $w_1 w_2$, $C(w_1 w_2)$ is the number of occurrences of the word combination $w_1 w_2$, and $C(w_1)$ is the number of occurrences of the word combination $w_1$. Because the above N-gram language model already records the numbers of occurrences of the word combinations $w_1 w_2$ and $w_1$, the probability score value of the word sequence $w_1 w_2$ can be determined directly. The probability score values of all word sequences can be determined in the same way.
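The following sketch illustrates step 140 against count tables in the format listed above: it looks up the word-combination counts, applies add-one (Laplace) smoothing, one of the smoothing options mentioned earlier and chosen here only for brevity, and sums the logarithms to obtain Score(T) per formula 3. The function names, the tuple-keyed count tables, and the vocabulary-size smoothing term are assumptions for illustration.

```python
import math

def sequence_log_prob(seq, ngram_counts, context_counts, vocab_size):
    # formula 1 with add-one (Laplace) smoothing so unseen sequences do not give log(0)
    numerator = ngram_counts.get(seq, 0) + 1
    denominator = context_counts.get(seq[:-1], 0) + vocab_size
    return math.log(numerator / denominator)

def text_score(sequences, ngram_counts, context_counts, vocab_size):
    # formula 3: Score(T) is the sum of the per-word-sequence log probabilities
    return sum(sequence_log_prob(s, ngram_counts, context_counts, vocab_size)
               for s in sequences)

# with the bi-gram tables listed above, e.g. bigrams = {("w1", "w2"): 5}, unigrams = {("w1",): 10}:
# sequence_log_prob(("w1", "w2"), bigrams, unigrams, vocab_size=10000)  # roughly log(6/10010)
```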
And 150, determining an average probability score value and/or a probability score standard deviation value of the text to be recognized according to the probability score value of each word sequence and the number of the word sequences.
In one example, the average probability score value may be calculated according to formula 4.

$$\mathrm{AvgScore}(T) = \frac{\mathrm{Score}(T)}{\mathrm{Count}(T)} \qquad \text{(formula 4)}$$
where T is the text to be recognized, AvgScore(T) is the average probability score value of the text T, Score(T) is the probability score value of the text T, and Count(T) is the number of word sequences contained in the text T. Score(T) can be calculated according to formula 3, i.e., by summing the logarithms of the probability scores of the word sequences of T.
It should be noted that, in the present application, the open-source mathematical library Commons Math of the Apache Software Foundation may be used to calculate the probability score standard deviation value SdScore of the text to be recognized.
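A minimal sketch of step 150 under the assumptions above: the average of the per-word-sequence log probabilities gives AvgScore(T) (formula 4), and the standard deviation of the same values gives SdScore(T). Python's statistics module stands in for the Apache Commons Math library mentioned in the text, and the use of the population standard deviation is an assumption.

```python
import statistics

def avg_and_sd_score(per_sequence_scores):
    # AvgScore(T) = Score(T) / Count(T), formula 4
    avg_score = sum(per_sequence_scores) / len(per_sequence_scores)
    # SdScore(T): standard deviation of the per-word-sequence log probabilities
    sd_score = statistics.pstdev(per_sequence_scores)
    return avg_score, sd_score
```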
And step 160, determining a comprehensive score value of the text to be recognized according to the average probability score value and/or the standard deviation value of the probability score.
It should be noted that step 160 includes three cases: in the first case, the value of the composite score of the text to be recognized is determined according to the average probability score value. Specifically, the average probability score value may be directly determined as the comprehensive score value of the text to be recognized; or firstly, carrying out normalization processing on the average probability score value according to a preset first function formula to obtain a first processing result, and determining the first processing result as a comprehensive score value of the text to be recognized; it may also be that after the first processing result is obtained, the first processing result may be enlarged (for example, enlarged by 100 times), and the enlarged first processing result is determined as the composite score value of the text to be recognized. And in the second case, determining a comprehensive score value of the text to be recognized according to the standard difference value of the probability scores. Specifically, the probability score standard difference may be directly determined as a comprehensive score value of the text to be recognized; or firstly, carrying out normalization processing on the probability score standard difference value according to a preset second function formula to obtain a second processing result, and determining the second processing result as a comprehensive score value of the text to be recognized; after the second processing result is obtained, the second processing result may be enlarged (for example, enlarged by 100 times), and the enlarged second processing result may be determined as the composite score value of the text to be recognized. And in the third situation, determining a comprehensive score value of the text to be recognized according to the average probability score value and the standard deviation value of the probability score. In a specific implementation manner, the third case can be realized by the following steps:
and step A, carrying out normalization processing on the average probability fraction value to obtain a first processing result.
Here, the average probability score value may be normalized according to a preset first function formula.
And B, carrying out normalization processing on the probability score standard difference value to obtain a second processing result.
Similarly, the probability score standard deviation value may be normalized according to a preset second function formula.
It should be noted that the setting process of the first function formula and the second function formula is described later.
And C, comparing the first processing result with the second processing result.
And D, when the first processing result is larger, determining the comprehensive score value of the text to be recognized according to the first processing result.
In a specific implementation manner, the first processing result can be directly determined as the comprehensive score value of the text to be recognized. In another specific implementation, the first processing result may be amplified (for example, amplified by 100 times). And then determining the first processing result after the amplification processing as a comprehensive score value of the text to be recognized.
And E, when the second processing result is larger, determining the comprehensive score value of the text to be recognized according to the second processing result.
In a specific implementation manner, the second processing result can be directly determined as the comprehensive score value of the text to be recognized. In another specific implementation, the second processing result may be amplified (for example, amplified by 100 times). And then determining the second processing result after the amplification processing as the comprehensive score value of the text to be recognized.
In one example, when the first processing result or the second processing result is enlarged by 100 times, the above steps C to E may be expressed as formula 5.
$$\mathrm{GloScore}(S) = 100 \times \max\big(F_1(\mathrm{AvgScore}(S)),\ F_2(\mathrm{SdScore}(S))\big) \qquad \text{(formula 5)}$$
where S is the text to be recognized, GloScore(S) is the comprehensive score value of the text to be recognized, $F_1(\mathrm{AvgScore}(S))$ is the first processing result, and $F_2(\mathrm{SdScore}(S))$ is the second processing result.
And 170, when the comprehensive score value meets a preset condition, identifying the text to be identified as the text without semantic meaning.
In one implementation, the threshold may be preset. Specifically, when the composite score value is smaller than the threshold value, it is determined that the composite score value satisfies a preset condition, so that the text to be recognized can be recognized as a semantic-free text.
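Steps 160 and 170 can be sketched as follows, with the normalization functions F1 and F2 passed in as parameters (their concrete forms are discussed later in connection with formulas 6 to 9). The threshold comparison follows the preset condition described above; the function names are illustrative assumptions.

```python
def composite_score(avg_score, sd_score, f1, f2):
    # formula 5: GloScore(S) = 100 * max(F1(AvgScore(S)), F2(SdScore(S)))
    return 100 * max(f1(avg_score), f2(sd_score))

def is_semantic_free(glo_score, threshold):
    # step 170: the preset condition used here is "comprehensive score below the threshold"
    return glo_score < threshold
```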
It should be noted that the semantic-free text is identified here on the basis of two statistics, the average probability score value and/or the probability score standard deviation value of the text, together with the idea of normalizing these two values and amplifying the result. This solves a problem of the conventional technology: when a text is long, it contains many word sequences, so the calculated probability score value is very small and the text cannot be identified accurately. It also addresses the problem that, in the conventional technology, potentially risky users can bypass the recognition model by splitting or deforming words and thereby publish semantic-free text; even if the recognition model is updated afterwards, the update lags behind, so the semantic-free text cannot be intercepted in time, causing adverse social effects.
The training process for the N-gram language model will be described below with reference to FIG. 2. In fig. 2, the following steps may be included:
step 210, a training sample set is obtained.
The set of training samples here, which may also be referred to as a corpus, may include at least one training sample. The training samples may be texts in Chinese, English and/or other target languages collected in advance, manually or by a server, from webpages or from the server itself, and the texts may include news content, blog content, forum content, and/or chat content, among others. In order to cover as comprehensively as possible the general semantic features that natural persons use when exchanging information, the corpus is preferably large; for example, in a particular implementation, the size of the corpus can reach 200 billion. In addition, in order to recognize text in non-target languages, or in other words to avoid recognizing text in a target language (e.g., normal English sentences) as low-quality text, training samples of the target language may be added to the corpus at a certain scale to ensure that the word sequences of the target language are sufficient and occur frequently enough. For example, in a specific implementation, one billion training samples in the target language English can be added to the corpus.
Step 220, preprocessing each training sample in the training sample set.
Here, the process of preprocessing the training sample is similar to the process of preprocessing the text to be recognized, which may be any combination of the four steps in step 120, and this application is not repeated herein.
And step 230, determining each word combination in the training samples for each preprocessed training sample.
A word combination herein may include at least one word.
In one implementation, the word combinations in the training sample can be determined according to the n-gram language model; specifically, they may be determined according to formula 1. In formula 1, when calculating the probability score value of the occurrence of the ith word, both $C(w_{i-n+1}, \ldots, w_i)$ and $C(w_{i-n+1}, \ldots, w_{i-1})$ need to be counted. Therefore, the word combinations in the training sample can be determined as $w_{i-n+1}, \ldots, w_i$ and $w_{i-n+1}, \ldots, w_{i-1}$, i.e., the n words ending with the ith word and the n-1 words preceding the ith word, where i ranges over [1, m] and m is the number of words contained in the training sample. n is determined by the particular language model employed: when the unigram language model is adopted, n is 1; when the bi-gram language model is adopted, n is 2; when the tri-gram language model is adopted, n is 3; and so on.
Of course, in practical applications, only $w_{i-n+1}, \ldots, w_i$ may be used to determine the word combinations in the training samples, or each single word in the training samples may be used as a word combination, and so on; this application does not limit this.
Take the case where the n-gram language model is a bi-gram language model and the word combinations take the forms $w_{i-n+1}, \ldots, w_i$ and $w_{i-n+1}, \ldots, w_{i-1}$ as an example. Assuming that a training sample contains m words, the word combinations in the training sample are: $w_0 w_1, w_1 w_2, w_2 w_3, \ldots, w_{m-1} w_m$ and $w_0, w_1, w_2, \ldots, w_{m-1}$. Note that because $w_0$ is the 0th word, i.e., $w_0$ does not actually exist, the numbers of occurrences of $w_0$ and $w_0 w_1$ can be set in advance.
It should be noted that after determining each word combination of each training sample, the word combination may be filtered in combination with a preset word set (also referred to as a word bag). Specifically, it may be determined whether each word combination is included in the preset word set, and if not, the word combination is deleted. Here, the bag of words is introduced, so that the n-gram language model is more targeted, and the size of the parameter space of the language model is reduced.
In a specific implementation, the word bag may include the 8105 standard Chinese characters promulgated by the State Council, the ten thousand most commonly used English words, and all numeric character strings in the corpus after numeric dimension reduction.
And 240, counting the occurrence times of each word combination in each training sample.
It can be understood that, if each word combination is subjected to deletion processing, the step only counts the occurrence times of each word combination subjected to deletion processing in each training sample.
Here, the number of occurrences of each word combination may be counted over all training samples in the corpus. In one implementation, a MapReduce framework can be used for distributed statistics; for example, the ngramCount component of the PAI platform of Alibaba Cloud can be used to perform the statistics over a billion-scale corpus.
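A single-machine sketch of steps 230 to 240 is given below for illustration; it extracts the n-word combinations and their (n-1)-word contexts from each preprocessed and tokenized training sample, drops combinations not covered by the preset word bag (one reading of the filtering step described above), and counts occurrences. The distributed MapReduce/PAI statistics mentioned above are not reproduced here, and the function name is an assumption.

```python
from collections import Counter

def count_combinations(training_samples, word_bag, n=2):
    # steps 230-240 on a single machine; each sample is a list of tokens after preprocessing
    ngram_counts, context_counts = Counter(), Counter()
    for words in training_samples:
        for i in range(len(words) - n + 1):
            combo = tuple(words[i:i + n])
            # drop combinations not covered by the preset word bag
            if not all(w in word_bag for w in combo):
                continue
            ngram_counts[combo] += 1
            context_counts[combo[:-1]] += 1
    return ngram_counts, context_counts
```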
And after the word combinations and times are obtained through statistics, the training of the N-gram language model is completed.
The setting process of the first and second function formulas will be described below with reference to fig. 3. In fig. 3, the following steps may be included:
at step 310, at least one sample word sequence of each training sample is selected from each word combination in each training sample.
The sample word sequences here may refer to the word combinations containing n words among the word combinations, where n is as described above. Take as an example an n-gram language model that is a bi-gram language model, where the word combinations in a certain training sample are $w_0 w_1, w_1 w_2, w_2 w_3, \ldots, w_{m-1} w_m$ and $w_0, w_1, w_2, \ldots, w_{m-1}$. Since n is 2, the at least one sample word sequence selected for the training sample is: $w_0 w_1, w_1 w_2, w_2 w_3, \ldots, w_{m-1} w_m$.
And step 320, determining the probability fraction value of each sample word sequence of each training sample according to the N-gram language model.
It is to be understood that the probability score value for determining the sample word sequence is similar to the probability score value for determining the word sequence, and the description thereof is omitted here.
Step 330, determining the average probability score value and the standard deviation value of the probability score of each training sample according to the probability score value of each sample word sequence of each training sample and the number of sample word sequences contained in each training sample.
The average probability score value of a training sample is determined in a manner similar to that of the text to be recognized, and the same applies to the probability score standard deviation value; details are not repeated here.
And 340, sequencing the training samples according to the average probability score value and the standard difference value of the probability score of each training sample.
That is, the training samples are ranked according to their AvgScore and SdScore, respectively. Specifically, the training samples may be sorted in ascending order of AvgScore. From formulas 3 and 4, AvgScore is negative, so the smaller the AvgScore of a training sample, the less meaningful the training sample is. The training samples may be sorted in descending order of SdScore. Because the probability score standard deviation value reflects how much the sample word sequences fluctuate within a training sample, a larger standard deviation value indicates greater fluctuation, i.e., the training sample is more likely to be abnormal text.
Step 350, according to the sorting result, determining a first function formula for performing normalization processing on the average probability score value, and determining a second function formula for performing normalization processing on the probability score standard deviation value.
For example, for AvgScore, since its theoretical maximum magnitude is 24, AvgScore is divided by 24 and the absolute value is taken, converting it to a positive number between 0 and 1. It was then found by observation that the cumulative distribution function of a Beta distribution with parameters of 2 could be chosen as the first function formula for normalizing AvgScore. For example, the first function formula may be as shown in formula 6.
$$F_1(\mathrm{AvgScore}(S)) = \mathrm{BetaCDF}\!\left(\left|\frac{\mathrm{AvgScore}(S)}{24}\right|;\ 2,\ 2\right) \qquad \text{(formula 6)}$$
For SdScore, since its theoretical maximum is 7, SdScore is divided by 7 to convert it to a positive number between 0 and 1. It was then found by observation that the cumulative distribution function of a Gamma distribution with parameters of 5 and 1, respectively, can be selected as the second function formula for normalizing SdScore. For example, the second function formula may be as shown in formula 7.
$$F_2(\mathrm{SdScore}(S)) = \mathrm{GammaCDF}\!\left(\frac{\mathrm{SdScore}(S)}{7};\ \mathrm{shape}=5,\ \mathrm{scale}=1\right) \qquad \text{(formula 7)}$$
It should be noted that the cumulative distribution functions of the Beta and Gamma distributions may be calculated with Commons Math of the Apache Software Foundation.
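As an illustration only, the two normalization functions could be written with SciPy's distribution CDFs in place of Apache Commons Math. The Beta shape parameters (2, 2) are an assumption based on the phrase "parameters of 2", and the Gamma distribution uses shape 5 and scale 1 as stated above.

```python
from scipy.stats import beta, gamma

def f1(avg_score):
    # |AvgScore/24| mapped through a Beta CDF; shape parameters (2, 2) are an assumption
    return beta.cdf(abs(avg_score / 24.0), 2, 2)

def f2(sd_score):
    # SdScore/7 mapped through the CDF of a Gamma distribution with shape 5 and scale 1
    return gamma.cdf(sd_score / 7.0, a=5, scale=1.0)
```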
Of course, the above is only one method for determining the first function formula and the second function formula, and in practical applications, after determining the average probability score value and the standard deviation value of the probability score of each training sample, the first function formula and the second function formula may also be determined according to a simple linear normalization manner. That is, after performing steps 310-330, step 340 and step 350 may not be performed, and the first function formula is directly determined as follows:
$$F_1(\mathrm{AvgScore}_j) = \frac{\mathrm{AvgScore}_j - \min\limits_{1 \le i \le n} \mathrm{AvgScore}_i}{\max\limits_{1 \le i \le n} \mathrm{AvgScore}_i - \min\limits_{1 \le i \le n} \mathrm{AvgScore}_i} \qquad \text{(formula 8)}$$
where $\mathrm{AvgScore}_j$ is the average probability score value currently being normalized, $\mathrm{AvgScore}_i$ is the average probability score value of the ith training sample, and n is the number of training samples in the corpus.
The second function formula is directly determined as follows:
$$F_2(\mathrm{SdScore}_j) = \frac{\mathrm{SdScore}_j - \min\limits_{1 \le i \le n} \mathrm{SdScore}_i}{\max\limits_{1 \le i \le n} \mathrm{SdScore}_i - \min\limits_{1 \le i \le n} \mathrm{SdScore}_i} \qquad \text{(formula 9)}$$
where $\mathrm{SdScore}_j$ is the probability score standard deviation value currently being normalized, $\mathrm{SdScore}_i$ is the probability score standard deviation value of the ith training sample, and n is the number of training samples in the corpus.
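Under the min-max reading of formulas 8 and 9 above, the simple linear normalization could be sketched as follows; this is an assumption about the exact linear form, since the original drawings of the formulas are not reproduced in the text.

```python
def linear_normalize(value, training_values):
    # min-max normalization over the per-training-sample values, one plausible reading of formulas 8 and 9
    lo, hi = min(training_values), max(training_values)
    return (value - lo) / (hi - lo) if hi > lo else 0.0
```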
It should be noted that, in order to improve the efficiency of semantic-free text recognition, the step of training the model and the step of determining the normalization formulas may be performed in parallel. That is, after preprocessing the training samples, the corpus can be divided into two sets: a training set and a test set. Steps 230-240 are performed for each training sample in the training set, and steps 310-350 are performed for the training samples in the test set. Of course, in practical applications, some training samples containing random keyboard strokes, split words, or deformed content (such as "fa Lung g0 ng") may be added to the test set.
In summary, the three parts of model training, normalization formula determination and semantic-free text recognition can be implemented independently or in combination. When the three parts are combined to implement, the technical scheme of the application can also be shown in fig. 4. In fig. 4, training samples in the corpus may be collected first, where the training samples may be chinese text, english text, or text in other target languages. Then, preprocessing the training samples in the corpus, including: removing interference elements, converting complex and simplified characters, reducing dimensions of numbers, splitting clauses and the like. After preprocessing the training samples, the corpus can be divided into two sets: a training set and a testing set, wherein training samples in the training set can be used for training the n-gram language model. After the n-gram language model is trained, a normalization formula can be determined in conjunction with the language model. At this point, the model training part and the normalization formula determination part are completed. The text to be recognized can then be recognized. For the text to be recognized, the text to be recognized is preprocessed firstly. And then, the trained language models and the determined normalization formulas can be integrated, and the text to be recognized is scored. And finally, identifying whether the text to be identified is a semantic-free text or not according to the scoring result (namely, the comprehensive score value).
In addition, this application still has following advantage:
1) Training samples in the corpus do not need to be labeled manually, abnormal content such as split or variant words can be identified accurately, and the scheme runs fast and can quickly recognize massive texts to be recognized, improving the efficiency and accuracy of semantic-free text recognition.
2) Training samples of the target language can be organized into the corpus and a language model for the target language trained, so that all texts in non-target languages can be identified as semantic-free text.
3) The method can identify not only text containing semantic elements but also low-quality text generated by randomly striking an input device such as a keyboard.
Corresponding to the above method for recognizing a semantic-free text, an embodiment of the present invention further provides a device for recognizing a semantic-free text, as shown in fig. 5, where the device includes:
an obtaining unit 501 is configured to obtain a text to be recognized.
The preprocessing unit 502 is configured to preprocess the text to be recognized acquired by the acquisition unit 501.
A determining unit 503, configured to determine each word sequence of the text to be recognized after being preprocessed by the preprocessing unit 502.
The determining unit 503 is further configured to determine probability score values of the word sequences according to the N-gram language model.
The determining unit 503 is further configured to determine an average probability score value and/or a standard deviation value of the probability scores of the texts to be recognized according to the probability score values of the word sequences and the number of the word sequences.
The determining unit 503 is further configured to determine a comprehensive score value of the text to be recognized according to the average probability score value and/or the standard deviation of the probability score.
The identifying unit 504 is configured to identify the text to be identified as the semantic-free text when the composite score value determined by the determining unit 503 meets a preset condition.
Optionally, the preprocessing unit 502 may be specifically configured to:
removing interference elements in the text to be recognized;
converting traditional characters in a text to be recognized into simplified characters;
converting the numeric character strings in the text to be recognized into a preset format;
and splitting the clauses of the text to be recognized.
Optionally, the apparatus may further include:
a training unit 505, configured to obtain a training sample set, where the training sample set includes at least one training sample.
The training samples may include: Chinese text, English text, and/or text in other target languages. The text may include news content, blog content, forum content, and/or chat content, among others.
And preprocessing each training sample in the training sample set.
And determining each word combination in the training samples for each preprocessed training sample.
And counting the occurrence times of each word combination in each training sample.
The word combinations and the times constitute an N-gram language model.
Optionally, the apparatus may further include:
the determining unit 506 is configured to determine, for each word combination in the word combinations, whether the word combination is included in a preset word set.
The deleting unit 507 is configured to delete the word combination if the determining unit 506 determines that the word combination is not included in the preset word set.
Optionally, counting the number of times of occurrence of each word combination in each training sample includes:
and counting the occurrence times of each word combination subjected to deletion processing in each training sample.
Optionally, the determining unit 503 may be further specifically configured to:
and carrying out normalization processing on the average probability fraction value to obtain a first processing result.
And carrying out normalization processing on the probability score standard difference value to obtain a second processing result.
The first processing result is compared with the second processing result.
And when the first processing result is larger, determining the comprehensive score value of the text to be recognized according to the first processing result.
And when the second processing result is larger, determining the comprehensive score value of the text to be recognized according to the second processing result.
According to the first processing result, determining a comprehensive score value of the text to be recognized comprises the following steps:
and amplifying the first processing result, and determining the amplified first processing result as a comprehensive score value of the text to be recognized.
According to the second processing result, determining a comprehensive score value of the text to be recognized, which comprises the following steps:
and amplifying the second processing result, and determining the amplified second processing result as the comprehensive score value of the text to be recognized.
Optionally, the apparatus may further include:
a selecting unit 508, configured to select at least one sample word sequence of each training sample from each word combination in each training sample.
The determining unit 503 is further configured to determine a probability score value of each sample word sequence of each training sample according to the N-gram language model.
The determining unit 503 is further configured to determine an average probability score value and a standard deviation value of the probability score of each training sample according to the probability score value of each sample word sequence of each training sample and the number of sample word sequences included in each training sample.
A sorting unit 509, configured to sort the training samples according to the average probability score value and the standard deviation of the probability score of each training sample determined by the determining unit 503.
The determining unit 503 is further configured to determine, according to the sorting result of the sorting unit 509, a first function formula for performing normalization processing on the average probability score value, and determine a second function formula for performing normalization processing on the probability score standard deviation value.
Optionally, the normalizing the average probability score value includes:
and normalizing the average probability fraction value according to a first function formula.
The probability score standard deviation value is normalized, and the normalization processing comprises the following steps:
and according to a second function formula, carrying out normalization processing on the probability fraction standard deviation value.
The functions of the functional modules of the device in the embodiment of the present application may be implemented through the steps in the method embodiment described above, and therefore, the specific working process of the device provided in the present application is not repeated herein.
According to the method and the device for recognizing the semanteme-free text, the obtaining unit 501 obtains the text to be recognized. The preprocessing unit 502 preprocesses the text to be recognized. The determination unit 503 determines each word sequence of the preprocessed text to be recognized. The determination unit 503 determines a probability score value of each word sequence according to the N-gram language model. The determining unit 503 determines an average probability score value and/or a standard deviation value of the probability scores of the texts to be recognized according to the probability score value of each word sequence and the number of the word sequences. The determining unit 503 determines a comprehensive score value of the text to be recognized according to the average probability score value and/or the standard deviation value of the probability score. When the integrated score value satisfies a preset condition, the recognition unit 504 recognizes the text to be recognized as a semantic-free text. Therefore, the accuracy and comprehensiveness of the semantic-free text recognition can be improved.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (12)

1. A method for recognizing semantic-free text is characterized by comprising the following steps:
acquiring a text to be identified;
preprocessing the text to be recognized;
determining each word sequence of the preprocessed text to be recognized;
determining the probability score value of each word sequence according to the N-gram language model;
determining an average probability score value and a probability score standard difference value of the text to be recognized according to the probability score value of each word sequence and the number of the word sequences;
respectively carrying out normalization processing on the average probability score value and the probability score standard difference value to obtain a corresponding first processing result and a corresponding second processing result;
comparing the first processing result with the second processing result;
if the first processing result is larger, performing amplification processing on the first processing result, and determining the amplified first processing result as a comprehensive score value of the text to be recognized;
if the second processing result is larger, performing amplification processing on the second processing result, and determining the amplified second processing result as a comprehensive score value of the text to be recognized;
and when the comprehensive score value meets a preset condition, identifying the text to be identified as a semantic-free text.
2. The method according to claim 1, wherein the preprocessing the text to be recognized comprises any one or more of the following steps:
removing interference elements in the text to be recognized;
converting traditional characters in the text to be recognized into simplified characters;
converting the numeric character strings in the text to be recognized into a preset format;
and splitting clauses of the text to be recognized.
3. The method of claim 1 or 2, further comprising: the step of training the N-gram language model comprises the following steps:
acquiring a training sample set, wherein the training sample set comprises at least one training sample;
preprocessing each training sample in the training sample set;
determining each word combination in each preprocessed training sample;
counting the occurrence times of each word combination in each training sample;
the respective word combinations and the number of times constitute the N-gram language model.
4. The method of claim 3, wherein the training samples comprise:
Chinese text, English text and/or text in other target languages; the text includes news content, blog content, forum content, and/or chat content.
5. The method of claim 3, wherein after said determining respective word combinations in the training sample, further comprising:
judging whether the word combination is contained in a preset word set or not for each word combination in each word combination, and deleting the word combination if the word combination is not contained in the preset word set;
the counting of the occurrence times of each word combination in each training sample comprises:
and counting the occurrence times of each word combination subjected to deletion processing in each training sample.
6. The method of claim 3, further comprising:
selecting at least one sample word sequence of each training sample from each word combination in each training sample;
determining a probability score value of each sample word sequence of each training sample according to the N-gram language model;
determining an average probability score value and a probability score standard deviation value of each training sample according to the probability score value of each sample word sequence of each training sample and the number of the sample word sequences contained in each training sample;
sequencing the training samples according to the average probability score value and the standard deviation value of the probability score of the training samples;
according to the sequencing result, determining a first function formula for carrying out normalization processing on the average probability score value, and determining a second function formula for carrying out normalization processing on the probability score standard deviation value;
the normalizing the average probability score value includes:
according to the first function formula, carrying out normalization processing on the average probability fraction value;
the normalizing the probability score standard deviation value comprises:
and carrying out normalization processing on the probability fraction standard deviation value according to the second function formula.
7. An apparatus for recognizing semantic-free text, comprising:
the acquiring unit is used for acquiring a text to be recognized;
the preprocessing unit is used for preprocessing the text to be recognized acquired by the acquiring unit;
the determining unit is used for determining each word sequence of the text to be recognized after the preprocessing unit preprocesses the text;
the determining unit is further configured to determine a probability score value of each word sequence according to an N-gram language model;
the determining unit is further configured to determine an average probability score value and a standard deviation of the probability scores of the texts to be recognized according to the probability score values of the word sequences and the number of the word sequences;
the determining unit is further configured to perform normalization processing on the average probability score value and the probability score standard deviation value respectively to obtain a corresponding first processing result and a corresponding second processing result;
comparing the first processing result with the second processing result;
if the first processing result is larger, performing amplification processing on the first processing result, and determining the amplified first processing result as a comprehensive score value of the text to be recognized;
if the second processing result is larger, performing amplification processing on the second processing result, and determining the amplified second processing result as a comprehensive score value of the text to be recognized;
and the identification unit is used for identifying the text to be identified as the semantic-free text when the comprehensive score value determined by the determination unit meets a preset condition.
8. The apparatus according to claim 7, wherein the preprocessing unit is specifically configured to:
removing interference elements in the text to be recognized;
converting traditional characters in the text to be recognized into simplified characters;
converting the numeric character strings in the text to be recognized into a preset format;
and splitting clauses of the text to be recognized.
9. The apparatus of claim 7 or 8, further comprising:
the training unit is used for acquiring a training sample set, and the training sample set comprises at least one training sample;
preprocessing each training sample in the training sample set;
determining each word combination in each preprocessed training sample;
counting the occurrence times of each word combination in each training sample;
the respective word combinations and the number of times constitute the N-gram language model.
10. The apparatus of claim 9, wherein the training samples comprise:
Chinese text, English text and/or text in other target languages; the text includes news content, blog content, forum content, and/or chat content.
11. The apparatus of claim 9, further comprising:
the judging unit is used for judging whether each word combination in each word combination is contained in a preset word set or not;
a deleting unit, configured to delete the word combination if the determining unit determines that the word combination is not included in the preset word set;
the counting of the occurrence times of each word combination in each training sample comprises:
and counting the occurrence times of each word combination subjected to deletion processing in each training sample.
12. The apparatus of claim 9, further comprising:
a selecting unit, configured to select at least one sample word sequence of each training sample from each word combination in each training sample;
the determining unit is further configured to determine a probability score value of each sample word sequence of each training sample according to the N-gram language model;
the determining unit is further configured to determine an average probability score value and a standard deviation of the probability scores of the training samples according to the probability score value of each sample word sequence of each training sample and the number of sample word sequences included in each training sample;
the sequencing unit is used for sequencing the training samples according to the average probability score value and the standard deviation value of the probability score of the training samples determined by the determining unit;
the determining unit is further configured to determine, according to the sorting result of the sorting unit, a first function formula for performing normalization processing on the average probability score value, and determine a second function formula for performing normalization processing on the probability score standard deviation value;
the normalizing the average probability score value includes:
according to the first function formula, carrying out normalization processing on the average probability fraction value;
the normalizing the probability score standard deviation value comprises:
and carrying out normalization processing on the probability fraction standard deviation value according to the second function formula.
CN201710182218.7A 2017-03-24 2017-03-24 Semantic-free text recognition method and device Active CN108628822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710182218.7A CN108628822B (en) 2017-03-24 2017-03-24 Semantic-free text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710182218.7A CN108628822B (en) 2017-03-24 2017-03-24 Semantic-free text recognition method and device

Publications (2)

Publication Number Publication Date
CN108628822A CN108628822A (en) 2018-10-09
CN108628822B true CN108628822B (en) 2021-12-07

Family

ID=63707631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710182218.7A Active CN108628822B (en) 2017-03-24 2017-03-24 Semantic-free text recognition method and device

Country Status (1)

Country Link
CN (1) CN108628822B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681670B (en) * 2019-02-25 2023-05-12 北京嘀嘀无限科技发展有限公司 Information identification method, device, electronic equipment and storage medium
CN110493088B (en) * 2019-09-24 2021-06-01 国家计算机网络与信息安全管理中心 Mobile internet traffic classification method based on URL
CN112633009A (en) * 2020-12-29 2021-04-09 扬州大学 Identification method for random combination uploading field
CN113094543B (en) * 2021-04-27 2023-03-17 杭州网易云音乐科技有限公司 Music authentication method, device, equipment and medium
CN113378541B (en) * 2021-05-21 2023-07-07 标贝(北京)科技有限公司 Text punctuation prediction method, device, system and storage medium
CN113657118A (en) * 2021-08-16 2021-11-16 北京好欣晴移动医疗科技有限公司 Semantic analysis method, device and system based on call text
CN115374779B (en) * 2022-10-25 2023-01-10 北京海天瑞声科技股份有限公司 Text language identification method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605690A (en) * 2013-11-04 2014-02-26 北京奇虎科技有限公司 Device and method for recognizing advertising messages in instant messaging
CN103942191A (en) * 2014-04-25 2014-07-23 中国科学院自动化研究所 Horrific text recognizing method based on content

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7493293B2 (en) * 2006-05-31 2009-02-17 International Business Machines Corporation System and method for extracting entities of interest from text using n-gram models

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605690A (en) * 2013-11-04 2014-02-26 北京奇虎科技有限公司 Device and method for recognizing advertising messages in instant messaging
CN103942191A (en) * 2014-04-25 2014-07-23 中国科学院自动化研究所 Horrific text recognizing method based on content

Also Published As

Publication number Publication date
CN108628822A (en) 2018-10-09

Similar Documents

Publication Publication Date Title
CN108628822B (en) Semantic-free text recognition method and device
US11093854B2 (en) Emoji recommendation method and device thereof
CN107045496B (en) Error correction method and error correction device for text after voice recognition
CN109446404B (en) Method and device for analyzing emotion polarity of network public sentiment
US8983826B2 (en) Method and system for extracting shadow entities from emails
Eskander et al. Foreign words and the automatic processing of Arabic social media text written in Roman script
KR20160121382A (en) Text mining system and tool
CN112287684A (en) Short text auditing method and device integrating variant word recognition
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN107688630A (en) A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme
CN114266256A (en) Method and system for extracting new words in field
CN112084308A (en) Method, system and storage medium for text type data recognition
Farhoodi et al. N-gram based text classification for Persian newspaper corpus
CN113282717B (en) Method and device for extracting entity relationship in text, electronic equipment and storage medium
Hardeniya et al. An approach to sentiment analysis using lexicons with comparative analysis of different techniques
CN110489759B (en) Text feature weighting and short text similarity calculation method, system and medium based on word frequency
CN112395881A (en) Material label construction method and device, readable storage medium and electronic equipment
CN108763258B (en) Document theme parameter extraction method, product recommendation method, device and storage medium
CN109947932B (en) Push information classification method and system
CN115827867A (en) Text type detection method and device
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN115577109A (en) Text classification method and device, electronic equipment and storage medium
CN110941713A (en) Self-optimization financial information plate classification method based on topic model
TWI534640B (en) Chinese network information monitoring and analysis system and its method
CN109597879B (en) Service behavior relation extraction method and device based on 'citation relation' data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201016

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201016

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant