WO2019096032A1 - Text information processing method, computer device, and computer-readable storage medium - Google Patents

Text information processing method, computer device, and computer-readable storage medium

Info

Publication number
WO2019096032A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
parameter
text information
information
Prior art date
Application number
PCT/CN2018/114188
Other languages
English (en)
French (fr)
Inventor
彭思翔
钱淑钗
Original Assignee
Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2019096032A1 publication Critical patent/WO2019096032A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present application relates to the field of communications technologies, and in particular, to a text information processing method, a computer device, and a computer readable storage medium.
  • Text information is the main information carrier on social platforms, and templated text can be generated through a model and transmitted automatically.
  • Received text information can be identified so that it can be processed according to the recognition result, for example by intercepting identified spam or blocking identified pornographic information; accurately identifying spam or pornographic information is therefore critical.
  • various embodiments of the present application provide a text information processing method, a computer device, and a computer readable storage medium.
  • a text information processing method implemented by a computer device, comprising:
  • a computer apparatus comprising a processor and a memory, the memory storing computer readable instructions, the computer readable instructions being executed by the processor such that the processor performs the following steps:
  • a non-transitory computer readable storage medium storing computer readable instructions, when executed by one or more processors, causes the one or more processors to perform the following steps:
  • FIG. 1 is a schematic diagram of a scenario of a text information processing system provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a text information processing method provided by an embodiment of the present application.
  • FIG. 3 is another schematic flowchart of a text information processing method provided by an embodiment of the present application.
  • FIG. 4 is another schematic flowchart of a text information processing method provided by an embodiment of the present application.
  • FIG. 5 is another schematic flowchart of a text information processing method according to an embodiment of the present application.
  • FIG. 6 is another schematic flowchart of a text information processing method provided by an embodiment of the present application.
  • FIG. 7 is another schematic flowchart of a text information processing method according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of hardware of a computer device according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of hardware of a computer device according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of hardware of a computer device according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of hardware of a computer device according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
  • the embodiment of the present application provides a text information processing method, device, and storage medium.
  • FIG. 1 is a schematic diagram of a scenario of a text information processing system according to an embodiment of the present application.
  • The text information processing system may include a text information processing apparatus, which may be integrated into a server and is mainly used for receiving the text information to be identified, performing word-cutting on the text information according to a preset word-cutting rule, and generating at least one word. Then, the parameters corresponding to the at least one word are obtained, each parameter identifying a word; the feature information of the text information is determined according to the parameters and a preset training model, the training model being trained from at least one type of templated text. Finally, the recognition result can be determined according to the feature information, that is, the type of the templated text to which the text information belongs is identified according to the feature information.
  • In an embodiment, the text information processing system further includes a memory for storing the training model. The server may obtain the training model from training samples in advance and store it in the memory, so that when text information subsequently needs to be recognized, the training model can be obtained directly from the memory.
  • In an embodiment, the text information processing system further includes one terminal (for example, terminal A) or a plurality of terminals (for example, terminal A, terminal B, and terminal C), where a terminal may be a tablet computer, a mobile phone, a notebook computer, a desktop computer, or another computing device equipped with a storage unit and a microprocessor.
  • the terminal may send the text information to be recognized to the computer device, so that the computer device performs corresponding processing on the received text information to be recognized.
  • the terminal may send a plurality of training samples to the computer device to cause the computer device to train the received plurality of training samples, generate a training model, and the like.
  • a computer device which may be integrated into a network device such as a server or a gateway.
  • A text information processing method includes: receiving text information to be recognized; performing word-cutting processing on the text information according to a preset word-cutting rule to generate at least one word; acquiring parameters corresponding to the at least one word, each parameter identifying a word; determining feature information of the text information according to the parameters and a preset training model, the training model being trained from at least one type of templated text; and identifying the type of the templated text to which the text information belongs according to the feature information.
  • FIG. 2 is a schematic flowchart diagram of a text information processing method according to a first embodiment of the present application.
  • the text information processing method includes:
  • Step S101: text information to be recognized is received.
  • The text information processing method can be applied to e-mail, instant messaging (e.g., WeChat, QQ), blogs, circles of friends, information push, live broadcast, and other scenarios in which text information sent by a terminal needs to be recognized.
  • The computer device receives the text information to be recognized; the text information may be sent from a terminal such as a tablet computer, a mobile phone, or a computer, and may be information sent by e-mail, information sent through instant messaging, information published on a blog, push information displayed in a pop-up box, information published in a circle of friends, or information displayed in a live broadcast.
  • The text information may include Chinese, English, punctuation marks, emoticons, or other content; the specific content is not limited herein.
  • Step S102: the text information is subjected to word-cutting processing according to a preset word-cutting rule to generate at least one word.
  • The computer device performs word-cutting processing on the received text information to be recognized according to a preset word-cutting rule.
  • The preset word-cutting rule may be cutting at a preset number of characters per interval; for example, every two characters are cut into one word, or every single character is cut into one word.
  • The preset word-cutting rule may also be a uniform word-cutting according to the total number of characters of the text information. For example, when the total number of characters of a piece of text information is 15, it may be divided evenly into one word every 5 characters.
  • The preset word-cutting rule may also be a random word-cutting. For example, when the total number of characters of a piece of text information is 15, only three groups of two characters each may be extracted; or the 15-character text information may be cut into a word of 2 characters, a word of 1 character, a word of 9 characters, and a word of 3 characters.
  • The preset word-cutting rules may be flexibly set according to actual needs, for example, dictionary-based word-cutting, statistics-based word-cutting, or artificial-intelligence-based word-cutting; the specific content is not limited herein. An illustrative sketch of interval-based word-cutting is given below.
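  • As a purely illustrative sketch (not part of the original disclosure), the interval-based word-cutting rule described above could look as follows in Python; the function name and the two-character interval are assumptions:

        def cut_words(text: str, interval: int = 2) -> list[str]:
            """Cut text into words of `interval` characters each (interval-based rule)."""
            return [text[i:i + interval] for i in range(0, len(text), interval)]

        # A 15-character message cut every 2 characters; the last word is shorter.
        print(cut_words("abcdefghijklmno"))
        # ['ab', 'cd', 'ef', 'gh', 'ij', 'kl', 'mn', 'o']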
  • In an embodiment, the word-cutting rule for the text information to be recognized may be determined according to a mapping relationship, where the mapping relationship is a mapping relationship between a word set and a parameter set.
  • At least one word may be generated. As shown in FIG. 3, only the word 1 may be generated, or the word 1 to the word n may be generated, n is an integer, and n>1.
  • A word may be composed of one Chinese character, of a plurality of characters and other symbols, or of English letters.
  • In practical applications, a word may also include variant words; the specific content is not limited herein.
  • Variant words are words that differ from the normative words; for example, the normative word is "beauty" and the corresponding variant is " ⁇ ".
  • The computer device may perform word-cutting processing on the received text information in real time or at preset intervals, or may perform word-cutting processing after receiving a preset amount of text information.
  • Step S103: parameters corresponding to the at least one word are acquired.
  • each word corresponds to one parameter.
  • Each parameter identifies a word, which can be a number or a character that uniquely identifies the word. For example, the parameter corresponding to "we” is 0.1, and the parameter corresponding to "I" is 0.5.
  • In an embodiment, the computer device pre-stores a training model that includes a mapping relationship between words and parameters, and the step of acquiring the parameters corresponding to the at least one word may include: acquiring the parameters corresponding to the at least one word according to the mapping relationship in the training model.
  • In an embodiment, the computer device obtains the parameter corresponding to a word by calculation: first, the target frequency of the word in the text information to be identified is acquired, where the target frequency is the frequency with which the word occurs in the text information to be identified; then, the target reverse text frequency of the word is acquired, where the target reverse text frequency is computed from the pieces of text information, among a plurality of pieces of text information, that contain the word.
  • When the mapping relationship exists, the computer device may preferentially acquire the parameters corresponding to the at least one word according to the mapping relationship; otherwise, the parameter corresponding to the word is calculated according to the target frequency and the target reverse text frequency, as sketched below.
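  • A minimal sketch of this acquisition step, illustrative only: the mapping values 0.1 and 0.5 are the example values given above, and all other names are assumptions.

        import math
        from collections import Counter

        # Stored mapping relationship; 0.1 and 0.5 are the example values above.
        word_to_parameter = {"we": 0.1, "I": 0.5}

        def parameter_for(word, words_of_text, all_texts_as_words):
            # Prefer the mapping relationship when the word already has a parameter.
            if word in word_to_parameter:
                return word_to_parameter[word]
            # Otherwise: target frequency * target reverse text frequency.
            counts = Counter(words_of_text)
            target_frequency = counts[word] / len(words_of_text)
            containing = sum(1 for words in all_texts_as_words if word in words)
            reverse_text_frequency = math.log(len(all_texts_as_words) / max(containing, 1))
            return target_frequency * reverse_text_frequency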
  • Step S104: the feature information of the text information is determined according to the parameters and the preset training model.
  • the computer device is pre-configured with a training model that is trained from at least one type of templated text.
  • the training model is trained from at least one type of templated text of erotic information, drug sales information, investment information, pyramid sales information, and the like.
  • For example, in the templated text "Look [ ⁇ | Miss], hello [D | V | E] have benefits", the first variable is " ⁇ " or "Miss", the second variable is "D", "V", or "E", and the template part is "Look, hello, have benefits".
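  • As an illustrative aside (not from the application), a templated text with variable slots of this kind can be written as a regular expression whose groups capture the variables; the slot values used here are assumptions:

        import re

        # Fixed template part with two variable slots.
        template = re.compile(r"Look (Miss|Madam), hello (D|V|E) have benefits")

        match = template.fullmatch("Look Miss, hello D have benefits")
        if match:
            print(match.groups())  # ('Miss', 'D'), i.e. the variable parts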
  • the step of training model generation can include:
  • Step (1) acquiring a plurality of training samples corresponding to the templated text
  • Step (2) performing a word segmentation process for each training sample according to a preset word-cutting rule to generate a word set including a plurality of words;
  • Step (3) preprocessing the word set to generate a parameter set, each parameter in the parameter set is used to identify each word in the word set;
  • Step (4) performing clustering processing on the plurality of training samples according to the parameter set to generate a text clustering list
  • Step (5) generating a training model based on the text clustering list.
  • The plurality of training samples corresponding to the templated text may be randomly obtained from received historical text information, may be extracted from historical text information of different scenes such as pornographic information, drug sales information, and pyramid sales information, or may be created according to the different scenes.
  • the number of the training samples and the manner of obtaining the samples may be flexibly set according to actual needs, and the specific content is not limited herein.
  • Each training sample is separately subjected to word-cutting processing according to the preset word-cutting rule; in order to improve the reliability of processing the text information, the preset word-cutting rule can use any word-cutting algorithm.
  • the preset word-cutting rule is consistent with the above-mentioned word-cutting rule for word-cutting processing of text information, and is not described here.
  • After word-cutting, a word set containing a plurality of words can be generated, as shown in FIG. 4. Alternatively, each training sample may correspond to one of the word set 1 to the word set n, which together form the word sets corresponding to the plurality of training samples; each of the word set 1 to the word set n may contain one or more words, where n is an integer and n > 1.
  • For example, if each of 100 training samples is cut into one word, a word set containing 100 words can be generated; if each training sample is cut into 6 words, a word set containing 600 words can be generated.
  • Then, the word set is preprocessed to generate a parameter set, and each parameter in the parameter set is used to identify a word in the word set. Alternatively, the word set 1 to the word set n corresponding to the training samples may respectively correspond to the parameter set 1 to the parameter set n, which together constitute the parameter sets corresponding to the plurality of training samples, where n is an integer and n > 1.
  • In an embodiment, the step of preprocessing the word set to generate the parameter set may include: obtaining the frequency with which each word of the word set occurs in each training sample, and the reverse text frequency of the target training samples containing the word among the plurality of training samples; generating the target parameter corresponding to each word according to the frequency and the reverse text frequency; and generating the parameter set according to the target parameter corresponding to each word.
  • The preprocessing of the word set includes a term frequency-inverse document frequency (tf-idf) conversion, a weighting technique for information retrieval and text mining that can be used to evaluate the degree to which a word is important for a piece of text information, or for one training sample among a plurality of training samples.
  • The importance of a word increases proportionally with the number of times it appears in the text information, but decreases inversely with the frequency with which it appears across the plurality of training samples.
  • the tf in tf-idf represents the word frequency.
  • The term frequency (tf) refers to the frequency at which a given word appears in a document, that is, in this embodiment, the frequency with which a word occurs in a training sample.
  • The idf in tf-idf indicates the reverse text (inverse document) frequency. The word frequency is a normalization of the raw word count (i.e., the number of occurrences): since the same word tends to occur more times in a longer document than in a shorter one regardless of whether the word is important, normalization prevents the measure from being biased toward longer documents.
  • The frequency (i.e., word frequency) with which a word t_i occurs in the training sample d_j is calculated as:

    tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

  • where tf_{i,j} represents the word frequency of the word t_i in the training sample d_j, n_{i,j} represents the number of occurrences of the word t_i in the training sample d_j, and Σ_k n_{k,j} represents the total number of occurrences of all words in the training sample d_j.
  • The inverse document frequency (idf) is a measure of the universal importance of a word.
  • The reverse text frequency of the target training samples containing the word t_i among the plurality of training samples may be obtained by dividing the total number of training samples by the number of target training samples containing the word t_i, and then taking the logarithm of the quotient:

    idf_i = log( |D| / |{ j : t_i ∈ d_j }| )

  • where idf_i represents the reverse text frequency, |D| represents the total number of training samples, and |{ j : t_i ∈ d_j }| represents the number of training samples containing the word t_i. The target parameter of the word is the product tf_{i,j} × idf_i.
  • After the frequency of each word of the word set in each training sample and the reverse text frequency of the target training samples containing the word among the plurality of training samples are calculated, the target parameter corresponding to each word can be generated according to the frequency and the reverse text frequency, and the parameter set is then generated according to the target parameter corresponding to each word.
  • Each word in the word set can form a one-to-one mapping relationship with each parameter in the parameter set.
  • The mapping relationship can be understood as a dictionary: when a word already exists in the dictionary, the parameter corresponding to it can be looked up in the dictionary without recalculation; otherwise, the parameter corresponding to the word needs to be calculated according to the aforementioned tf-idf conversion formula, as in the sketch below.
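  • An illustrative implementation of the tf-idf conversion stated above (not code from the application; all names are assumptions):

        import math
        from collections import Counter

        def tfidf_parameter_sets(samples_as_words):
            """samples_as_words: one list of words per training sample."""
            n_samples = len(samples_as_words)
            # Number of training samples containing each word (for idf_i).
            doc_freq = Counter(w for words in samples_as_words for w in set(words))
            parameter_sets = []
            for words in samples_as_words:
                counts = Counter(words)
                params = {
                    word: (n_ij / len(words))                     # tf_{i,j}
                          * math.log(n_samples / doc_freq[word])  # idf_i
                    for word, n_ij in counts.items()
                }
                parameter_sets.append(params)
            return parameter_sets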
  • the plurality of training samples may be clustered according to the parameter set, and the clustering processing may include a K-means clustering algorithm or a hierarchical clustering algorithm (Balanced Iterative Reducing and Clustering using Hierarchies, BIRCH).
  • the specific content is not limited here.
  • After the clustering processing, a text clustering list may be generated.
  • The text clustering list may be a single list formed by one type of clustered text, or a plurality of lists corresponding to multiple types of clustered text, each list containing one type of clustered text.
  • Then, a training model can be generated from the text clustering list, as shown in FIG. 4.
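  • As an illustrative sketch of the clustering step (assumed names; scikit-learn's KMeans stands in for the K-means clustering mentioned above, and the matrix values are invented):

        import numpy as np
        from sklearn.cluster import KMeans

        # Rows: parameter vectors of training samples (zero-padded to equal length).
        sample_matrix = np.array([
            [0.12, 0.00, 0.31],
            [0.10, 0.02, 0.29],
            [0.90, 0.75, 0.01],
            [0.88, 0.70, 0.03],
        ])

        kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sample_matrix)

        # Group sample indices by cluster label to form the text clustering list.
        text_clustering_list = {}
        for index, label in enumerate(kmeans.labels_):
            text_clustering_list.setdefault(int(label), []).append(index)
        print(text_clustering_list)  # e.g. {0: [0, 1], 1: [2, 3]}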
  • In an embodiment, the text information processing method further comprises: transforming the mapping relationship between the word set and the parameter set to generate a projection relationship of the mapping relationship on a preset space.
  • In an embodiment, the step of transforming the mapping relationship between the word set and the parameter set to generate the projection relationship of the mapping relationship on the preset space comprises: generating a sample matrix according to the mapping relationship, where each row vector of the sample matrix is the parameters corresponding to the words obtained after one training sample is processed; calculating a covariance matrix of the sample matrix and a diagonal matrix of its eigenvalues; and generating a transformation matrix according to the covariance matrix and the diagonal matrix, the transformation matrix being set as the projection relationship.
  • Specifically, the mapping relationship between the word set and the parameter set is transformed into an n*p-dimensional sample matrix dataMat, where the number of rows n of the sample matrix represents the number of training samples, and the number of columns p represents the number of words generated after each training sample is processed by word-cutting.
  • The vector length of each row of the generated matrix needs to be consistent. Since the number of words generated after word-cutting may be the same or different between training samples, row vectors of shorter length can be padded with 0 so that the vector length of each row is uniform; each row of the sample matrix then corresponds to the parameters of the words obtained after one training sample is processed.
  • The covariance matrix X of the sample matrix dataMat is calculated, the eigenvalues of the covariance matrix X are calculated, and a diagonal matrix D is generated from the eigenvalues; the diagonal matrix D is a (p, p)-dimensional diagonal matrix containing the eigenvalues λ_1, λ_2, ..., λ_p.
  • The transformation matrix P can be calculated from the covariance matrix X by singular value decomposition (SVD); for the symmetric covariance matrix, the decomposition takes the form:

    X = P D P^T

  • The transformation matrix P is a (p, p)-dimensional orthogonal matrix, and each column of the transformation matrix P is an eigenvector of the covariance matrix X.
  • The transformation matrix P solved by SVD is set as the projection relationship of the sample matrix dataMat (i.e., of the mapping relationship) on the preset space.
  • In an embodiment, the preset space may be a principal component space of the parameters corresponding to the words of the training samples.
  • P_j may be the first j columns of the transformation matrix P, that is, P_j is a (p, j)-dimensional matrix, and the projection Y_j = dataMat · P_j is an (n, j)-dimensional matrix.
  • In an embodiment, an inverse mapping relationship may be generated from the transformation matrix and the projection relationship by mapping the transformation from the principal component space back to the original space, and the word corresponding to a parameter may be determined according to the inverse mapping relationship.
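  • A minimal numpy sketch of the projection construction described above (illustrative only; variable names follow the text, and the example matrix is an assumption):

        import numpy as np

        # n*p sample matrix: rows are training samples, columns are word parameters
        # (rows padded with 0 to a common length p).
        dataMat = np.array([
            [0.12, 0.00, 0.31],
            [0.10, 0.02, 0.29],
            [0.90, 0.75, 0.01],
            [0.88, 0.70, 0.03],
        ])

        X = np.cov(dataMat, rowvar=False)   # (p, p) covariance matrix
        P, eigvals, _ = np.linalg.svd(X)    # SVD: X = P D P^T for symmetric X
        D = np.diag(eigvals)                # (p, p) diagonal matrix of eigenvalues

        j = 2
        P_j = P[:, :j]                      # first j columns of P: (p, j)
        Y_j = dataMat @ P_j                 # projection onto the principal component space: (n, j)
        print(Y_j.shape)                    # (4, 2)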
  • the step of generating the training model according to the text clustering list may include: generating the training model according to the mapping relationship, the projection relationship, and the text clustering list.
  • The training model generated from the mapping relationship between the word set and the parameter set (which may be a sample matrix), the projection relationship of the mapping relationship on the preset space (which may be a transformation matrix), and the text clustering list is then stored.
  • The computer device may determine the feature information of the text information according to the parameters and the training model. The feature information may include the category of the text information in the text clustering list, the number of texts corresponding to the category, and the similarity between the text information and the training samples in the text clustering list; the feature information can also be flexibly set according to actual needs, and the specific content is not limited herein.
  • the step of determining the feature information of the text information according to the parameter and the preset training model may include: determining feature information of the text information according to the parameter, the projection relationship in the training model, and the text clustering list in the training model.
  • In an embodiment, the step of determining the feature information of the text information according to the parameters, the projection relationship in the training model, and the text clustering list in the training model may include: performing projection processing on the parameters on the preset space according to the projection relationship to generate projection parameters; acquiring the centroids of the text clustering list; determining the shortest distance between the projection parameters and the centroids; and determining, according to the shortest distance, the category to which the text information belongs in the text clustering list, the number of texts corresponding to the category, and the similarity between the text information and the training samples in the text clustering list.
  • Specifically, the parameters corresponding to the words are projected onto the preset space (for example, the principal component space) according to the determined projection relationship to generate projection parameters, and the centroids that the text clustering list generates by projection in the clustering region are acquired; there may be one or more centroids.
  • the distance between the projection parameter and the centroid is calculated, and the distance may be a Euclidean distance, a Chebyshev distance, or a Hamming distance.
  • the specific content is not limited herein.
  • Then, the shortest distance between the projection parameters and the centroids is determined. For example, when there is only one centroid, the distance between that centroid and the projection parameters is the shortest distance; when there are multiple centroids, the shortest of the distances between the projection parameters and the centroids is taken.
  • the category to which the text information belongs in the text clustering list, the number of texts corresponding to the category, and the similarity between the text information and the training samples in the text clustering list may be determined according to the shortest distance.
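  • An illustrative nearest-centroid sketch of this step (assumed names and values; Euclidean distance is used, one of the distances mentioned above):

        import numpy as np

        def nearest_centroid(projection_params, centroids):
            """Return (index of nearest centroid, shortest distance)."""
            dists = np.linalg.norm(centroids - projection_params, axis=1)  # Euclidean
            best = int(np.argmin(dists))
            return best, float(dists[best])

        centroids = np.array([[0.1, 0.3], [0.8, 0.7]])  # one centroid per category
        projection = np.array([0.15, 0.28])             # projected text parameters
        category, shortest = nearest_centroid(projection, centroids)
        print(category, shortest)  # category 0 is closest; similarity decreases with distance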
  • In an embodiment, the plurality of training samples may first be allocated to a plurality of text libraries, each training sample in each text library may then be processed separately to obtain the training model corresponding to that text library, and the text information may subsequently be identified according to the training model of each text library.
  • Step S105: the type of the templated text to which the text information belongs is identified based on the feature information.
  • The recognition result of the text information may be obtained according to the feature information, as shown in FIG. 3; that is, the type of the templated text to which the text information belongs is identified, and whether to intercept the text information may be determined according to that type.
  • The templated text may include multiple types. When the text information belongs to any one of the types, the text information may be intercepted; when the text information does not belong to any one of the types, the text information may be forwarded to the corresponding terminal.
  • For example, the templated text may include a first type and a second type, where the first type is templated text of bad information and the second type is normal templated text.
  • When the text information belongs to the first type, the text information may be intercepted; when the text information belongs to the second type, the text information may be forwarded to the corresponding terminal.
  • The black industry mainly uses machines to automatically generate and send templated texts. Therefore, in order to intercept bad information such as product-sales information and pornographic information sent by the black industry, the computer device can identify the received text information according to the training model.
  • The text information processing method performs word-cutting processing on the received text information according to a preset word-cutting rule to generate at least one word, and acquires the parameters corresponding to the at least one word, each parameter identifying a word; then, the feature information of the text information is determined according to the acquired parameters and a preset training model trained from at least one type of templated text, and the type of the templated text to which the text information belongs is identified according to the feature information.
  • In this way, the recognition result can be prevented from being interfered with by interference information such as word variants, punctuation marks, and/or other characters, thereby improving the accuracy of identifying the text information.
  • This embodiment of the present application provides another text information processing method.
  • The computer device may allocate the acquired plurality of training samples to a plurality of text libraries in advance, and then separately perform word-cutting, clustering, and other processing on each of the plurality of text libraries to generate a sub-training model corresponding to each text library. Finally, when the text information to be recognized is received, the text information may be identified according to the sub-training model corresponding to each text library.
  • FIG. 5 is a schematic flowchart diagram of a text information processing method according to an embodiment of the present application.
  • the method flow can include:
  • Step S201: Acquire a plurality of training samples corresponding to the templated text, and allocate the plurality of training samples to a plurality of text libraries.
  • When a single training model is generated by processing the parameters of all training samples together, the computational complexity is large. For example, for the n*p-dimensional sample matrix generated from the mapping relationship between the word set and the parameter set, as the number n of training samples increases, the dimension p of the sample matrix dataMat also increases, which increases the complexity of the SVD algorithm. Therefore, in this embodiment, the Boosting SVD algorithm is used: the plurality of training samples are allocated to a plurality of text libraries, and the text information in each text library is processed separately; for example, each library is computed by the SVD algorithm, which can greatly reduce the computational complexity.
  • the Boosting SVD algorithm is a combination of the clustering Boosting algorithm and the SVD algorithm.
  • The Boosting algorithm is an algorithm used to improve the accuracy of weak classification algorithms: it constructs a series of prediction functions and then combines them into a single prediction function. That is, the Boosting algorithm is a framework algorithm that obtains subsets of samples by operating on the sample set and then uses a weak classification algorithm to train a series of base classifiers on those sample subsets.
  • This embodiment uses the idea of the Boosting algorithm to identify the text information.
  • After the plurality of training samples are acquired, they may be allocated to the plurality of text libraries; for example, the plurality of text libraries may include the text library 1 to the text library n, where n is an integer and n > 1.
  • the number of the training samples and the manner of obtaining the samples may be flexibly set according to actual needs, and the specific content is not limited herein.
  • The training samples in each text library may be allocated randomly or according to templated texts of different scenes; for example, text library 1 may be allocated the training samples corresponding to pornographic information, text library 2 may be allocated the training samples corresponding to drug sales information, and so on; the specific content is not limited herein.
  • Step S202: Perform a first preprocessing on each training sample of each text library, and obtain a mapping relationship, a projection relationship, and a small class list corresponding to each text library.
  • The first preprocessing includes word-cutting processing, obtaining the parameters corresponding to the words, clustering processing, and the like.
  • First, each training sample of each text library is separately subjected to word-cutting processing to generate the word set corresponding to each text library; the word-cutting rule is consistent with the above-mentioned word-cutting rule and is not repeated here.
  • Then, the parameter set corresponding to the word set in each text library is obtained, such as parameter set 1 to parameter set n in FIG. 6.
  • The parameter set corresponding to the word set may be obtained by calculating the word frequency tf_{i,j} and the reverse text frequency idf_i of each word by the tf-idf algorithm, and then calculating the parameter corresponding to each word according to the word frequency tf_{i,j} and the reverse text frequency idf_i; the calculation manner is similar to the foregoing and is not described here again.
  • a parameter set corresponding to each text library can be generated.
  • Each word in the word set and each parameter in the parameter set can form a one-to-one mapping relationship, that is, the corresponding word set and the parameter set in each text library can form a mapping relationship.
  • text clustering may be performed on each of the plurality of training samples in each text library according to the parameter set of each text library to generate a small class list, as shown in FIG. 6 .
  • the text clustering may include a K-means clustering algorithm or a BIRCH clustering algorithm, etc., and the specific content is not limited herein.
  • Each small class list may include a list formed by one type of clustered text, or a plurality of lists including a plurality of types of clustered texts.
  • Then, the mapping relationship between the word set and the parameter set in each text library is transformed to generate a projection relationship of the mapping relationship on the preset space.
  • the calculation manner of the projection relationship corresponding to each text library is similar to the foregoing calculation manner, and details are not described herein again.
  • The Boosting SVD algorithm computes the projection relationship by applying the SVD algorithm to each text library separately, which greatly reduces the computational complexity of the SVD calculation phase; the Boosting algorithm then combines the multiple per-library SVD results into a unified result, which enhances accuracy.
  • The Boosting SVD algorithm can thus effectively alleviate the reduced accuracy and high computational complexity of SVD on big data, improving calculation accuracy while reducing complexity.
  • Step S203: Generate a sub-training model corresponding to each text library according to the mapping relationship, the projection relationship, and the small class list.
  • After the mapping relationship, the projection relationship, and the small class list corresponding to each text library are obtained, the sub-training model corresponding to each text library can be generated accordingly; as shown in FIG. 6, for example, sub-training model 1 to sub-training model n can be generated, where n is an integer and n > 1. A per-library sketch is given below.
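  • An illustrative per-library sketch of steps S201 to S203 (all names are assumptions; random allocation stands in for the allocation strategies described above, and the caller supplies the word-cutting and per-library fitting functions):

        import random

        def train_sub_models(training_samples, n_libraries, cut, fit_library):
            """Allocate samples to text libraries, then build one sub-model per library.

            cut: a word-cutting function (e.g. the cut_words sketch earlier).
            fit_library: builds a (mapping, projection, small class list) sub-model.
            """
            libraries = [[] for _ in range(n_libraries)]
            for sample in training_samples:            # random allocation (step S201)
                libraries[random.randrange(n_libraries)].append(sample)
            sub_models = []
            for library in libraries:                  # per-library processing (S202-S203)
                word_sets = [cut(sample) for sample in library]
                sub_models.append(fit_library(word_sets))
            return sub_models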
  • Step S204: Receive the text information to be identified, and perform a second preprocessing on the text information.
  • The second preprocessing includes word-cutting processing and obtaining the parameters corresponding to the words. The computer device receives the text information to be recognized; the text information may be information sent from a terminal, such as a tablet computer, a mobile phone, or a computer, to another terminal.
  • The text information may include Chinese, English, punctuation marks, emoticons, or other content; the specific content is not limited herein.
  • the terminal A sends a mail to the terminal B through the computer device, at which time the computer device receives the mail and performs a second pre-processing on the text information contained in the mail.
  • the terminal C transmits the promotion information to the plurality of terminals 1 to n (where n is an integer greater than 2) through the computer device, and the computer device receives the promotion information and performs the second pre-processing on the promotion information.
  • The computer device performs word-cutting processing on the received text information to be recognized according to a preset word-cutting rule to generate at least one word. Only the word 1 may be generated, or the word 1 to the word n may be generated, where n is an integer and n > 1.
  • A word may be composed of one Chinese character, of a plurality of characters and other symbols, or of English letters. In an embodiment, in practical applications, a word may include variant words; the specific content is not limited herein.
  • the word-cutting rule is similar to the above-mentioned word-cutting rule, and will not be described again here.
  • the computer device obtains parameters corresponding to the words by calculating: the word frequency tf i,j and the reverse text frequency idf i of each word are calculated by the tf-idf algorithm, and then The parameters corresponding to the word are calculated according to the word frequency tf i,j and the reverse text frequency idf i , and the calculation manner is similar to the foregoing calculation manner, and details are not described herein again.
  • the computer device may obtain the parameter corresponding to the word according to the mapping relationship in the sub-training model corresponding to each text library.
  • Step S205: Determine, according to the sub-training model corresponding to each text library, a large class list corresponding to the text information, and determine the feature information of the text information according to the large class list.
  • The computer device may determine the large class list corresponding to the text information according to the projection relationship, the small class list, and the like in the sub-training model corresponding to each text library, together with the parameter corresponding to each word, as shown in FIG. 7.
  • The large class list is obtained by clustering the text information against the text library 1 to the text library n: the categories, category 1 to category n, to which the text information belongs in the text library 1 to the text library n, respectively, are obtained, and together they compose the large class list, where n is an integer and n > 1.
  • That is, the text information to be identified has a clustering result with respect to the small class list of each text library, and the clustering results over the small class lists of all text libraries are collated to obtain the large class list.
  • Specifically, the parameters corresponding to each word are projected onto the preset space according to the projection relationship corresponding to each text library to generate projection parameters, and the centroids that the small class list corresponding to each text library generates by projection in the clustering region are acquired. The shortest distance between the projection parameters and the centroids is calculated for each text library, and the category to which the text information belongs in the small class list of each text library is determined according to the shortest distance corresponding to that text library, as sketched below.
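  • A sketch of step S205 under the same assumptions (each sub-model is assumed to expose a projection matrix and its small-class centroids; all names are illustrative):

        import numpy as np

        def large_class_list(word_params, sub_models):
            """word_params: parameter vector of the text to be recognized.

            sub_models: list of (projection_matrix, centroids) pairs, one per
            text library. Returns one (category, shortest distance) per library.
            """
            results = []
            for projection_matrix, centroids in sub_models:
                projected = word_params @ projection_matrix   # project per library
                dists = np.linalg.norm(centroids - projected, axis=1)
                best = int(np.argmin(dists))                  # small-class category
                results.append((best, float(dists[best])))
            return results                                    # collated large class list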
  • Step S206: Identify the type of the templated text to which the text information belongs according to the feature information.
  • the recognition result of the text information may be obtained according to the feature information, as shown in FIG. 7, that is, the type of the templated text to which the text information belongs is identified.
  • Text information often contains a lot of interference information and is often presented in short-text form, which creates great difficulties for word segmentation and part-of-speech analysis and reduces the accuracy of part-of-speech analysis.
  • In this embodiment, the training model is an unsupervised machine-learning training model.
  • The Boosting SVD algorithm is used to process the training samples, for example by word-cutting and clustering, so that the training samples of each type of templated text gather together separately and a training model is generated.
  • The text information to be recognized is then processed by the Boosting SVD algorithm, and the type of the templated text to which the text information belongs can be automatically recognized according to the feature information of the text information to be recognized.
  • On the one hand, the clustering effect is not affected by the result of word segmentation, the text length, or interference information.
  • The scheme is therefore equally applicable to long and short text information, with strong versatility, stability, and high accuracy; on the other hand, no manual labeling is required, which greatly reduces labor costs. This addresses the problems in the prior art of requiring a lot of manpower and of low recognition accuracy.
  • the embodiment of the present application further provides an apparatus based on the text information processing method.
  • The meanings of the terms used below are the same as those in the above text information processing method.
  • FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • The computer device may include a receiving unit 301, a first word-cutting unit 302, a parameter obtaining unit 303, a determining unit 304, an identifying unit 305, and the like.
  • the receiving unit 301 is configured to receive text information to be identified.
  • The text information processing method can be applied to e-mail, instant messaging (e.g., WeChat, QQ), blogs, circles of friends, information push, live broadcast, and other scenarios in which text information sent by a terminal needs to be recognized.
  • The receiving unit 301 receives the text information to be recognized; the text information may be sent from a terminal such as a tablet computer, a mobile phone, or a computer, and may be information sent by e-mail, information sent through instant messaging, information published on a blog, push information displayed in a pop-up box, information published in a circle of friends, or information displayed in a live broadcast.
  • The text information may include Chinese, English, punctuation marks, emoticons, or other content; the specific content is not limited herein.
  • The first word-cutting unit 302 is configured to perform word-cutting processing on the text information received by the receiving unit 301 according to a preset word-cutting rule to generate at least one word.
  • The first word-cutting unit 302 performs word-cutting processing on the text information to be recognized received by the receiving unit 301 according to a preset word-cutting rule.
  • The preset word-cutting rule may be cutting at a preset number of characters per interval; for example, every 2 characters are cut into one word, or every 1 character is cut into one word.
  • The preset word-cutting rule may also be a uniform word-cutting according to the total number of characters of the text information. For example, when the total number of characters of a piece of text information is 15, it may be divided evenly into one word every 5 characters.
  • The preset word-cutting rule may also be a random word-cutting.
  • For example, the text information with a total of 15 characters may be cut into a word of 2 characters, a word of 1 character, a word of 9 characters, and a word of 3 characters.
  • The preset word-cutting rules may be flexibly set according to actual needs, for example, dictionary-based word-cutting, statistics-based word-cutting, or artificial-intelligence-based word-cutting; the specific content is not limited herein.
  • In an embodiment, the word-cutting rule for the text information to be recognized may be determined according to a mapping relationship, where the mapping relationship is a mapping relationship between a word set and a parameter set.
  • At least one word may be generated. As shown in FIG. 3, only the word 1 may be generated, or the word 1 to the word n may be generated, n is an integer, and n>1.
  • A word may be composed of one Chinese character, of a plurality of characters and other symbols, or of English letters.
  • In practical applications, a word may also include variant words; the specific content is not limited herein.
  • Variant words are words that differ from the normative words; for example, the normative word is "beauty" and the corresponding variant is " ⁇ ".
  • The first word-cutting unit 302 may perform word-cutting processing on the text information received by the receiving unit 301 in real time or at preset intervals, or may perform word-cutting processing after the receiving unit 301 receives a preset amount of text information.
  • the parameter obtaining unit 303 is configured to acquire parameters corresponding to at least one word, and each parameter identifies a word.
  • The parameter obtaining unit 303 may acquire the parameter corresponding to one word, or respectively acquire the parameters corresponding to a plurality of words; as shown in FIG. 3, each word corresponds to one parameter.
  • Each parameter identifies a word and may be a number or a character string that uniquely identifies the word; for example, the parameter corresponding to "we" is 0.1, and the parameter corresponding to "I" is 0.5.
  • In an embodiment, the computer device pre-stores a training model that includes a mapping relationship between words and parameters, and the parameter obtaining unit 303 is configured to acquire the parameters corresponding to the at least one word according to the mapping relationship in the training model.
  • In an embodiment, the parameter obtaining unit 303 obtains the parameter corresponding to a word by calculation: first, the target frequency of the word in the text information to be recognized is acquired, where the target frequency is the frequency with which the word occurs in the text information to be recognized, i.e., M/X, where M represents the number of occurrences of the word q in the text information Q to be recognized and X represents the sum of the numbers of occurrences of all words in the text information Q to be recognized; then, the target reverse text frequency of the word is acquired, which is computed from the pieces of text information, among a plurality of pieces of text information, that contain the word.
  • When the mapping relationship exists, the parameter obtaining unit 303 may preferentially acquire the parameters corresponding to the at least one word according to the mapping relationship; otherwise, the parameter corresponding to the word is calculated according to the target frequency and the target reverse text frequency.
  • the determining unit 304 is configured to determine feature information of the text information according to the parameter acquired by the parameter obtaining unit 303 and the preset training model, and the training model is trained by at least one type of templated text.
  • the computer device is pre-configured with a training model that is trained from at least one type of templated text.
  • the training model is trained from at least one type of templated text of erotic information, drug sales information, investment information, pyramid sales information, and the like.
  • For example, in the templated text "Look [ ⁇ | Miss], hello [D | V | E] have benefits", the first variable is " ⁇ " or "Miss", the second variable is "D", "V", or "E", and the template part is "Look, hello, have benefits".
  • the computer device further includes:
  • a sample obtaining unit 306 configured to acquire a plurality of training samples corresponding to the templated text
  • a second word-cutting unit 307 configured to perform a word-cutting process on each training sample acquired by the sample acquiring unit 306 according to a word-cutting rule, to generate a word set including a plurality of words;
  • the processing unit 308 is configured to preprocess the set of words generated by the second word-cutting unit 307 to generate a parameter set, where each parameter in the parameter set is used to identify each word in the word set;
  • the clustering unit 309 is configured to perform clustering processing on the plurality of training samples according to the parameter set generated by the processing unit 308 to generate a text clustering list;
  • the generating unit 310 is configured to generate a training model according to the text clustering list generated by the clustering unit 309.
  • The sample obtaining unit 306 obtains a plurality of training samples corresponding to the templated text; the plurality of training samples may be randomly acquired from received historical text information, may be extracted from historical text information of different scenes such as pornographic information, drug sales information, and pyramid sales information, or may be created according to the different scenarios.
  • the number of the training samples and the manner of obtaining the samples may be flexibly set according to actual needs, and the specific content is not limited herein.
  • After the sample obtaining unit 306 obtains the plurality of training samples, the second word-cutting unit 307 performs word-cutting processing on each training sample according to a preset word-cutting rule; the preset word-cutting rule can use any word-cutting algorithm.
  • the preset word-cutting rule is consistent with the above-mentioned word-cutting rule for word-cutting processing of the text information, and is not described here.
  • After word-cutting, a word set including a plurality of words may be generated, as shown in FIG. 4. Alternatively, each training sample may correspond to one of the word set 1 to the word set n, which together form the word sets corresponding to the plurality of training samples; each of the word set 1 to the word set n may contain one or more words, where n is an integer and n > 1.
  • For example, if each of 100 training samples is cut into one word, a word set containing 100 words can be generated; if each training sample is cut into 6 words, a word set containing 600 words can be generated.
  • Processing unit 308 then preprocesses the resulting word set to generate a parameter set, as shown in FIG. 4; each parameter in the parameter set is used to identify a word in the word set. Alternatively, the word set 1 to the word set n corresponding to the training samples may respectively correspond to the parameter set 1 to the parameter set n, together constituting the parameter sets corresponding to the plurality of training samples, where n is an integer and n > 1.
  • In an embodiment, the processing unit 308 is specifically configured to: acquire the frequency of each word of the word set in each training sample, and the reverse text frequency of the target training samples including the word among the plurality of training samples; generate the target parameter corresponding to each word according to the frequency and the reverse text frequency; and generate the parameter set according to the target parameter corresponding to each word.
  • The preprocessing performed by the processing unit 308 on the word set includes a term frequency-inverse document frequency (tf-idf) conversion, a weighting technique for information retrieval and text mining that can be used to evaluate the degree to which a word is important for a piece of text information, or for one training sample among a plurality of training samples.
  • the tf in tf-idf represents the word frequency.
  • The term frequency (tf) refers to the frequency at which a given word appears in a document, that is, in this embodiment, the frequency with which a word occurs in a training sample.
  • The idf in tf-idf indicates the reverse text (inverse document) frequency. The word frequency is a normalization of the raw word count (i.e., the number of occurrences): since the same word tends to occur more times in a longer document than in a shorter one regardless of whether the word is important, normalization prevents the measure from being biased toward longer documents.
  • the inverse document frequency (idf) is a measure of the universal importance of a word.
  • The frequency (i.e., word frequency) with which the word t_i occurs in the training sample d_j is calculated as tf_{i,j} = n_{i,j} / Σ_k n_{k,j}, where tf_{i,j} represents the word frequency of the word t_i in the training sample d_j, n_{i,j} represents the number of occurrences of the word t_i in the training sample d_j, and Σ_k n_{k,j} represents the total number of occurrences of all words in the training sample d_j.
  • The reverse text frequency of the target training samples containing the word t_i among the plurality of training samples may be obtained by dividing the total number of training samples by the number of target training samples containing the word t_i, and then taking the logarithm of the quotient: idf_i = log( |D| / |{ j : t_i ∈ d_j }| ), where idf_i represents the reverse text frequency and |D| represents the total number of training samples.
  • The processing unit 308 can calculate the target parameter corresponding to each word according to the word frequency tf_{i,j} and the reverse text frequency idf_i.
  • After the frequency of each word of the word set in each training sample and the reverse text frequency of the target training samples containing the word among the plurality of training samples are calculated, the target parameter corresponding to each word can be generated according to the frequency and the reverse text frequency, and the parameter set is then generated according to the target parameter corresponding to each word.
  • Each word in the word set can form a one-to-one mapping relationship with each parameter in the parameter set.
  • The mapping relationship can be understood as a dictionary: when a word already exists in the dictionary, the parameter corresponding to it can be looked up in the dictionary without recalculation; otherwise, the parameter corresponding to the word needs to be calculated according to the aforementioned tf-idf conversion formula.
  • the clustering unit 309 may perform clustering processing on the plurality of training samples according to the parameter set, and the clustering processing may include a K-means clustering algorithm or a hierarchical clustering algorithm (Balanced Iterative Reducing and Clustering using Hierarchies) , BIRCH), etc., the specific content is not limited herein.
  • After the clustering processing, a text clustering list may be generated; the text clustering list may be a single list formed by one type of clustered text, or a plurality of lists including a plurality of types of clustered texts, each of which contains one type of clustered text.
  • The generating unit 310 can generate a training model according to the text clustering list, as shown in FIG. 4.
  • the computer device further includes:
  • the transform unit 311 is configured to perform a transform process on the mapping relationship between the word set and the parameter set, and generate a projection relationship of the mapping relationship on the preset space;
  • The transform unit 311 is specifically configured to: generate a sample matrix according to the mapping relationship, where each row vector of the sample matrix is the parameters corresponding to the words obtained after one training sample is processed; calculate a covariance matrix of the sample matrix and a diagonal matrix of its eigenvalues; and generate a transformation matrix according to the covariance matrix and the diagonal matrix, the transformation matrix being set as the projection relationship.
  • Specifically, the transform unit 311 converts the mapping relationship between the word set and the parameter set into an n*p-dimensional sample matrix dataMat, where the number of rows n of the sample matrix represents the number of training samples, and the number of columns p represents the number of words generated after each training sample is processed by word-cutting.
  • The vector length of each row of the generated matrix needs to be uniform. Since the number of words generated after word-cutting may be the same or different between training samples, row vectors of shorter length can be padded with 0 so that the vector length of each row is uniform; each row of the sample matrix then corresponds to the parameters of the words obtained after one training sample is processed.
  • Then, the covariance matrix X of the sample matrix dataMat is calculated, the eigenvalues of the sample matrix dataMat are calculated, and a diagonal matrix D is generated according to the eigenvalues. The diagonal matrix D is a (p, p)-dimensional diagonal matrix containing the eigenvalues λ1, λ2, ..., λp.
  • At this point, the transformation matrix P can be computed from the covariance matrix X by singular value decomposition (SVD), according to the following formula:
    X = P D P^T
  • P is a (p, p)-dimensional orthogonal matrix, namely the transformation matrix, and each column of P is an eigenvector of the covariance matrix X. The transformation matrix P solved in this way by SVD is set as the projection relationship of the sample matrix dataMat (i.e., of the mapping relationship) on the preset space.
  • The preset space may be a principal component space over the parameters corresponding to the words of the training samples. The projection of the sample matrix onto the principal component space can be expressed as Y = dataMat × P, where Y denotes the projection relationship.
  • The projection may also be carried out on only some of the dimensions of the sample matrix dataMat: if only the top-j principal components are used, the projection becomes Y_j = dataMat × P_j, where P_j may be the first j columns of the transformation matrix P; that is, P_j is a (p, j)-dimensional matrix and Y_j is an (n, j)-dimensional matrix, and Y_j denotes the partial projection relationship.
  • In some embodiments, an inverse mapping relationship may be generated from the transformation matrix and the projection relationship by pulling the mapping back from the principal component space to the original space, i.e., R_j = Y_j × (P_j)^T, where R_j, the inverse mapping formed after reconstruction from the top-j principal components, is an (n, p)-dimensional matrix; the word corresponding to a parameter can then be determined according to this inverse mapping relationship.
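  • A minimal NumPy sketch of the transformation just described, using a hypothetical sample matrix: the decomposition X = P D P^T, the top-j projection Y_j = dataMat × P_j, and the pull-back reconstruction R_j = Y_j × (P_j)^T follow the formulas above.

```python
import numpy as np

# Hypothetical n x p sample matrix (n samples, p word parameters per sample).
dataMat = np.array([
    [0.12, 0.40, 0.05],
    [0.11, 0.42, 0.04],
    [0.50, 0.02, 0.33],
    [0.48, 0.03, 0.35],
])

# Covariance matrix X of the columns: a (p, p) matrix.
X = np.cov(dataMat, rowvar=False)

# SVD of the symmetric covariance matrix gives X = P D P^T,
# where P is orthogonal and D holds the eigenvalues on its diagonal.
P, eigvals, _ = np.linalg.svd(X)
D = np.diag(eigvals)
assert np.allclose(X, P @ D @ P.T)

j = 2                        # keep the top-j principal components
P_j = P[:, :j]               # first j columns of the transformation matrix
Y_j = dataMat @ P_j          # projection onto the principal component space
R_j = Y_j @ P_j.T            # pull-back reconstruction in the original space
print(Y_j.shape, R_j.shape)  # (n, j) and (n, p)
```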
  • In one embodiment, the generating unit 310 is specifically configured to generate the training model according to the mapping relationship, the projection relationship, and the text clustering list; that is, the mapping relationship between the word set and the parameter set (which may be a sample matrix), the projection relationship of the mapping relationship on the preset space (which may be a transformation matrix), and the text clustering list are stored together as the training model.
  • After the parameters corresponding to the words have been determined, the determining unit 304 may determine the feature information of the text information according to the parameters and the training model. The feature information may include the category of the text information in the text clustering list, the number of texts corresponding to that category, and the similarity between the text information and the training samples in the text clustering list. The feature information can also be set flexibly according to actual needs, and the specific content is not limited herein.
  • In one embodiment, the determining unit 304 includes a determining subunit 3041, configured to determine the feature information of the text information according to the parameters, the projection relationship in the training model, and the text clustering list in the training model.
  • In some embodiments, the determining subunit 3041 is specifically configured to: project the parameters onto the preset space according to the projection relationship to generate projection parameters; obtain the shortest distance between the projection parameters and the centroids of the clustering region in which the text clustering list is located; and determine, according to the shortest distance, the category to which the text information belongs in the text clustering list, the number of texts corresponding to the category, and the similarity between the text information and the training samples in the text clustering list.
  • First, the determining subunit 3041 projects the parameters corresponding to the words onto the preset space (for example, the principal component space) according to the determined projection relationship, generating projection parameters, and acquires the centroids generated by projecting the text clustering list within the clustering region; there may be one or more centroids.
  • Then, the determining subunit 3041 calculates the distance between the projection parameters and each centroid; the distance may be a Euclidean distance, a Chebyshev distance, or a Hamming distance, and the specific choice is not limited herein. The shortest distance between the projection parameters and the centroids is then determined: when there is only one centroid, the distance between that centroid and the projection parameters is the shortest distance; when there are multiple centroids, the shortest of the distances between the centroids and the projection parameters is taken.
  • The shorter the distance between a centroid and the projection parameters, the higher the similarity between the training samples in the text clustering list corresponding to that centroid and the text information to be identified. After the shortest distance is determined, the category to which the text information belongs in the text clustering list, the number of texts corresponding to the category, and the similarity between the text information and the training samples in the text clustering list may all be determined according to the shortest distance.
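  • A minimal sketch of this distance step, assuming Euclidean distance; the projected vector, the centroids, and the 1/(1 + d) similarity are illustrative choices, since the patent fixes neither the metric nor a similarity formula.

```python
import numpy as np

def nearest_centroid(projection_param, centroids):
    # Euclidean distance between the projection parameter and each centroid;
    # Chebyshev or Hamming distance could be substituted here.
    dists = np.linalg.norm(centroids - projection_param, axis=1)
    best = int(np.argmin(dists))       # category with the shortest distance
    return best, float(dists[best])

# Hypothetical inputs: a projected text vector and two cluster centroids.
proj = np.array([0.10, 0.41])
centroids = np.array([[0.11, 0.40], [0.52, 0.03]])
category, shortest = nearest_centroid(proj, centroids)
# The shorter the distance, the more similar the text is to that cluster;
# one illustrative report is a decreasing function of the distance.
similarity = 1.0 / (1.0 + shortest)
print(category, shortest, similarity)
```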
  • In one embodiment, in order to reduce the computational complexity, the plurality of training samples may be allocated to multiple text libraries; each training sample in each text library is then processed separately to generate the training model corresponding to that text library, and the text information is subsequently identified according to the training model of each text library.
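  • The multi-library arrangement might be sketched as follows; the per-library clustering and the random placeholder data are assumptions, the point being that each text library trains its own sub-model so that no single decomposition has to handle all samples at once.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_library_model(X, n_clusters=2):
    # One sub-training-model per text library: cluster that library's
    # parameter vectors and keep the centroids for later matching.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return {"centroids": km.cluster_centers_, "labels": km.labels_}

# Hypothetical parameter vectors split across two text libraries.
libraries = [np.random.rand(100, 8), np.random.rand(100, 8)]
models = [train_library_model(X) for X in libraries]
# At recognition time a text is matched against every library's model and
# the per-library results are combined into the final category list.
```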
  • the identifying unit 305 is configured to identify, according to the feature information obtained by the determining unit 304, the type of the templated text to which the text information belongs.
  • After the feature information of the text information is determined, the identification unit 305 can obtain the recognition result of the text information according to the feature information, as shown in FIG. 3; that is, the identification unit 305 recognizes the type of the templated text to which the text information belongs, and whether the text information is intercepted can be determined based on that type.
  • For example, the templated text may include multiple types. When the text information belongs to any one of the types, the text information may be intercepted; when the text information does not belong to any of the types, the text information may be forwarded to the corresponding terminal.
  • It should be noted that the templated text may include a first type and a second type, where the first type is templated text of bad information and the second type is normal templated text. When the text information belongs to the first type, the text information may be intercepted; when the text information belongs to the second type, the text information may be forwarded to the corresponding terminal.
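  • Expressed as control flow, the interception rule above might look like the following sketch; the type labels and the forward callback are hypothetical names, not part of the patent.

```python
BAD_TYPES = {"spam", "porn", "pyramid_scheme"}  # hypothetical first-type labels

def handle_text(text_type, text, forward):
    # First type (templated text of bad information): intercept.
    if text_type in BAD_TYPES:
        return "intercepted"
    # Second type (normal templated text): forward to the corresponding terminal.
    forward(text)
    return "forwarded"
```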
  • Since text information is also the main channel through which the black industry spreads bad information, and the black industry mainly uses automata to generate templated text that is sent automatically, the computer device can identify the received text information according to the training model in order to intercept bad information, such as product-promotion information and pornographic information, sent by the black industry.
  • As can be seen from the above, in the computer device provided by the embodiment of the present application, the first word-cutting unit 302 performs word-cutting processing on the text information received by the receiving unit 301 according to a preset word-cutting rule to generate at least one word, the parameter acquisition unit 303 acquires the parameter corresponding to each word, the determining unit 304 determines the feature information of the text information according to the obtained parameters and the preset training model trained from at least one type of templated text, and the identification unit 305 identifies the type of templated text to which the text information belongs according to the feature information. Since this solution requires no part-of-speech analysis at any point in the process, the recognition result is not disturbed by interference information such as word variants, punctuation marks, and/or other characters, thereby improving the accuracy of identifying the text information.
  • The embodiment of the present application further provides a server, into which the computer device of the embodiment of the present application can be integrated.
  • FIG. 12 shows a schematic structural diagram of a server involved in the embodiment of the present application, specifically:
  • The server may include a processor 401 having one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power source 403, and an input unit 404. It will be understood by those skilled in the art that the server structure illustrated in FIG. 12 does not constitute a limitation on the server, which may include more or fewer components than those illustrated, combine certain components, or use a different arrangement of components. Specifically:
  • The processor 401 is the control center of the server. It connects the various parts of the entire server using various interfaces and lines, and, by running or executing the software programs and/or modules stored in the memory 402 and invoking the data stored in the memory 402, executes the various functions of the server and processes data, thereby monitoring the server as a whole.
  • In one embodiment, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, applications, and the like, and the modem processor mainly handles wireless communication. In one embodiment, the modem processor may also not be integrated into the processor 401.
  • The memory 402 can be used to store software programs and modules; the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 can mainly include a program storage area and a data storage area, where the program storage area can store the operating system, applications required for at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area can store data created according to the use of the server, and so on. In addition, the memory 402 can include high-speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 can also include a memory controller to provide the processor 401 with access to the memory 402.
  • The server also includes a power source 403 that supplies power to the various components. Preferably, the power source 403 can be logically connected to the processor 401 via a power management system, so that functions such as charging, discharging, and power consumption management are managed through the power management system. The power source 403 may also include any one or more of a DC or AC power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other components.
  • The server can also include an input unit 404, which can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
  • the server may further include a display unit or the like, and details are not described herein again.
  • Specifically, in this embodiment, the processor 401 in the server loads the executable files corresponding to the processes of one or more applications into the memory 402 according to the following instructions, and the processor 401 runs the applications stored in the memory 402, thereby implementing various functions, as follows:
  • Receive text information to be identified; perform word-cutting processing on the text information according to a preset word-cutting rule to generate at least one word; acquire the parameters corresponding to the at least one word, each parameter identifying one word; determine the feature information of the text information according to the parameters and a preset training model trained from at least one type of templated text; and identify the type of the templated text to which the text information belongs according to the feature information.
  • In one embodiment, the processor 401 is further configured to: acquire a plurality of training samples corresponding to the templated text; perform word-cutting processing on each training sample according to the word-cutting rule to generate a word set containing a plurality of words; preprocess the word set to generate a parameter set, where each parameter in the parameter set is used to identify a word in the word set; perform clustering processing on the plurality of training samples according to the parameter set to generate a text clustering list; and generate the training model according to the text clustering list.
  • In one embodiment, the processor 401 is further configured to: obtain the frequency with which each word in the word set occurs in each training sample, and the inverse document frequency of the target training samples containing the word among the plurality of training samples; generate the target parameter corresponding to each word according to the frequency and the inverse document frequency; and generate the parameter set according to the target parameters corresponding to the words.
  • In one embodiment, the processor 401 is further configured to perform transformation processing on the mapping relationship between the word set and the parameter set to generate the projection relationship of the mapping relationship on the preset space; the step of generating the training model according to the text clustering list then includes: generating the training model according to the mapping relationship, the projection relationship, and the text clustering list.
  • In one embodiment, the processor 401 is further configured to determine the feature information of the text information according to the parameters, the projection relationship in the training model, and the text clustering list in the training model.
  • In one embodiment, the processor 401 is further configured to: project the parameters onto the preset space according to the projection relationship to generate projection parameters; obtain the shortest distance between the projection parameters and the centroids of the clustering region in which the text clustering list is located; and determine, according to the shortest distance, the category to which the text information belongs in the text clustering list, the number of texts corresponding to the category, and the similarity between the text information and the training samples in the text clustering list.
  • In one embodiment, the processor 401 is further configured to: generate a sample matrix according to the mapping relationship, where each row vector of the sample matrix consists of the parameters corresponding to the words obtained after one training sample is cut into words; obtain the covariance matrix of the sample matrix and the eigenvalues of the sample matrix, and generate a diagonal matrix according to the eigenvalues; and generate a transformation matrix according to the covariance matrix and the diagonal matrix, setting the transformation matrix as the projection relationship.
  • In one embodiment, the processor 401 is further configured to acquire the parameters corresponding to the at least one word according to the mapping relationship in the training model.
  • As can be seen from the above, the server provided by the embodiment of the present application performs word-cutting processing on the received text information according to a preset word-cutting rule to generate at least one word, and acquires the parameters corresponding to the at least one word, each parameter identifying one word. Then, the feature information of the text information is determined according to the obtained parameters and the preset training model, which is trained from at least one type of templated text, and the type of the templated text to which the text information belongs is identified according to the feature information. Since this solution requires no part-of-speech analysis at any point in the process, the recognition result is not disturbed by interference information such as word variants, punctuation marks, and/or other characters, thereby improving the accuracy of identifying the text information.
  • The embodiment of the present application further provides a storage medium in which a plurality of instructions are stored; the instructions can be loaded by a processor to perform the steps in any text information processing method provided in the embodiments of the present application. For example, the instructions may perform the following steps: receive text information to be identified, and perform word-cutting processing on the text information according to a preset word-cutting rule to generate at least one word; acquire the parameters corresponding to the at least one word, each parameter identifying one word; determine the feature information of the text information according to the parameters and a preset training model trained from at least one type of templated text; and identify the type of the templated text to which the text information belongs according to the feature information.
  • In one embodiment, the instructions may further perform the following steps: acquire a plurality of training samples corresponding to the templated text; perform word-cutting processing on each training sample according to the word-cutting rule to generate a word set containing a plurality of words; preprocess the word set to generate a parameter set, where each parameter in the parameter set is used to identify a word in the word set; perform clustering processing on the plurality of training samples according to the parameter set to generate a text clustering list; and generate the training model according to the text clustering list.
  • the storage medium may include a read only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text information processing method, a computer device, and a computer-readable storage medium, the method including: receiving text information to be identified (S101); performing word-cutting processing on the text information according to a preset word-cutting rule to generate at least one word (S102); acquiring parameters corresponding to the at least one word, each parameter identifying one word (S103); determining feature information of the text information according to the parameters and a preset training model (S104), the training model being trained from at least one type of templated text; and determining, according to the feature information, the type of the templated text to which the text information belongs (S105).

Description

文本信息处理方法、计算机设备及计算机可读存储介质
相关申请的交叉引用
本申请要求于2017年11月20日提交中国专利局、申请号为201711159103.2、发明名称为“一种文本信息处理方法、装置及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及通信技术领域,具体涉及一种文本信息处理方法、计算机设备及计算机可读存储介质。
背景技术
文本信息是社交平台的主要信息载体,在需要大量发送类似内容的文本信息时,可以通过模型生成模版化文本并进行发送。当需要对文本信息进行处理时,可以通过对接收到的文本信息进行识别,以便根据识别结果对该文本信息进行相应的处理,例如,对识别出的垃圾信息进行拦截、或者对识别出的色情信息进行屏蔽,等等,因此,如何准确地识别出垃圾信息或色情信息等至关重要。
现有技术中,在发送文本信息,如发送电子邮件、即时通讯信息、博客、朋友圈及直播弹幕等场景中,当接收到文本信息时,首先对该文本信息执行切词及词性分析等特征提取的步骤,提取出一个或多个词语,例如,根据主谓宾进行切词,提取出一个或多个词语。然后,将一个或多个词语传给训练好的模型进行预测处理,由于该模型由模版化文本训练而成,因此,进行预测处理后可以识别出该文本信息所属的模版化文本的类型,即识别出该文本信息是通过哪种类型的模型生成的模版化文本。例如,是属于垃圾信息还是属于色情信息等。
由于现有技术的方案十分依赖于词性分析的准确度,而对于采用了干扰信息的文本信息而言,其词性分析的准确度均较低,所以,现有方案对文本 信息识别的准确性并不高。
发明内容
有鉴于此,本申请的各种实施例,提供了一种文本信息处理方法、计算机设备及计算机可读存储介质。
一种文本信息处理方法,该方法由计算机设备实施,包括:
接收待识别的文本信息,按照预设的切词规则对所述文本信息进行切词处理,生成至少一个词语;
获取所述至少一个词语对应的参数,每个参数标识一个词语;
根据所述参数及预置的训练模型确定所述文本信息的特征信息,所述训练模型由至少一个类型的模板化文本训练而成;及
根据所述特征信息确定所述文本信息所属的所述模板化文本的类型。
一种计算机设备,包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行以下步骤:
接收待识别的文本信息,按照预设的切词规则对所述文本信息进行切词处理,生成至少一个词语;
获取所述至少一个词语对应的参数,每个参数标识一个词语;
根据所述参数及预置的训练模型确定所述文本信息的特征信息,所述训练模型由至少一个类型的模板化文本训练而成;及
根据所述特征信息确定所述文本信息所属的所述模板化文本的类型。
一种非易失性的计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行以下步骤:
接收待识别的文本信息,按照预设的切词规则对所述文本信息进行切词处理,生成至少一个词语;
获取所述至少一个词语对应的参数,每个参数标识一个词语;
根据所述参数及预置的训练模型确定所述文本信息的特征信息,所述训练模型由至少一个类型的模板化文本训练而成;及
根据所述特征信息确定所述文本信息所属的所述模板化文本的类型。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。本申请的其它特征、目的和优点将从说明书、附图以及权利要求书变得明显。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1是本申请实施例提供的文本信息处理系统的场景示意图;
图2是本申请实施例提供的文本信息处理方法的流程示意图;
图3是本申请实施例提供的文本信息处理方法的另一流程示意图;
图4是本申请实施例提供的文本信息处理方法的另一流程示意图;
图5是本申请实施例提供的文本信息处理方法的另一流程示意图;
图6是本申请实施例提供的文本信息处理方法的另一流程示意图;
图7是本申请实施例提供的文本信息处理方法的另一流程示意图;
图8是本申请实施例提供的计算机设备的硬件结构示意图;
图9是本申请实施例提供的计算机设备的硬件结构示意图;
图10是本申请实施例提供的计算机设备的硬件结构示意图;
图11是本申请实施例提供的计算机设备的硬件结构示意图;
图12是本申请实施例提供的服务器的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
在以下的说明中,本申请的具体实施例将参考由一部或多部计算机所执行的步骤及符号来说明,除非另有述明。因此,这些步骤及操作将有数次提 到由计算机执行,本文所指的计算机执行包括了由代表了以一结构化型式中的数据的电子信号的计算机处理单元的操作。此操作转换该数据或将其维持在该计算机的内存系统中的位置处,其可重新配置或另外以本领域测试人员所熟知的方式来改变该计算机的运作。该数据所维持的数据结构为该内存的实体位置,其具有由该数据格式所定义的特定特性。但是,本申请原理以上述文字来说明,其并不代表为一种限制,本领域测试人员将可了解到以下所述的多种步骤及操作亦可实施在硬件当中。
本申请实施例提供一种文本信息处理方法、装置及存储介质。
请参阅图1,图1为本申请实施例所提供的文本信息处理系统的场景示意图,该文本信息处理系统可以包括文本信息处理装置,该文本信息处理装置具体可以集成在服务器中,主要用于接收待识别的文本信息,按照预设的切词规则对文本信息进行切词处理,生成至少一个词语。然后,获取至少一个词语对应的参数,每个参数标识一个词语,再根据参数及预置的训练模型确定文本信息的特征信息,该训练模型由至少一个类型的模板化文本训练而成。最后,根据特征信息可确定识别结果,即根据特征信息识别文本信息所属的模板化文本的类型,等等。
此外,该文本信息处理系统还包括存储器,用于存储训练模型,服务器可以预先根据训练样本训练得到训练模型,并将该训练模型存储至存储器,以便后续需要对待识别的文本信息进行识别时,可以直接从存储器中获取训练模型对待识别的文本信息进行识别。
该文本信息处理系统还包括一个终端(例如,终端A)或多个终端(例如,终端A、终端B及终端C等),该终端可以是平板电脑、手机、笔记本电脑、台式电脑等具备储存单元并安装有微处理器而具有运算能力的终端。该终端可以向计算机设备发送待识别的文本信息,以使得计算机设备对接收到的待识别的文本信息进行相应的处理。或者是,该终端可以向计算机设备发送多条训练样本,以使得计算机设备对接收到的多条训练样本进行训练,生成训练模型,等等。
以下分别进行详细说明。
在本实施例中,将从计算机设备的角度进行描述,该计算机设备具体可以集成在服务器或网关等网络设备中。
一种文本信息处理方法,包括:接收待识别的文本信息;按照预设的切词规则对文本信息进行切词处理,生成至少一个词语;获取至少一个词语对应的参数,每个参数标识一个词语;根据参数及预置的训练模型确定文本信息的特征信息,训练模型由至少一个类型的模板化文本训练而成;根据特征信息识别文本信息所属的模板化文本的类型。
请参阅图2,图2是本申请第一实施例提供的文本信息处理方法的流程示意图。该文本信息处理方法包括:
在步骤S101中,接收待识别的文本信息。
本实施例中,文本信息处理方法可以应用在电子邮件、即时通讯(例如,微信、QQ等)、博客、朋友圈、信息推送及直播弹幕等,需要对终端发送的文本信息进行识别的场景。
计算机设备接收待识别的文本信息,该文本信息可以是平板电脑、手机、电脑等终端,通过电子邮件发送的信息、通过即时通讯发送的信息、通过博客发表的信息、通过弹框显示的推送信息、通过朋友圈发表的信息的及通过直播弹幕显示的信息等。该文本信息可以包括中文、英文、标点符号或表情等信息,具体内容在此处不作限定。
在步骤S102中,按照预设的切词规则对文本信息进行切词处理,生成至少一个词语。
计算机设备按照预设的切词规则,对接收到的待识别文本信息进行切词处理,该预设的切词规则可以是按照每间隔预设字数进行切词,例如,每间隔2个字切为一个词语,或者是每间隔1个字切为一个词语。该预设的切词规则也可以是按照文本信息的总字数进行均匀切词,例如,当某条文本信息的总字数为15个时,可以均分每隔5个字切为一个词语。该预设的切词规则还可以是随机切词,例如,当某条文本信息的总字数为15个时,从中仅提取出3组2个字组成的词语。或者是,将总字数为15个的文本信息,切割为一个2个字组成的词语,一个1个字组成的词语,一个9个字组成的词语,以及一个3个字组成的词语。
在一个实施例中,该预设的切词规则可根据实际需要进行灵活设置,例如,基于字典的切词、基于统计的切词或基于人工智能的切词等,具体内容在此处不作限定。
需要说明的是,对待识别的文本信息进行切词时,若需要保证切得的词语与映射关系中存储的词语一致,此时,可以根据映射关系确定对待识别文本信息的切词规则,该映射关系为词语集与参数集之间的映射关系。例如,多条训练样本中存在某条训练样本“一一二二三三”每隔两个字的切词规则,得到“一一”、“二二”及“三三”,对于接收到的待识别的文本信息“一一一二二三三”,可以切为“一”、“一一”、“二二”及“三三”,这样就可以保证得到的“一一”、“二二”及“三三”与映射关系中存储的一致。
对文本信息进行切词处理后,可以生成至少一个词语,如图3所示,可以是只生成词语1,也可以是生成词语1至词语n等,n为整数,且n>1。该词语可以是由一个中文字组成,也可以是由多个字及其他符号组成,还可以是由英文组成。在一个实施例中,在实际应用中,该词语可以包括变种的词语,具体内容在此处不作限定。变种的词语是指采用有异于规范词语表达的词语,例如,规范词语为“美女”,对应变种的词语为“渼汝”等。
需要说明的是,计算机设备可以是实时或每隔预设时间对接收到的文本信息进行切词处理,或者是抽样对接收到预设数量的文本信息进行切词处理。
在步骤S103中,获取至少一个词语对应的参数。
在对文本信息进行切词处理,生成一个或多个词语后,可以获取一个词语对应的参数,或分别获取多个词语对应的参数,图3中,每个词语对应一个参数。每个参数标识一个词语,该参数可以是一个数字,也可以是唯一标识词语的字符等。例如,“我们”对应的参数为0.1,“我”对应的参数为0.5。
在某些实施方式中,计算机设备预先存储有训练模型,该训练模型包括词语与参数之间的映射关系,获取至少一个词语对应的参数的步骤可以包括:根据训练模型中的映射关系获取至少一个词语对应的参数。
在某些实施方式中,计算机设备通过计算获取词语对应的参数:首先,获取词语在待识别的文本信息中存在的目标频率,该目标频率即为该词语在待识别的文本信息中存在的频率,例如,对于在某条待识别的文本信息Q中的词语q,词语q在该条待识别的文本信息Q中存在的目标频率的计算公式为:Y=M/X,Y表示词语q在待识别的文本信息Q中的目标频率,M表示词语q在待识别的文本信息Q中出现的次数,X表示在待识别的文本信息Q中所有词语出现的次数之和。
以及,获取在预设时间段内接收到的多条文本信息中,包含该词语的文本信息在该多条文本信息的目标逆向文本频率,该目标逆向文本频率为该词语的文本信息在该多条文本信息的逆向文本频率,其计算公式为:S=log(R/T),S表示目标逆向文本频率,R表示多条文本信息的总数目,T表示包含词语a的目标文本信息的数目,log为对数函数。然后,根据目标频率及目标逆向文本频率生成该词语对应的参数,其计算公式为:H=Y×S。
需要说明的是,计算机设备也可以优先根据映射关系获取至少一个词语对应的参数,当该映射关系中不存在至少一个词语对应的参数时,再根据目标频率及目标逆向文本频率计算词语对应的参数。
在步骤S104中,根据参数及预置的训练模型确定文本信息的特征信息。
计算机设备预先设置有训练模型,该训练模型由至少一个类型的模板化文本训练而成。例如,该训练模型由色情信息、卖药信息、投资信息、传销信息等类型中的至少一个类型的模板化文本训练而成。
模板化文本可以为包括变量及模板部分等的文本信息。例如,“看渼汝,你好=丫丫丫丫D有福利”,“看小姐,你好=丫丫丫丫V有福利”,“看小姐,你好=丫丫丫丫E有福利”,这三条文本信息中,可以是由“看[渼汝|小姐],你好=丫丫丫丫[D|V|E]有福利”组成的模板化文本,变量为“渼汝”或“小姐”,以及变量为“D”或“V”或“E”,模板部分为“看,你好=丫丫丫丫有福利”。
在某些实施方式中,训练模型生成的步骤可包括:
步骤(1)获取模板化文本对应的多条训练样本;
步骤(2)按照预设的切词规则将每条训练样本分别进行切词处理,生成包含多个词语的词语集;
步骤(3)对词语集进行预处理,生成参数集,参数集中的每个参数用于标识词语集中的每个词语;
步骤(4)根据参数集对多条训练样本进行聚类处理,生成文本聚类列表;
步骤(5)根据文本聚类列表生成训练模型。
为了有针对性地进行训练,获取模板化文本对应的多条训练样本的方式,可以从接收到的历史文本信息中,随机获取模板化文本对应的多条训练样本,也可以是从色情信息、卖药信息、传销信息等不同场景的历史文本信息中抽取多条训练样本,还可以是根据不同场景制造出模板化文本对应的多条训练 样本。在一个实施例中,训练样本的条数及获取方式可以根据实际需要进行灵活设置,具体内容在此处不作限定。
在获取到多条训练样本后,按照预设的切词规则将每条训练样本分别进行切词处理,该预设的切词规则可以使用任何切词算法,为了提高对文本信息进行处理的可靠性,该预设的切词规则与前述提到的对文本信息进行切词处理的切词规则是一致的,此处不赘述。
对多条训练样本进行切词处理后,可以生成包含多个词语的词语集,如图4所示。还可以是每条训练样本对应词语集1至词语集n,组成多条训练样本对应的词语集,词语集1至词语集n中包含的词语可以是一个或多个,n为整数,且n>1。
例如,当100条训练样本中,若每条训练样本均提取出一个词语,则可以生成包含100个词语的词语集;若每条训练样本均切为6个词语,则可以生成包含600个词语的词语集。
然后,对得到的词语集进行预处理,生成参数集,如图4所示,参数集中的每个参数用于标识词语集中的每个词语。还可以是每条训练样本对应词语集1至词语集n,分别对应的参数集1至参数集n,组成多条训练样本对应的参数集,n为整数,且n>1。
在一个实施例中,对词语集进行预处理,生成参数集的步骤可以包括:获取词语集中每个词语在每条训练样本中存在的频率,以及包含词语的目标训练样本在多条训练样本中的逆向文本频率;根据频率及逆向文本频率生成每个词语对应的目标参数;根据每个词语对应的目标参数生成参数集。
对词语集进行预处理包括对词语集进行加权算法(term frequency–inverse document frequency,tf-idf)转换,该tf-idf是一种用于信息检索与文本挖掘的加权技术,可以用来评估一个词语对于一条文本信息,或对于多条训练样本中的其中一条训练样本的重要程度。词语的重要性随着它在文本信息中出现的次数成正比增加,随着它在多条训练样本中出现的频率成反比下降。
tf-idf中的tf表示词频,在一份给定的文件里,词频(term frequency,tf)指的是某一个给定的词语在该文件中出现的频率,即本实施例中一个词语在一条训练样本中存在的频率。tf-idf中的idf表示逆向文本频率,是对词语的数量(即出现次数)进行归一化,由于同一个词语在较长的文件里可能会比较 短的文件里有更高的词数,而不管该词语重要与否,因此,逆向文本频率以防止词数偏向较长的文件。
对于在某条训练样本dj中的词语ti,其在该条训练样本dj中存在的频率(即词频)的计算公式为:
Figure PCTCN2018114188-appb-000001
以上式子中,tf i,j表示词语ti在训练样本dj中的词频,n i,j表示词语ti在训练样本dj中出现的次数,∑ kn k,j表示在训练样本dj中所有词语出现的次数之和。例如,当将训练样本dj切为3个词语时,k=3,∑ kn k,j表示在训练样本dj中这3个词语出现的次数之和。
逆向文本频率(inverse document frequency,idf)是一个词语普遍重要性的度量。对于词语的ti,包含词语ti的目标训练样本在多条训练样本中的逆向文本频率,可以由多条训练样本的总数目,除以包含该词语ti的目标训练样本的数目,再将得到的商取对数得到,其计算公式如下:
Figure PCTCN2018114188-appb-000002
idf i表示逆向文本频率,|D|表示多条训练样本的总数目,|{j:t i∈d j}|表示包含词语ti的目标训练样本的数目(即n i,j!=0的训练样本数目)。
由于如果该词语ti不在多条训练样本中,就会导致分母为零,因此,可以使用以下计算公式:
Figure PCTCN2018114188-appb-000003
在得到词语ti在某条训练样本dj中存在的频率tf i,j,以及逆向文本频率idf i后,可以根据该频率tf i,j及逆向文本频率idf i计算该词语对应的目标参数a,其计算公式为:a=tf i,j×idf i
按照上述方法计算词语集中每个词语在每条训练样本中存在的频率,以及包含词语的目标训练样本在多条训练样本中的逆向文本频率后,可以根据频率及逆向文本频率生成每个词语对应的目标参数,然后根据每个词语对应的目标参数生成参数集。
词语集中的每个词语与参数集中的每个参数之间可以形成一一对应的映 射关系。该映射关系可以理解为字典,在对待识别的文本信息进行切词处理得到至少一个词语后,可以在该字典中查找该至少一个词语对应的参数,而不需要重新计算。或者是,当该字典中不存在某个词语对应的参数时,需要根据前述的tf-idf转换公式计算这个词语对应的参数。
在得到参数集后,可以根据参数集对多条训练样本进行聚类处理,该聚类处理可以包括K-means聚类算法或层次聚类算法(Balanced Iterative Reducing and Clustering using Hierarchies,BIRCH)等,具体内容在此处不作限定。
根据参数集对多条训练样本进行聚类处理后,可以生成文本聚类列表,图4中,该文本聚类列表中可以包括一种类型的聚类文本形成的一个列表,或者是包括多种类型的聚类文本形成对应的多个列表,每个列表包含一种类型的聚类文本。最后,可以根据文本聚类列表生成训练模型,如图4所示。
在一个实施例中,对词语集进行预处理,生成参数集的步骤之后,文本信息处理方法还包括:对词语集与参数集之间的映射关系进行变换处理,生成映射关系在预设空间上的投影关系。
在一个实施例中,对词语集与参数集之间的映射关系进行变换处理,生成映射关系在预设空间上的投影关系的步骤包括:
根据映射关系生成样本矩阵,其中样本矩阵的每行向量为每条训练样本切词处理后得到的词语对应的参数;
获取样本矩阵的协方差矩阵,以及获取样本矩阵的特征值,根据特征值生成对角矩阵;
根据协方差矩阵及对角矩阵生成转换矩阵,将转换矩阵设定为投影关系。
首先,将词语集与参数集之间的映射关系转变为n*p维的样本矩阵dataMat,样本矩阵的行数n表示训练样本的条数,样本矩阵的列数p表示每条训练样本进行切词处理后生成词语的个数。
需要说明的是,为了能够使得映射关系以矩阵的形式呈现,生成矩阵的每行向量长度需要一致。由于每条训练样本进行切词处理后生成词语的个数可以是一样的,也可以是不一样的,因此对于个数不一样的,为了保证生成矩阵每行的向量长度一致,可以用0将向量长度较短的某行向量补齐,从而可以使得每行的向量长度一致,样本矩阵的每行向量对应为每条训练样本切词 处理后得到的词语所对应的参数。
然后,计算样本矩阵dataMat的协方差矩阵X,以及计算样本矩阵dataMat的特征值,并根据特征值生成对角矩阵D,对角矩阵D是一个(p,p)维的对角矩阵,包含了特征值λ 1,λ 2,......λ p
此时,协方差矩阵X可以通过奇异值分解(Singular value decomposition,SVD)计算转换矩阵P,其计算公式如下:
X=PDP T
P是一个(p,p)维的正交矩阵,该正交矩阵即为转换矩阵P,转换矩阵P的每一列都是协方差矩阵X的特征向量。通过SVD可求解出转换矩阵P,将转换矩阵P设定为样本矩阵dataMat(即映射关系)在预设空间上的投影关系。该预设空间可以是主成分空间,该主成分空间为对训练样本的词语所对应的参数。转换矩阵P在主成分空间的投影可以表示为:Y=dataMat×P,Y表示投影关系。
需要说明的是,投影关系也可以是只在样本矩阵dataMat的部分维度上进行的投影,若只使用部分维度top-j主成分,则投影之后的投影关系为:Y j=dataMat×P j,Y j表示部分投影关系,P j表示转换矩阵P的部分维度组成的矩阵。例如,P j可以是转换矩阵P的前j列,也就是说P j是一个(p,j)维的矩阵,Y j是一个(n,j)维的矩阵。
在某些实施方式中,可以根据转换矩阵及投影关系,通过拉回映射从主成分空间映射到原始空间,生成逆映射关系,即可根据逆映射关系确定参数对应的词语。通过拉回映射重构之后得到的逆映射关系是:R j=Y j×(P j) T,R j是使用部分维度top-j的主成分,进行重构之后形成的逆映射关系是一个(n,p)维的矩阵。
在一个实施例中,在确定映射关系及投影关系后,根据文本聚类列表生成训练模型的步骤可以包括:根据映射关系、投影关系及文本聚类列表生成训练模型。即将词语集与参数集之间的映射关系(可以是样本矩阵)、映射关系在预设空间上的投影关系(可以是转换矩阵)及文本聚类列表生成的训练模型进行存储。
在确定词语对应的参数后,计算机设备可以根据参数及训练模型确定文本信息的特征信息,该特征信息可以包括文本信息在文本聚类列表中所属的 类别、类别对应的文本数量、以及文本信息与文本聚类列表中训练样本之间相似度等,该特征信息还可以根据实际需要进行灵活设置,具体内容在此处不作限定。
在一个实施例中,根据参数及预置的训练模型确定文本信息的特征信息的步骤可以包括:根据参数、训练模型中的投影关系及训练模型中的文本聚类列表确定文本信息的特征信息。
在某些实施方式中,根据参数、训练模型中的投影关系及训练模型中的文本聚类列表确定文本信息的特征信息的步骤可以包括:
根据投影关系将参数在预设空间上进行投影处理,生成投影参数;
获取投影参数与文本聚类列表所在聚类区域的质心之间的最短距离;
根据最短距离确定文本信息在文本聚类列表中所属的类别、类别对应的文本数量、以及文本信息与文本聚类列表中训练样本之间相似度。
首先将词语对应的参数按照确定的投影关系,在预设空间(例如,主成分空间)上进行投影,生成投影参数。以及,获取文本聚类列表在聚类区域内进行投影生成的质心,该质心可以是一个或者是多个。
然后,计算投影参数与该质心之间距离,该距离可以是欧式距离、切比雪夫距离或汉明距离等,具体内容在此处不作限定。再确定投影参数与质心之间的最短距离,例如,当只存在一个质心时,该质心与投影参数之间的距离即为最短距离;当存在多个质心时,从多个质心与投影参数之间的距离中取最短距离。
某个质心与投影参数之间的距离越短,说明该某个质心对应的文本聚类列表中的训练样本,与待识别的文本信息之间的相似度越高。在确定最短距离后,可以根据最短距离确定文本信息在文本聚类列表中所属的类别、类别对应的文本数量、以及文本信息与文本聚类列表中训练样本之间相似度等。
在一个实施例中,为了降低计算的复杂度,可以将多条训练样本分配至多个文本库中,然后,分别对每个文本库中的每条训练样本进行切词、聚类等处理,生成每个文本库对应的训练模型,后续再根据每个文本库中的训练模型对文本信息进行识别。
在步骤S105中,根据特征信息识别文本信息所属的模板化文本的类型。
在确定文本信息的特征信息后,可以根据特征信息得到对文本信息的识 别结果,如图3所示,即识别出文本信息所属的模板化文本的类型,可以根据文本信息所属的模板化文本的类型确定是否将该文本信息拦截。例如,模板化文本可以包括多种类型,当文本信息属于其中的任意一种类型时,可以将该文本信息进行拦截;当文本信息不属于其中的任意一种类型时,可以将该文本信息进行转发至对应的终端。
需要说明的是,模板化文本可以包括第一种类型和第二种类型,第一种类型为不良信息的模板化文本,第二种类型为正常的模板化文本。当文本信息属于第一种类型时,可以将该文本信息进行拦截;当文本信息属于第二种类型时,可以将该文本信息进行转发至对应的终端。
由于文本信息是社交平台的主要信息载体,同时也是黑色产业传播不良信息的主要渠道,黑色产业主要使用自动机生成模版化文本自动发送,因此,为了拦截黑色产业发送的推销产品的信息、色情信息等不良信息,可以使用计算机设备根据训练模型对接收到的文本信息进行识别,以便拦截不良信息。
由上述可知,本申请实施例提供的文本信息处理方法,通过预设的切词规则对接收到的文本信息进行切词处理,生成至少一个词语,并获取至少一个词语对应的参数,每个参数标识一个词语;然后,根据得到的参数及预置的训练模型确定文本信息的特征信息,训练模型由至少一个类型的模板化文本训练而成,再根据特征信息识别文本信息所属的模板化文本的类型。由于该方案在整个过程中不需要进行词性分析,因此,可以使得识别结果不会受到词语变种、标点符号、和/或其他字符等干扰信息的干扰,从而提高了对文本信息进行识别的准确性。
根据上述实施例所描述的方法,以下将举例作进一步详细说明。
首先,本申请实施例提供文本信息处理方法,计算机设备可以预先将获取到的多条训练样本分配至多个文本库中,然后,分别对多个文本库中的每条训练样本进行切词及聚类等处理,生成每个文本库对应的子训练模型。最后,在接收到待识别的文本信息时,可以根据每个文本库对应的子训练模型对文本信息进行识别。
请参阅图5,图5为本申请实施例提供的文本信息处理方法的流程示意图。该方法流程可以包括:
步骤S201、获取模板化文本对应的多条训练样本,将多条训练样本分配 至多个文本库。
由于当训练样本的条数增加时,每条训练样本的进行切词处理生成的词语个数增多,对应生成参数的个数也相应增多,通过算法对参数进行处理,生成训练模型过程中,其计算复杂度较大。例如,根据词语集与参数集之间的映射关系生成的n*p维样本矩阵,当训练样本的条数n增加时,样本矩阵dataMat的维度p也会增加,使得SVD算法的复杂度增大。因此,本实施例中,采用Boosting SVD算法,将多条训练样本分配至多个文本库,分别对每个文本库中的文本信息进行处理。例如,对每个库分别通过SVD算法进行计算,由此可以大大降低了计算复杂度。
Boosting SVD算法是集合分类Boosting算法与SVD算法的结合,Boosting算法是一种用来提高弱分类算法准确度的算法,这种算法通过构造一个预测函数系列,然后以一定的方式将预测函数系列组合成一个预测函数。也就是说,Boosting算法也是一种框架算法,主要是通过对样本集的操作获得样本子集,然后用弱分类算法在样本子集上训练生成一系列的基分类器。正是借用Boosting算法的思维,本实施例将多条训练样本分配至多个文本库中,然后,分别对每个文本库中的训练样本进行切词及聚类等处理,生成每个文本库对应的子训练模型,再利用每个文本库对应的子训练模型对文本信息进行识别。
在获取到模板化文本对应的多条训练样本后,可以将多条训练样本分配至多个文本库,如图6所示,多个文本库可以包括文本库1至文本库n,n为整数,且n>1。为了有针对性地进行训练,可以是从色情信息、卖药信息、传销信息等不同场景的历史文本信息中抽取多条训练样本,还可以是根据不同场景制造出模板化文本对应的多条训练样本。在一个实施例中,训练样本的条数及获取方式可以根据实际需要进行灵活设置,具体内容在此处不作限定。
每个文本库中训练样本可以是随机分配的,也可以是根据不同场景的模板化文本进行分配的,例如,文本库1分配的是色情信息对应的训练样本,文本库2分配的是卖药信息对应的训练样本等,具体内容在此处不作限定。
步骤S202、对每个文本库的每条训练样本分别进行第一预处理,获取每个文本库分别对应的映射关系、投影关系及小类列表。
该第一预处理包括切词处理、获取词语对应的参数及聚类处理等。首先,按照预设的切词规则将每个文本库的每条训练样本分别进行切词处理,生成 每每个文本库对应的词语集,此处的切词规则与前述提及的切词规则是一致的,此处不赘述。
然后,获取每个文本库中词语集对应的参数集,如图6中的参数集1至参数集n。词语集对应的参数集的获取方式,可以是通过tf-idf算法计算得到每个词语的词频tf i,j及逆向文本频率idf i,再根据词频tf i,j及逆向文本频率idf i计算该词语对应的参数,其计算方式与前述计算方式类似,此处不再赘述。在计算得到每个文本库对应的每个词语的参数后,可以生成每个文本库对应的参数集。
词语集中的每个词语与参数集中的每个参数之间可以形成一一对应的映射关系,即每个文本库中对应的词语集与参数集均可形成映射关系。
在得到每个文本库对应的参数集后,可以根据每个文本库的参数集,分别对每个文本库中的多条训练样本进行文本聚类,生成小类列表,如图6所示。该文本聚类可以包括K-means聚类算法或BIRCH聚类算法等,具体内容在此处不作限定。每个小类列表可以包括一种类型的聚类文本形成的一个列表,或者是包括多种类型的聚类文本形成对应的多个列表。
其次,对每个文本库中词语集与参数集之间的映射关系进行变换处理,生成映射关系在预设空间上的投影关系。针对每个文本库对应的该投影关系的计算方式与前述计算方式类似,此处不再赘述。
需要说明的是,投影关系的计算采用的Boosting SVD算法,即针对每个文本库中均采用SVD算法进行计算,这样在SVD计算阶段大大降低了计算复杂度,而通过Boosting算法又使每个文本库对应的多个SVD结果生成一个统一的结果,加强了精确度。Boosting SVD算法可以有效解决SVD在大数据上准确度下降、计算复杂度高等问题,提高了计算准确率及降低了复杂度低。
步骤S203、根据映射关系、投影关系及小类列表生成每个文本库对应的子训练模型。
在确定每个文本库对应的词语集与参数集之间的映射关系、映射关系在预设空间上的投影关系及小类列表后,可以根据映射关系、投影关系及小类列表生成每个文本库对应的子训练模型,如图6所示,例如,可以生成子训练模型1至子训练模型n,n为整数,且n>1。
步骤S204、接收待识别的文本信息,对文本信息进行第二预处理。
第二预处理包括切词处理及获取词语对应的参数等,计算机设备接收待识别的文本信息,该文本信息可以是平板电脑、手机、电脑等终端,通过发给另一个终端的信息等。该文本信息可以包括中文、英文、标点符号或表情等信息,具体内容在此处不作限定。
例如,终端A通过计算机设备向终端B发送一封邮件,此时计算机设备接收该邮件,并对该邮件中包含的文本信息进行第二预处理。又例如,终端C通过计算机设备向多个终端1至终端n(其中n为大于2的整数)发送推广信息,此时计算机设备接收该推广信息,并对推广信息进行第二预处理。
如图7所示,首先,计算机设备按照预设的切词规则,对接收到的待识别文本信息进行切词处理,生成至少一个词语。可以是只生成词语1,也可以是生成词语1至词语n等,n为整数,且n>1。
该词语可以是由一个中文字组成,也可以是由多个字及其他符号组成,还可以是由英文组成。在一个实施例中,在实际应用中,该词语可以包括变种的词语,具体内容在此处不作限定。该切词规则与前述提及的切词规则类似,此处不再赘述。
然后,获取每个词语对应的参数,在一个实施例中,计算机设备通过计算获取词语对应的参数:通过tf-idf算法计算得到每个词语的词频tf i,j及逆向文本频率idf i,再根据词频tf i,j及逆向文本频率idf i计算该词语对应的参数,其计算方式与前述计算方式类似,此处不再赘述。
或者是,计算机设备可以根据每个文本库对应的子训练模型中的映射关系获取词语对应的参数。
步骤S205、根据每个文本库对应的子训练模型,确定文本信息对应的大类列表,根据大类列表确定文本信息的特征信息。
在确定每个词语对应的参数后,计算机设备可以根据每个文本库对应的子训练模型中的投影关系、小类列表等,以及每个词语对应的参数确定文本信息对应的大类列表,如图7所示。该大类列表为文本信息在文本库1至文本库n中进行聚类,得到在文本库1至文本库n中分别所属的类别1至类别n,并由类别1至类别n组成的列表,n为整数,且n>1。使得待识别的文本信息都有与每个文本库的小类列表的聚类结果,并对每个文本库的小类列表的聚类结果进行排序,得到大类列表。
将每个词语对应的参数与按照每个文本库对应的投影关系,在预设空间上进行投影,生成投影参数。以及,获取每个文本库对应的小类列表在聚类区域内进行投影生成的质心。计算每个文本库对应的投影参数与该质心之间的最短距离,根据每个文本库对应的最短距离确定文本信息,在每个文本库对应的小类列表中所属的类别。根据每个文本库对应的类别生成大类列表,然后,根据大类列表确定文本信息的特征信息,该特征信息包括文本信息在大类列表中所属的类别、类别对应的文本数量、以及文本信息与小列表中训练样本之间相似度等。
步骤S206、根据特征信息识别文本信息所属的模板化文本的类型。
在确定文本信息的特征信息后,可以根据特征信息得到对文本信息的识别结果,如图7所示,即识别出文本信息所属的模板化文本的类型。
现有技术中,除了相应对接收到的文本信息进行词性分析,导致对文本信息识别的准确性并不高之外,在训练阶段需要对训练样本进行切词及词性分析等特征提取,然后,需要人工给每一条训练样本标注其主题,之后再给模型(例如,深度神经元网络)进行训练。由于需要人工为训练样本标注主题,因此,人工收集大量待标注主题的文本信息十分困难,而且由于变种词语出现频率较快,需要一直持续的收集,耗费大量的人力。另外,由于黑色产业的对抗,文本信息中含有大量干扰信息,文本信息也多呈现短文本形式,这为切词与词性分析带来巨大的困难,也会降低词性分析的准确度。
本申请实施例中训练模型是无监督的机器学习的训练模型,在训练阶段采取一种Boosting SVD算法对训练样本进行切词、聚类等处理,这样每种模板化文本的训练样本将被分别聚到一起,生成训练模型。后续在接收到待识别的文本信息时,用Boosting SVD算法对待识别的文本信息进行处理,可以根据待识别的文本信息的特征信息自动识别出文本信息所属的模版化文本的类型。一方面,无需对进行词性分析,聚类效果不受切词的结果、文本长度、以及干扰信息等影响,该方案在长文本信息和短文本信息上同样适用,通用性及稳定性强,识别准确性高;另一方面,无需人工标注,大大减轻了人力成本;从而解决了现有技术中需要耗费大量的人力及识别准确度低等问题。
为便于更好的实施本申请实施例提供的文本信息处理方法,本申请实施例还提供一种基于上述文本信息处理方法的装置。其中名词的含义与上述文 本信息处理的方法中相同,具体实现细节可以参考方法实施例中的说明。
请参阅图8,图8为本申请实施例提供的计算机设备的结构示意图,其中所述计算机设备可以包括接收单元301、第一切词单元302、参数获取单元303、确定单元304及识别单元305等。
接收单元301,用于接收待识别的文本信息。
本实施例中,文本信息处理方法可以应用在电子邮件、即时通讯(例如,微信、QQ等)、博客、朋友圈、信息推送及直播弹幕等,需要对终端发送的文本信息进行识别的场景。
接收单元301接收待识别的文本信息,该文本信息可以是平板电脑、手机、电脑等终端,通过电子邮件发送的信息、通过即时通讯发送的信息、通过博客发表的信息、通过弹框显示的推送信息、通过朋友圈发表的信息的及通过直播弹幕显示的信息等。该文本信息可以包括中文、英文、标点符号或表情等信息,具体内容在此处不作限定。
第一切词单元302,用于按照预设的切词规则对接收单元301接收到的文本信息进行切词处理,生成至少一个词语。
第一切词单元302按照预设的切词规则,对接收单元301接收到的待识别文本信息进行切词处理,该预设的切词规则可以是按照每间隔预设字数进行切词,例如,每间隔2个字切为一个词语,或者是每间隔1个字切为一个词语。该预设的切词规则也可以是按照文本信息的总字数进行均匀切词,例如,当某条文本信息的总字数为15个时,可以均分每隔5个字切为一个词语。该预设的切词规则还可以是随机切词,例如,当某条文本信息的总字数为15个时,从中仅提取出3组2个字组成的词语。或者是,将总字数为15个的文本信息,切割为一个2个字组成的词语,一个1个字组成的词语,一个9个字组成的词语,以及一个3个字组成的词语。
在一个实施例中,该预设的切词规则可根据实际需要进行灵活设置,例如,基于字典的切词、基于统计的切词或基于人工智能的切词等,具体内容在此处不作限定。
需要说明的是,对待识别的文本信息进行切词时,若需要保证切得的词语与映射关系中存储的词语一致,此时,可以根据映射关系确定对待识别文本信息的切词规则,该映射关系为词语集与参数集之间的映射关系。例如, 多条训练样本中存在某条训练样本“一一二二三三”每隔两个字的切词规则,得到“一一”、“二二”及“三三”,对于接收到的待识别的文本信息“一一一二二三三”,可以切为“一”、“一一”、“二二”及“三三”,这样就可以保证得到的“一一”、“二二”及“三三”与映射关系中存储的一致。
对文本信息进行切词处理后,可以生成至少一个词语,如图3所示,可以是只生成词语1,也可以是生成词语1至词语n等,n为整数,且n>1。该词语可以是由一个中文字组成,也可以是由多个字及其他符号组成,还可以是由英文组成。在一个实施例中,在实际应用中,该词语可以包括变种的词语,具体内容在此处不作限定。变种的词语是指采用有异于规范词语表达的词语,例如,规范词语为“美女”,对应变种的词语为“渼汝”等。
需要说明的是,第一切词单元302可以是实时或每隔预设时间对接收单元301接收到的文本信息进行切词处理,或者是抽样对接收单元301接收到预设数量的文本信息进行切词处理。
参数获取单元303,用于获取至少一个词语对应的参数,每个参数标识一个词语。
在第一切词单元302对文本信息进行切词处理,生成一个或多个词语后,参数获取单元303可以获取一个词语对应的参数,或分别获取多个词语对应的参数,图3中,每个词语对应一个参数。每个参数标识一个词语,该参数可以是一个数字,也可以是唯一标识词语的字符等。例如,“我们”对应的参数为0.1,“我”对应的参数为0.5。
在某些实施方式中,计算机设备预先存储有训练模型,该训练模型包括词语与参数之间的映射关系,参数获取单元303具体用于,根据训练模型中的映射关系获取至少一个词语对应的参数。
在某些实施方式中,参数获取单元303通过计算获取词语对应的参数:首先,获取词语在待识别的文本信息中存在的目标频率,该目标频率即为该词语在待识别的文本信息中存在的频率,例如,对于在某条待识别的文本信息Q中的词语q,词语q在该条待识别的文本信息Q中存在的目标频率的计算公式为:Y=M/X,Y表示词语q在待识别的文本信息Q中的目标频率,M表示词语q在待识别的文本信息Q中出现的次数,X表示在待识别的文本信息Q中所有词语出现的次数之和。
以及,获取在预设时间段内接收到的多条文本信息中,包含该词语的文本信息在该多条文本信息的目标逆向文本频率,该目标逆向文本频率为该词语的文本信息在该多条文本信息的逆向文本频率,其计算公式为:S=log(R/T),S表示目标逆向文本频率,R表示多条文本信息的总数目,T表示包含词语a的目标文本信息的数目,log为对数函数。然后,根据目标频率及目标逆向文本频率生成该词语对应的参数,其计算公式为:H=Y×S。
需要说明的是,参数获取单元303也可以优先根据映射关系获取至少一个词语对应的参数,当该映射关系中不存在至少一个词语对应的参数时,再根据目标频率及目标逆向文本频率计算词语对应的参数。
确定单元304,用于根据参数获取单元303获取到的参数及预置的训练模型确定文本信息的特征信息,训练模型由至少一个类型的模板化文本训练而成。
计算机设备预先设置有训练模型,该训练模型由至少一个类型的模板化文本训练而成。例如,该训练模型由色情信息、卖药信息、投资信息、传销信息等类型中的至少一个类型的模板化文本训练而成。
模板化文本可以为包括变量及模板部分等的文本信息。例如,“看渼汝,你好=丫丫丫丫D有福利”,“看小姐,你好=丫丫丫丫V有福利”,“看小姐,你好=丫丫丫丫E有福利”,这三条文本信息中,可以是由“看[渼汝|小姐],你好=丫丫丫丫[D|V|E]有福利”组成的模板化文本,变量为“渼汝”或“小姐”,以及变量为“D”或“V”或“E”,模板部分为“看,你好=丫丫丫丫有福利”。
在某些实施方式中,如图9所示,计算机设备还包括:
样本获取单元306,用于获取模板化文本对应的多条训练样本;
第二切词单元307,用于按照切词规则将样本获取单元306获取到的每条训练样本分别进行切词处理,生成包含多个词语的词语集;
处理单元308,用于对第二切词单元307生成的词语集进行预处理,生成参数集,参数集中的每个参数用于标识词语集中的每个词语;
聚类单元309,用于根据处理单元308生成的参数集对多条训练样本进行聚类处理,生成文本聚类列表;
生成单元310,用于根据聚类单元309生成的文本聚类列表生成训练模型。
为了有针对性地进行训练,样本获取单元306获取模板化文本对应的多条 训练样本的方式,可以从接收到的历史文本信息中,随机获取模板化文本对应的多条训练样本,也可以是从色情信息、卖药信息、传销信息等不同场景的历史文本信息中抽取多条训练样本,还可以是根据不同场景制造出模板化文本对应的多条训练样本。在一个实施例中,训练样本的条数及获取方式可以根据实际需要进行灵活设置,具体内容在此处不作限定。
在样本获取单元306获取到多条训练样本后,第二切词单元307按照预设的切词规则将每条训练样本分别进行切词处理,该预设的切词规则可以使用任何切词算法,为了提高对文本信息进行处理的可靠性,该预设的切词规则与前述提到的对文本信息进行切词处理的切词规则是一致的,此处不赘述。
第二切词单元307对多条训练样本进行切词处理后,可以生成包含多个词语的词语集,如图4所示。还可以是每条训练样本对应词语集1至词语集n(n>1),组成多条训练样本对应的词语集,词语集1至词语集n中包含的词语可以是一个或多个,n为整数,且n>1。
例如,当100条训练样本中,若每条训练样本均提取出一个词语,则可以生成包含100个词语的词语集;若每条训练样本均切为6个词语,则可以生成包含600个词语的词语集。
然后,处理单元308对得到的词语集进行预处理,生成参数集,如图4所示,参数集中的每个参数用于标识词语集中的每个词语。还可以是每条训练样本对应词语集1至词语集n,分别对应的参数集1至参数集n,组成多条训练样本对应的参数集,n为整数,且n>1。
在一个实施例中,处理单元308具体用于,获取词语集中每个词语在每条训练样本中存在的频率,以及包含词语的目标训练样本在多条训练样本中的逆向文本频率;根据频率及逆向文本频率生成每个词语对应的目标参数;根据每个词语对应的目标参数生成参数集。
处理单元308对词语集进行预处理包括对词语集进行加权算法(term frequency–inverse document frequency,tf-idf)转换,该tf-idf是一种用于信息检索与文本挖掘的加权技术,可以用来评估一个词语对于一条文本信息,或对于多条训练样本中的其中一条训练样本的重要程度。词语的重要性随着它在文本信息中出现的次数成正比增加,随着它在多条训练样本中出现的频率成反比下降。
tf-idf中的tf表示词频,在一份给定的文件里,词频(term frequency,tf)指的是某一个给定的词语在该文件中出现的频率,即本实施例中一个词语在一条训练样本中存在的频率。tf-idf中的idf表示逆向文本频率,是对词语的数量(即出现次数)进行归一化,由于同一个词语在较长的文件里可能会比较短的文件里有更高的词数,而不管该词语重要与否,因此,逆向文本频率以防止词数偏向较长的文件。
逆向文本频率(inverse document frequency,idf)是一个词语普遍重要性的度量。对于在某条训练样本dj中的词语ti,其在该条训练样本dj中存在的频率(即词频)的计算公式为:
Figure PCTCN2018114188-appb-000004
以上式子中,tf i,j表示词语ti在训练样本dj中的词频,n i,j表示词语ti在训练样本dj中出现的次数,∑ kn k,j表示在训练样本dj中所有词语出现的次数之和。例如,当将训练样本dj切为3个词语时,k=3,∑ kn k,j表示在训练样本dj中这3个词语出现的次数之和。
对于词语的ti,包含词语ti的目标训练样本在多条训练样本中的逆向文本频率,可以由多条训练样本的总数目,除以包含该词语ti的目标训练样本的数目,再将得到的商取对数得到,其计算公式如下:
Figure PCTCN2018114188-appb-000005
idf i表示逆向文本频率,|D|表示多条训练样本的总数目,|{j:t i∈d j}|表示包含词语ti的目标训练样本的数目(即n i,j!=0的训练样本数目)。
由于如果该词语ti不在多条训练样本中,就会导致分母为零,因此,可以使用以下计算公式:
Figure PCTCN2018114188-appb-000006
在得到词语ti在某条训练样本dj中存在的频率tf i,j,以及逆向文本频率idf i后,处理单元308可以根据该频率tf i,j及逆向文本频率idf i计算该词语对应的目标参数a,其计算公式为:a=tf i,j×idf i
按照上述方法计算词语集中每个词语在每条训练样本中存在的频率,以 及包含词语的目标训练样本在多条训练样本中的逆向文本频率后,可以根据频率及逆向文本频率生成每个词语对应的目标参数,然后根据每个词语对应的目标参数生成参数集。
词语集中的每个词语与参数集中的每个参数之间可以形成一一对应的映射关系。该映射关系可以理解为字典,在对待识别的文本信息进行切词处理得到至少一个词语后,可以在该字典中查找该至少一个词语对应的参数,而不需要重新计算。或者是,当该字典中不存在某个词语对应的参数时,需要根据前述的tf-idf转换公式计算这个词语对应的参数。
在得到参数集后,聚类单元309可以根据参数集对多条训练样本进行聚类处理,该聚类处理可以包括K-means聚类算法或或层次聚类算法(Balanced Iterative Reducing and Clustering using Hierarchies,BIRCH)等,具体内容在此处不作限定。
聚类单元309根据参数集对多条训练样本进行聚类处理后,可以生成文本聚类列表,图4中,该文本聚类列表中可以包括一种类型的聚类文本形成的一个列表,或者是包括多种类型的聚类文本形成对应的多个列表,每个列表包含一种类型的聚类文本。最后,生成单元310可以根据文本聚类列表生成训练模型,如图4所示。
在一个实施例中,如图10所示,计算机设备还包括:
变换单元311,用于对词语集与参数集之间的映射关系进行变换处理,生成映射关系在预设空间上的投影关系;
在一个实施例中,变换单元311具体用于,根据映射关系生成样本矩阵,其中样本矩阵的每行向量为每条训练样本切词处理后得到的词语对应的参数;
获取样本矩阵的协方差矩阵,以及获取样本矩阵的特征值,根据特征值生成对角矩阵;
根据协方差矩阵及对角矩阵生成转换矩阵,将转换矩阵设定为投影关系。
首先,变换单元311将词语集与参数集之间的映射关系转变为n*p维的样本矩阵dataMat,样本矩阵的行数n表示训练样本的条数,样本矩阵的列数p表示每条训练样本进行切词处理后生成词语的个数。
需要说明的是,为了能够使得映射关系以矩阵的形式呈现,生成矩阵的 每行向量长度需要一致。由于每条训练样本进行切词处理后生成词语的个数可以是一样的,也可以是不一样的,因此对于个数不一样的,为了保证生成矩阵每行的向量长度一致,可以用0将向量长度较短的某行向量补齐,从而可以使得每行的向量长度一致,样本矩阵的每行向量对应为每条训练样本切词处理后得到的词语所对应的参数。
然后,计算样本矩阵dataMat的协方差矩阵X,以及计算样本矩阵dataMat的特征值,并根据特征值生成对角矩阵D,对角矩阵D是一个(p,p)维的对角矩阵,包含了特征值λ 1,λ 2,......λ p
此时,协方差矩阵X可以通过奇异值分解(Singular value decomposition,SVD)计算转换矩阵P,其计算公式如下:
X=PDP T
P是一个(p,p)维的正交矩阵,该正交矩阵即为转换矩阵P,转换矩阵P的每一列都是协方差矩阵X的特征向量。通过SVD可求解出转换矩阵P,将转换矩阵P设定为样本矩阵dataMat(即映射关系)在预设空间上的投影关系。该预设空间可以是主成分空间,该主成分空间为对训练样本的词语所对应的参数。转换矩阵P在主成分空间的投影可以表示为:Y=dataMat×P,Y表示投影关系。
需要说明的是,投影关系也可以是只在样本矩阵dataMat的部分维度上进行的投影,若只使用部分维度top-j主成分,则投影之后的投影关系为:Y j=dataMat×P j,Y j表示部分投影关系,P j表示转换矩阵P的部分维度组成的矩阵。例如,P j可以是转换矩阵P的前j列,也就是说P j是一个(p,j)维的矩阵,Y j是一个(n,j)维的矩阵。
在某些实施方式中,可以根据转换矩阵及投影关系,通过拉回映射从主成分空间映射到原始空间,生成逆映射关系,即可根据逆映射关系确定参数对应的词语。通过拉回映射重构之后得到的逆映射关系是:R j=Y j×(P j) T,R j是使用部分维度top-j的主成分,进行重构之后形成的逆映射关系是一个(n,p)维的矩阵。
在一个实施例中,生成单元310具体用于,根据映射关系、投影关系及文本聚类列表生成训练模型。即将词语集与参数集之间的映射关系(可以是样本矩阵)、映射关系在预设空间上的投影关系(可以是转换矩阵)及文本聚类 列表生成的训练模型进行存储。
在确定词语对应的参数后,确定单元304可以根据参数及训练模型确定文本信息的特征信息,该特征信息可以包括文本信息在文本聚类列表中所属的类别、类别对应的文本数量、以及文本信息与文本聚类列表中训练样本之间相似度等,该特征信息还可以根据实际需要进行灵活设置,具体内容在此处不作限定。
在一个实施例中,如图11所示,确定单元304包括:确定子单元3041,用于根据参数、训练模型中的投影关系及训练模型中的文本聚类列表确定文本信息的特征信息。
在某些实施方式中,确定子单元3041具体用于,根据投影关系将参数在预设空间上进行投影处理,生成投影参数;
获取投影参数与文本聚类列表所在聚类区域的质心之间的最短距离;
根据最短距离确定文本信息在文本聚类列表中所属的类别、类别对应的文本数量、以及文本信息与文本聚类列表中训练样本之间相似度。
首先确定子单元3041将词语对应的参数按照确定的投影关系,在预设空间(例如,主成分空间)上进行投影,生成投影参数。以及,获取文本聚类列表在聚类区域内进行投影生成的质心,该质心可以是一个或者是多个。
然后,确定子单元3041计算投影参数与该质心之间距离,该距离可以是欧式距离、切比雪夫距离或汉明距离等,具体内容在此处不作限定。再确定投影参数与质心之间的最短距离,例如,当只存在一个质心时,该质心与投影参数之间的距离即为最短距离;当存在多个质心时,从多个质心与投影参数之间的距离中取最短距离。
某个质心与投影参数之间的距离越短,说明该某个质心对应的文本聚类列表中的训练样本,与待识别的文本信息之间的相似度越高。在确定最短距离后,可以根据最短距离确定文本信息在文本聚类列表中所属的类别、类别对应的文本数量、以及文本信息与文本聚类列表中训练样本之间相似度等。
在一个实施例中,为了降低计算的复杂度,可以将多条训练样本分配至多个文本库中,然后,分别对每个文本库中的每条训练样本进行切词、聚类等处理,生成每个文本库对应的训练模型,后续再根据每个文本库中的训练模型对文本信息进行识别。
识别单元305,用于根据确定单元304得到的特征信息识别文本信息所属的模板化文本的类型。
在确定文本信息的特征信息后,识别单元305可以根据特征信息得到对文本信息的识别结果,如图3所示,即识别单元305识别出文本信息所属的模板化文本的类型,可以根据文本信息所属的模板化文本的类型确定是否将该文本信息拦截。例如,模板化文本可以包括多种类型,当文本信息属于其中的任意一种类型时,可以将该文本信息进行拦截;当文本信息不属于其中的任意一种类型时,可以将该文本信息进行转发至对应的终端。
需要说明的是,模板化文本可以包括第一种类型和第二种类型,第一种类型为不良信息的模板化文本,第二种类型为正常的模板化文本。当文本信息属于第一种类型时,可以将该文本信息进行拦截;当文本信息属于第二种类型时,可以将该文本信息进行转发至对应的终端。
由于文本信息是社交平台的主要信息载体,同时也是黑色产业传播不良信息的主要渠道,黑色产业主要使用自动机生成模版化文本自动发送,因此,为了拦截黑色产业发送的推销产品的信息、色情信息等不良信息,可以使用计算机设备根据训练模型对接收到的文本信息进行识别,以便拦截掉不良信息。
由上述可知,本申请实施例提供的计算机设备,第一切词单元302通过预设的切词规则对接收单元301接收到的文本信息进行切词处理,生成至少一个词语,并由参数获取单元303获取至少一个词语对应的参数,每个参数标识一个词语;然后,确定单元304根据得到的参数及预置的训练模型确定文本信息的特征信息,训练模型由至少一个类型的模板化文本训练而成,再由识别单元305根据特征信息识别文本信息所属的模板化文本的类型。由于该方案在整个过程中不需要进行词性分析,因此,可以使得识别结果不会受到词语变种、标点符号、和/或其他字符等干扰信息的干扰,从而提高了对文本信息进行识别的准确性。
本申请实施例还提供一种服务器,其可以集成本申请实施例的计算机设备,如图12所示,其示出了本申请实施例所涉及的服务器的结构示意图,具体来讲:
该服务器可以包括一个或者一个以上处理核心的处理器401、一个或一个 以上计算机可读存储介质的存储器402、电源403和输入单元404等部件。本领域技术人员可以理解,图12中示出的服务器结构并不构成对服务器的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:
处理器401是该服务器的控制中心,利用各种接口和线路连接整个服务器的各个部分,通过运行或执行存储在存储器402内的软件程序和/或模块,以及调用存储在存储器402内的数据,执行服务器的各种功能和处理数据,从而对服务器进行整体监控。在一个实施例中,处理器401可包括一个或多个处理核心;优选的,处理器401可集成应用处理器和调制解调处理器,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。在一个实施例中,上述调制解调处理器也可以不集成到处理器401中。
存储器402可用于存储软件程序以及模块,处理器401通过运行存储在存储器402的软件程序以及模块,从而执行各种功能应用以及数据处理。存储器402可主要包括存储程序区和存储数据区,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据服务器的使用所创建的数据等。此外,存储器402可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器402还可以包括存储器控制器,以提供处理器401对存储器402的访问。
服务器还包括给各个部件供电的电源403,优选的,电源403可以通过电源管理系统与处理器401逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源403还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。
该服务器还可包括输入单元404,该输入单元404可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。
尽管未示出,服务器还可以包括显示单元等,在此不再赘述。具体在本实施例中,服务器中的处理器401会按照如下的指令,将一个或一个以上的应 用程序的进程对应的可执行文件加载到存储器402中,并由处理器401来运行存储在存储器402中的应用程序,从而实现各种功能,如下:
接收待识别的文本信息;按照预设的切词规则对文本信息进行切词处理,生成至少一个词语;获取至少一个词语对应的参数,每个参数标识一个词语;根据参数及预置的训练模型确定文本信息的特征信息,训练模型由至少一个类型的模板化文本训练而成;根据特征信息识别文本信息所属的模板化文本的类型。
在一个实施例中,该处理器401还可以用于,获取模板化文本对应的多条训练样本;按照切词规则将每条训练样本分别进行切词处理,生成包含多个词语的词语集;对词语集进行预处理,生成参数集,参数集中的每个参数用于标识词语集中的每个词语;根据参数集对多条训练样本进行聚类处理,生成文本聚类列表;根据文本聚类列表生成训练模型。
在一个实施例中,该处理器401还可以用于,获取词语集中每个词语在每条训练样本中存在的频率,以及包含词语的目标训练样本在多条训练样本中的逆向文本频率;根据频率及逆向文本频率生成每个词语对应的目标参数;根据每个词语对应的目标参数生成参数集。
在一个实施例中,该处理器401还可以用于,对词语集与参数集之间的映射关系进行变换处理,生成映射关系在预设空间上的投影关系;根据文本聚类列表生成训练模型的步骤包括:根据映射关系、投影关系及文本聚类列表生成训练模型。
在一个实施例中,该处理器401还可以用于,根据参数、训练模型中的投影关系及训练模型中的文本聚类列表确定文本信息的特征信息。
在一个实施例中,该处理器401还可以用于,根据投影关系将参数在预设空间上进行投影处理,生成投影参数;获取投影参数与文本聚类列表所在聚类区域的质心之间的最短距离;根据最短距离确定文本信息在文本聚类列表中所属的类别、类别对应的文本数量、以及文本信息与文本聚类列表中训练样本之间相似度。
在一个实施例中,该处理器401还可以用于,根据映射关系生成样本矩阵,其中样本矩阵的每行向量为每条训练样本切词处理后得到的词语对应的参数;获取样本矩阵的协方差矩阵,以及获取样本矩阵的特征值,根据特征值 生成对角矩阵;根据协方差矩阵及对角矩阵生成转换矩阵,将转换矩阵设定为投影关系。
在一个实施例中,该处理器401还可以用于,根据训练模型中的映射关系获取至少一个词语对应的参数。
由上述可知,本申请实施例提供的服务器,通过预设的切词规则对接收到的文本信息进行切词处理,生成至少一个词语,并获取至少一个词语对应的参数,每个参数标识一个词语;然后,根据得到的参数及预置的训练模型确定文本信息的特征信息,训练模型由至少一个类型的模板化文本训练而成,再根据特征信息识别文本信息所属的模板化文本的类型。由于该方案在整个过程中不需要进行词性分析,因此,可以使得识别结果不会受到词语变种、标点符号、和/或其他字符等干扰信息的干扰,从而提高了对文本信息进行识别的准确性。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见上文针对文本信息处理方法的详细描述,此处不再赘述。
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。
为此,本申请实施例提供一种存储介质,其中存储有多条指令,该指令能够被处理器进行加载,以执行本申请实施例所提供的任一种导航信息处理方法中的步骤。例如,该指令可以执行如下步骤:
接收待识别的文本信息;按照预设的切词规则对文本信息进行切词处理,生成至少一个词语;获取至少一个词语对应的参数,每个参数标识一个词语;根据参数及预置的训练模型确定文本信息的特征信息,训练模型由至少一个类型的模板化文本训练而成;根据特征信息识别文本信息所属的模板化文本的类型。
在一个实施例中,该指令可以执行如下步骤,获取模板化文本对应的多条训练样本;按照切词规则将每条训练样本分别进行切词处理,生成包含多个词语的词语集;对词语集进行预处理,生成参数集,参数集中的每个参数用于标识词语集中的每个词语;根据参数集对多条训练样本进行聚类处理, 生成文本聚类列表;根据文本聚类列表生成训练模型。
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
该存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、磁盘或光盘等。
由于该存储介质中所存储的指令,可以执行本申请实施例所提供的任一种文本信息处理方法中的步骤,因此,可以实现本申请实施例所提供的任一种文本信息处理方法所能实现的有益效果,详见前面的实施例,在此不再赘述。
以上对本申请实施例所提供的一种文本信息处理方法、装置及存储介质进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (20)

  1. 一种文本信息处理方法,该方法由计算机设备实施,所述方法包括:
    接收待识别的文本信息;
    按照预设的切词规则对所述文本信息进行切词处理,生成至少一个词语;
    获取所述至少一个词语对应的参数,每个参数标识一个词语;
    根据所述参数及预置的训练模型确定所述文本信息的特征信息,所述训练模型由至少一个类型的模板化文本训练而成;及
    根据所述特征信息识别所述文本信息所属的模板化文本的类型。
  2. 根据权利要求1所述的文本信息处理方法,其特征在于,所述根据所述参数及预置的训练模型确定所述文本信息的特征信息之前,所述方法还包括:
    获取所述模板化文本对应的多条训练样本;
    按照所述切词规则将每条训练样本分别进行切词处理,生成包含多个词语的词语集;
    对所述词语集进行预处理,生成参数集,所述参数集中的每个参数用于标识所述词语集中的每个词语;
    根据所述参数集对所述多条训练样本进行聚类处理,生成文本聚类列表;及
    根据所述文本聚类列表生成所述训练模型。
  3. 根据权利要求2所述的文本信息处理方法,其特征在于,所述对所述词语集进行预处理,生成参数集包括:
    获取所述词语集中每个词语在所述每条训练样本中存在的频率,以及包含所述词语的目标训练样本在所述多条训练样本中的逆向文本频率;
    根据所述频率及所述逆向文本频率生成所述每个词语对应的目标参数;及
    根据所述每个词语对应的所述目标参数生成所述参数集。
  4. 根据权利要求2所述的文本信息处理方法,其特征在于,所述对所述词语集进行预处理,生成参数集之后,所述方法还包括:
    对所述词语集与所述参数集之间的映射关系进行变换处理,生成所述映射关系在预设空间上的投影关系;
    所述根据所述文本聚类列表生成所述训练模型的步骤包括:及
    根据所述映射关系、所述投影关系及所述文本聚类列表生成所述训练模型。
  5. 根据权利要求4所述的文本信息处理方法,其特征在于,所述根据所述参数及预置的训练模型确定所述文本信息的特征信息包括:
    根据所述参数、所述训练模型中的投影关系及所述训练模型中的文本聚类列表确定所述文本信息的特征信息。
  6. 根据权利要求5所述的文本信息处理方法,其特征在于,所述根据所述参数、所述训练模型中的投影关系及所述训练模型中的文本聚类列表确定所述文本信息的特征信息包括:
    根据所述投影关系将所述参数在所述预设空间上进行投影处理,生成投影参数;
    获取所述投影参数与所述文本聚类列表所在聚类区域的质心之间的最短距离;及
    根据所述最短距离确定所述文本信息在所述文本聚类列表中所属的类别、所述类别对应的文本数量、以及所述文本信息与所述文本聚类列表中训练样本之间相似度。
  7. 根据权利要求4至6中任一项所述的文本信息处理方法,其特征在于,所述对映射关系进行变换处理,生成所述映射关系在预设空间上的投影关系包括:
    根据所述映射关系生成样本矩阵,其中所述样本矩阵的每行向量为每条训练样本切词处理后得到的词语对应的参数;
    获取所述样本矩阵的协方差矩阵,以及获取所述样本矩阵的特征值,根据所述特征值生成对角矩阵;及
    根据所述协方差矩阵及所述对角矩阵生成转换矩阵,将所述转换矩阵设定为所述投影关系。
  8. 根据权利要求4至6中任一项所述的文本信息处理方法,其特征在于,所述获取所述至少一个词语对应的参数包括:
    根据所述训练模型中的所述映射关系获取所述至少一个词语对应的参数。
  9. 一种计算机设备,包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行以下步骤:
    接收待识别的文本信息;
    按照预设的切词规则对所述文本信息进行切词处理,生成至少一个词语;
    获取所述至少一个词语对应的参数,每个参数标识一个词语;
    根据所述参数及预置的训练模型确定所述文本信息的特征信息,所述训练模型由至少一个类型的模板化文本训练而成;及
    根据所述特征信息识别所述文本信息所属的模板化文本的类型。
  10. 根据权利要求9所述的计算机设备,其特征在于,所述根据所述参数及预置的训练模型确定所述文本信息的特征信息之前,所述计算机可读指令被所述处理器执行时,使得所述处理器还执行以下步骤:
    获取所述模板化文本对应的多条训练样本;
    按照所述切词规则将每条训练样本分别进行切词处理,生成包含多个词语的词语集;
    对所述词语集进行预处理,生成参数集,所述参数集中的每个参数用于标识所述词语集中的每个词语;
    根据所述参数集对所述多条训练样本进行聚类处理,生成文本聚类列表;及
    根据所述文本聚类列表生成所述训练模型。
  11. 根据权利要求10所述的计算机设备,其特征在于,所述对所述词语集进行预处理,生成参数集包括:
    获取所述词语集中每个词语在所述每条训练样本中存在的频率,以及包含所述词语的目标训练样本在所述多条训练样本中的逆向文本频率;
    根据所述频率及所述逆向文本频率生成所述每个词语对应的目标参数;及
    根据所述每个词语对应的所述目标参数生成所述参数集。
  12. 根据权利要求10所述的计算机设备,其特征在于,所述对所述词语集进行预处理,生成参数集的步骤之后,所述计算机可读指令被所述处理器 执行时,使得所述处理器还执行以下步骤:
    对所述词语集与所述参数集之间的映射关系进行变换处理,生成所述映射关系在预设空间上的投影关系;
    所述根据所述文本聚类列表生成所述训练模型的步骤包括:及
    根据所述映射关系、所述投影关系及所述文本聚类列表生成所述训练模型。
  13. 根据权利要求12所述的计算机设备,其特征在于,所述根据所述参数及预置的训练模型确定所述文本信息的特征信息包括:
    根据所述参数、所述训练模型中的投影关系及所述训练模型中的文本聚类列表确定所述文本信息的特征信息。
  14. 根据权利要求13所述的计算机设备,其特征在于,所述根据所述参数、所述训练模型中的投影关系及所述训练模型中的文本聚类列表确定所述文本信息的特征信息包括:
    根据所述投影关系将所述参数在所述预设空间上进行投影处理,生成投影参数;
    获取所述投影参数与所述文本聚类列表所在聚类区域的质心之间的最短距离;及
    根据所述最短距离确定所述文本信息在所述文本聚类列表中所属的类别、所述类别对应的文本数量、以及所述文本信息与所述文本聚类列表中训练样本之间相似度。
  15. 一种非易失性的计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行以下步骤:
    接收待识别的文本信息;
    按照预设的切词规则对所述文本信息进行切词处理,生成至少一个词语;
    获取所述至少一个词语对应的参数,每个参数标识一个词语;
    根据所述参数及预置的训练模型确定所述文本信息的特征信息,所述训练模型由至少一个类型的模板化文本训练而成;及
    根据所述特征信息识别所述文本信息所属的模板化文本的类型。
  16. 根据权利要求15所述的存储介质,其特征在于,所述根据所述参数及预置的训练模型确定所述文本信息的特征信息之前,所述计算机可读指令被处理器执行时,使得所述处理器还执行以下步骤:
    获取所述模板化文本对应的多条训练样本;
    按照所述切词规则将每条训练样本分别进行切词处理,生成包含多个词语的词语集;
    对所述词语集进行预处理,生成参数集,所述参数集中的每个参数用于标识所述词语集中的每个词语;
    根据所述参数集对所述多条训练样本进行聚类处理,生成文本聚类列表;及
    根据所述文本聚类列表生成所述训练模型。
  17. 根据权利要求16所述的存储介质,其特征在于,所述对所述词语集进行预处理,生成参数集包括:
    获取所述词语集中每个词语在所述每条训练样本中存在的频率,以及包含所述词语的目标训练样本在所述多条训练样本中的逆向文本频率;
    根据所述频率及所述逆向文本频率生成所述每个词语对应的目标参数;及
    根据所述每个词语对应的所述目标参数生成所述参数集。
  18. 根据权利要求16所述的存储介质,其特征在于,所述对所述词语集进行预处理,生成参数集的步骤之后,所述计算机可读指令被处理器执行时,使得所述处理器还执行以下步骤:
    对所述词语集与所述参数集之间的映射关系进行变换处理,生成所述映射关系在预设空间上的投影关系;
    所述根据所述文本聚类列表生成所述训练模型的步骤包括:及
    根据所述映射关系、所述投影关系及所述文本聚类列表生成所述训练模型。
  19. 根据权利要求18所述的存储介质,其特征在于,所述根据所述参数及预置的训练模型确定所述文本信息的特征信息包括:
    根据所述参数、所述训练模型中的投影关系及所述训练模型中的文本聚类列表确定所述文本信息的特征信息。
  20. 根据权利要求19所述的存储介质,其特征在于,所述根据所述参数、所述训练模型中的投影关系及所述训练模型中的文本聚类列表确定所述文本信息的特征信息包括:
    根据所述投影关系将所述参数在所述预设空间上进行投影处理,生成投影参数;
    获取所述投影参数与所述文本聚类列表所在聚类区域的质心之间的最短距离;及
    根据所述最短距离确定所述文本信息在所述文本聚类列表中所属的类别、所述类别对应的文本数量、以及所述文本信息与所述文本聚类列表中训练样本之间相似度。
PCT/CN2018/114188 2017-11-20 2018-11-06 文本信息处理方法、计算机设备及计算机可读存储介质 WO2019096032A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711159103.2A CN108304442B (zh) 2017-11-20 2017-11-20 一种文本信息处理方法、装置及存储介质
CN201711159103.2 2017-11-20

Publications (1)

Publication Number Publication Date
WO2019096032A1 true WO2019096032A1 (zh) 2019-05-23

Family

ID=62869687

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114188 WO2019096032A1 (zh) 2017-11-20 2018-11-06 文本信息处理方法、计算机设备及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN108304442B (zh)
WO (1) WO2019096032A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304442B (zh) * 2017-11-20 2021-08-31 腾讯科技(深圳)有限公司 一种文本信息处理方法、装置及存储介质
CN109389418A (zh) * 2018-08-17 2019-02-26 国家电网有限公司客户服务中心 基于lda模型的供电服务客户诉求识别方法
CN109597888A (zh) * 2018-11-19 2019-04-09 北京百度网讯科技有限公司 建立文本领域识别模型的方法、装置
CN109361962B (zh) * 2018-11-26 2019-08-16 上海竑讯信息科技有限公司 互联网流媒体大数据弹幕信息处理系统及处理方法
CN109815488A (zh) * 2018-12-26 2019-05-28 出门问问信息科技有限公司 自然语言理解训练数据生成方法、装置、设备及存储介质
CN110058858B (zh) * 2019-04-19 2023-05-02 东信和平科技股份有限公司 一种json数据处理方法及装置
CN110110299B (zh) * 2019-04-28 2023-04-07 腾讯科技(上海)有限公司 文本变换方法、装置以及服务器
CN110135413B (zh) * 2019-05-08 2021-08-17 达闼机器人有限公司 一种字符识别图像的生成方法、电子设备和可读存储介质
CN110276081B (zh) * 2019-06-06 2023-04-25 百度在线网络技术(北京)有限公司 文本生成方法、装置及存储介质
CN110995926A (zh) * 2019-11-27 2020-04-10 惠州Tcl移动通信有限公司 一种信息提醒方法、装置、计算机设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315624A (zh) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 一种文本主题推荐的方法和装置
CN101763431A (zh) * 2010-01-06 2010-06-30 电子科技大学 基于海量网络舆情信息的pl聚类处理方法
CN103336766A (zh) * 2013-07-04 2013-10-02 微梦创科网络科技(中国)有限公司 短文本垃圾识别以及建模方法和装置
CN103441924A (zh) * 2013-09-03 2013-12-11 盈世信息科技(北京)有限公司 一种基于短文本的垃圾邮件过滤方法及装置
CN104112026A (zh) * 2014-08-01 2014-10-22 中国联合网络通信集团有限公司 一种短信文本分类方法及系统
CN108304442A (zh) * 2017-11-20 2018-07-20 腾讯科技(深圳)有限公司 一种文本信息处理方法、装置及存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6996575B2 (en) * 2002-05-31 2006-02-07 Sas Institute Inc. Computer-implemented system and method for text-based document processing
US8271422B2 (en) * 2008-11-29 2012-09-18 At&T Intellectual Property I, Lp Systems and methods for detecting and coordinating changes in lexical items
CN104217717B (zh) * 2013-05-29 2016-11-23 腾讯科技(深圳)有限公司 构建语言模型的方法及装置
CN105159998A (zh) * 2015-09-08 2015-12-16 海南大学 一种基于文档聚类关键词计算方法
CN105608070B (zh) * 2015-12-21 2019-01-25 中国科学院信息工程研究所 一种面向新闻标题的人物关系抽取方法
CN107229638A (zh) * 2016-03-24 2017-10-03 北京搜狗科技发展有限公司 一种文本信息处理方法及装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315624A (zh) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 一种文本主题推荐的方法和装置
CN101763431A (zh) * 2010-01-06 2010-06-30 电子科技大学 基于海量网络舆情信息的pl聚类处理方法
CN103336766A (zh) * 2013-07-04 2013-10-02 微梦创科网络科技(中国)有限公司 短文本垃圾识别以及建模方法和装置
CN103441924A (zh) * 2013-09-03 2013-12-11 盈世信息科技(北京)有限公司 一种基于短文本的垃圾邮件过滤方法及装置
CN104112026A (zh) * 2014-08-01 2014-10-22 中国联合网络通信集团有限公司 一种短信文本分类方法及系统
CN108304442A (zh) * 2017-11-20 2018-07-20 腾讯科技(深圳)有限公司 一种文本信息处理方法、装置及存储介质

Also Published As

Publication number Publication date
CN108304442A (zh) 2018-07-20
CN108304442B (zh) 2021-08-31

Similar Documents

Publication Publication Date Title
WO2019096032A1 (zh) 文本信息处理方法、计算机设备及计算机可读存储介质
US20210150142A1 (en) Method and apparatus for determining feature words and server
US10262059B2 (en) Method, apparatus, and storage medium for text information processing
US9858264B2 (en) Converting a text sentence to a series of images
US10445623B2 (en) Label consistency for image analysis
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
JP2022191412A (ja) マルチターゲット画像テキストマッチングモデルのトレーニング方法、画像テキスト検索方法と装置
CN113127605B (zh) 一种目标识别模型的建立方法、系统、电子设备及介质
WO2017101541A1 (zh) 文本聚类方法、装置及计算设备
CN109857957B (zh) 建立标签库的方法、电子设备及计算机存储介质
CN112528022A (zh) 主题类别对应的特征词提取和文本主题类别识别方法
CN109753646B (zh) 一种文章属性识别方法以及电子设备
US20230081015A1 (en) Method and apparatus for acquiring information, electronic device and storage medium
CN109300550B (zh) 医学数据关系挖掘方法及装置
US20230186613A1 (en) Sample Classification Method and Apparatus, Electronic Device and Storage Medium
CN116166814A (zh) 事件检测方法、装置、设备以及存储介质
CN114692778A (zh) 用于智能巡检的多模态样本集生成方法、训练方法及装置
CN113095073A (zh) 语料标签生成方法、装置、计算机设备和存储介质
CN110059180B (zh) 文章作者身份识别及评估模型训练方法、装置及存储介质
CN108009233B (zh) 一种图像还原方法、装置、计算机设备及存储介质
CN111708884A (zh) 文本分类方法、装置及电子设备
CN115905456B (zh) 一种数据识别方法、系统、设备及计算机可读存储介质
CN111708872B (zh) 对话方法、装置及电子设备
CN115378880B (zh) 流量分类方法、装置、计算机设备及存储介质
WO2021056740A1 (zh) 语言模型构建方法、系统、计算机设备及可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18878815

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 18878815

Country of ref document: EP

Kind code of ref document: A1