CN110020420A

CN110020420A - Text processing method, apparatus, computer device and storage medium

Info

Publication number: CN110020420A
Application number: CN201810023358.4A
Authority: CN
Inventors: 方小敏
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-01-10
Filing date: 2018-01-10
Publication date: 2019-07-16
Anticipated expiration: 2038-01-10
Also published as: CN110020420B

Abstract

This application relates to a text processing method, apparatus, computer device and storage medium. The method comprises: obtaining a target text; performing word segmentation on the target text to obtain multiple word segments; from the multiple word segments, taking word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to form combined word segments, the preset quantity threshold being the maximum number of word segments that form one combined word segment; and generating a feature vector of the target text according to each word segment and each combined word segment. The scheme of this application can represent the semantic information of the target text more accurately.

Description

Text processing method, apparatus, computer device and storage medium

Technical field

The present invention relates to the field of computer technology, and in particular to a text processing method, apparatus, computer device and storage medium.

Background art

With the development of science and technology, the requirements for intelligent text processing are becoming higher and higher.

In conventional methods, after word segmentation is performed on a text, the semantic information of the text is obtained directly from the segmentation result, and corresponding processing is performed based on that semantic information. However, word segmentation often loses some of the semantics of the text, so the semantic information obtained directly from the segmentation result in conventional methods is not accurate enough.

Summary of the invention

Based on this, it is necessary to provide a text processing method, apparatus, computer device and storage medium to address the problem that, in conventional methods, the semantic information of a text obtained directly from the word segmentation result is not accurate enough.

A text processing method, the method comprising:

obtaining a target text;

performing word segmentation on the target text to obtain multiple word segments;

from the multiple word segments, taking word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to form combined word segments, the preset quantity threshold being the maximum number of word segments that form one combined word segment; and

generating a feature vector of the target text according to each word segment and each combined word segment.

A text processing apparatus, the apparatus comprising:

a word segmentation module, configured to obtain a target text and to perform word segmentation on the target text to obtain multiple word segments;

a combination module, configured to take, from the multiple word segments, word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to form combined word segments, the preset quantity threshold being the maximum number of word segments that form one combined word segment; and

a vector generation module, configured to generate a feature vector of the target text according to each word segment and each combined word segment.

A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:

obtaining a target text;

A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:

obtaining a target text;

With the above text processing method, apparatus, computer device and storage medium, after word segmentation is performed on the target text to obtain word segments, word segments that are continuous in the target text and whose number is less than or equal to the preset quantity threshold are taken from the multiple word segments to form combined word segments, and the combined word segments reflect the order characteristics between the word segments in the target text. The feature vector of the target text is generated according to each word segment and each combined word segment, so the feature vector contains both the vector features obtained by word segmentation and the order characteristics between the word segments, and can therefore represent the semantic information of the target text more accurately.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a text processing method in one embodiment;

Fig. 2 is a schematic diagram of the principle of feature vector generation in one embodiment;

Fig. 3 is a schematic diagram of the principle of a text processing method in one embodiment;

Fig. 4 and Fig. 5 are diagrams showing the effect of classification using the text processing method of this application in one embodiment;

Fig. 6 is a schematic flowchart of a text processing method in another embodiment;

Fig. 7 is a block diagram of a text processing apparatus in one embodiment;

Fig. 8 is a block diagram of a text processing apparatus in another embodiment;

Fig. 9 is a block diagram of a text processing apparatus in another embodiment;

Fig. 10 is a schematic diagram of the internal structure of a computer device in one embodiment;

Fig. 11 is a schematic diagram of the internal structure of a computer device in one embodiment.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the present invention and are not intended to limit it.

Fig. 1 is a schematic flowchart of a text processing method in one embodiment. This embodiment is described mainly with the text processing method applied to a computer device, which may be a terminal or a server. Referring to Fig. 1, the method specifically includes the following steps:

S102: obtain a target text.

The target text is a text that needs to be represented by a feature vector. In one embodiment, the target text may be a short text. A short text has a short text length, usually within 100 words.

In one embodiment, the target text may include short texts such as a media content name, the name of a social group on a social network platform, or a personalized signature or status message posted on a social network platform. A social network platform is a platform for social interaction over a network. It may include an instant messaging platform (for example, WeChat, an application released by Tencent that provides instant messaging services for smart terminals).

It can be understood that the target text may also be a long text, for example an article or media content.

S104: perform word segmentation on the target text to obtain multiple word segments.

Word segmentation divides the text content of the target text into multiple word segments.

In one embodiment, the computer device may perform contextual semantic analysis on the target text and segment the target text according to the result of the semantic analysis.

In one embodiment, the computer device may match the words in the target text against a preset dictionary and segment the target text according to the matching result.

For example, if the target text is "Zhanjiang people in Guangzhou", word segmentation of the target text yields the word segments "Zhanjiang", "people", "in" and "Guangzhou".
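The dictionary-matching embodiment does not say which matching strategy is used. As one possible reading, the sketch below uses forward maximum matching, a common dictionary-based approach that is an assumption here and not taken from the source. The function name `segment` and the space-free string are hypothetical stand-ins for an unsegmented text.

```python
def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=1)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                segments.append(candidate)
                i += size
                break
    return segments


# Hypothetical preset dictionary and unsegmented input.
dictionary = {"zhanjiang", "people", "in", "guangzhou"}
print(segment("zhanjiangpeopleinguangzhou", dictionary))
```

Characters that match no dictionary word are emitted one at a time, so the function always terminates.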

S106: from the multiple word segments, take word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to form combined word segments.

The preset quantity threshold is the maximum number of word segments that form one combined word segment.

Word segments that are continuous in the target text are word segments that are adjacent in the target text. For example, if the target text is "Zhanjiang people in Guangzhou", the word segments "Zhanjiang" and "people" are continuous in the target text, whereas "Zhanjiang" and "in" are not, because "people" lies between them in the target text.

Specifically, the computer device may take, from the multiple word segments, word segments that satisfy two conditions: their number is less than or equal to the preset quantity threshold, and they are continuous in the target text.

In one embodiment, after performing word segmentation on the target text to obtain multiple word segments, the computer device may also label each word segment with a corresponding part-of-speech label and, according to the part-of-speech labels, select key word segments from the multiple word segments obtained by word segmentation. For example, auxiliary words contribute little to expressing the meaning of the target text, so word segments carrying the auxiliary-word label can be removed from the target text, leaving the key word segments. The computer device may then take, from the key word segments, word segments that are continuous in the target text and whose number is less than or equal to the preset quantity threshold.
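The key-segment selection can be sketched as a filter over labelled segments. The label names, the label assigned to each segment, and the function `select_key_segments` are all assumptions made for illustration. The source does not specify a tag set or a tagger.

```python
# Hypothetical part-of-speech labels; a real tagger would supply these.
tagged_segments = [
    ("Zhanjiang", "noun"),
    ("people", "noun"),
    ("in", "auxiliary"),  # label assumed purely for illustration
    ("Guangzhou", "noun"),
]


def select_key_segments(tagged, dropped_labels=frozenset({"auxiliary"})):
    """Keep segments whose label is not in dropped_labels, preserving text order."""
    return [segment for segment, label in tagged if label not in dropped_labels]


key_segments = select_key_segments(tagged_segments)
print(key_segments)
```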

The computer device may splice the taken word segments in the order in which they appear in the target text to obtain a combined word segment.

For example, in the combination {"Zhanjiang", "people"}, "Zhanjiang" comes before "people" in the target text "Zhanjiang people in Guangzhou", so splicing the word segments of the combination in their order in the target text yields the combined word segment {Zhanjiang people}.

S108: generate a feature vector of the target text according to each word segment and each combined word segment.

Specifically, the computer device may obtain a pre-stored dictionary, match each word segment and each combined word segment against the dictionary, and generate the feature vector of the target text according to the matching result. The dictionary collects words without duplicates. It can be understood that the pre-stored dictionary is obtained by updating an initial dictionary according to word segments and combined word segments.

With the above text processing method, after word segmentation is performed on the target text to obtain word segments, word segments that are continuous in the target text and whose number is less than or equal to the preset quantity threshold are taken from the multiple word segments to form combined word segments, and the combined word segments reflect the order characteristics between the word segments in the target text. The feature vector of the target text is generated according to each word segment and each combined word segment, so it contains both the vector features obtained by word segmentation and the order characteristics between the word segments, and can represent the semantic information of the target text more accurately.

In one embodiment, step S106 includes: obtaining the preset quantity threshold; successively selecting each integer in the range greater than 1 and less than or equal to the preset quantity threshold as a reference selection quantity; and, for each reference selection quantity, taking that number of word segments that are continuous in the target text from the multiple word segments to form combined word segments.

Specifically, the computer device may successively select each integer in the range greater than 1 and less than or equal to the preset quantity threshold as a reference selection quantity, and then, for each reference selection quantity, take continuous word segments from the multiple word segments to form combined word segments.

It can be understood that taking continuous word segments according to a reference selection quantity means that the number of word segments taken equals the reference selection quantity and that the word segments taken are continuous in the target text.

For example, if the preset quantity threshold is 4, successively selecting integers in the range greater than 1 and less than or equal to 4 gives the reference selection quantities 2, 3 and 4. The computer device may then take 2 continuous word segments, 3 continuous word segments and 4 continuous word segments from the multiple word segments.

In the above embodiment, each integer in the range greater than 1 and less than or equal to the preset quantity threshold is successively selected as a reference selection quantity, and continuous word segments are taken from the multiple word segments for each reference selection quantity. This obtains the combined word segments with order characteristics more comprehensively, so that the feature vector finally generated for the target text contains both the vector features obtained by word segmentation and the order characteristics between the word segments, and can represent the semantic information of the target text more accurately.

In one embodiment, taking continuous word segments from the multiple word segments for each reference selection quantity to form combined word segments includes: starting from the starting word segment among the multiple word segments, selecting the current word segment one by one; for each reference selection quantity, selecting the word segments that continuously follow the current word segment in the target text; and forming a combined word segment from the current word segment and the correspondingly selected word segments.

The starting word segment is the first word segment in the target text. Note that selecting the current word segment one by one means selecting it in ascending order of the positions of the word segments in the target text.

Specifically, the computer device may, starting from the starting word segment, select the current word segment one by one from the multiple word segments; for each reference selection quantity, select the word segments that continuously follow the current word segment in the target text; and form a combined word segment from the current word segment and the correspondingly selected word segments. It can be understood that the number of word segments consisting of the current word segment and the correspondingly selected word segments equals the reference selection quantity.

It can be understood that, after selecting the continuous word segments for one current word segment, the computer device may select the next word segment, in ascending order of position in the target text, as the current word segment and repeat the steps described above, iterating in this way until a selection stop condition is met.

The selection stop condition is the condition for stopping the selection of word segments. It may be that the number of combined word segments obtained reaches a preset threshold, that a preset number of word segments have been processed as the current word segment, or that the iteration described above has finished.

Note that when a later word segment is the current word segment, the current word segment together with the word segments after it may be too few to satisfy some of the reference selection quantities. In that case those reference selection quantities are ignored and no corresponding processing is performed.

For example, the target text is "Zhanjiang people in Guangzhou", the word segments are "Zhanjiang", "people", "in" and "Guangzhou", and the preset quantity threshold is 4, so the reference selection quantities are 2, 3 and 4. Because the current word segment plus the adjacent word segments selected after it must together equal the reference selection quantity, for the reference selection quantities 2, 3 and 4 the 1, 2 and 3 adjacent word segments following "Zhanjiang" are selected in order, giving {"Zhanjiang", "people"}, {"Zhanjiang", "people", "in"} and {"Zhanjiang", "people", "in", "Guangzhou"}. The computer device then takes the next word segment, "people", as the current word segment. Only 2 word segments remain after "people", so "people" and the remaining 2 word segments cannot satisfy the reference selection quantity 4, and that reference selection quantity is ignored. Starting from "people", the computer device selects the next 1 and 2 adjacent word segments in order, giving {"people", "in"} and {"people", "in", "Guangzhou"}. Next, the computer device takes "in" as the current word segment and selects the 1 adjacent word segment "Guangzhou", giving {"in", "Guangzhou"}. Because "Guangzhou" is the last word segment, no word segment follows it when it is the current word segment, so nothing is done.
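The iteration over current word segments can be sketched as below. The function name `combined_segments` is hypothetical. Joining with a space is an assumption that suits the English rendering of the example, and a text without word separators would be spliced with an empty string instead.

```python
def combined_segments(segments, threshold, separator=" "):
    """Build combined word segments in the order described in the example."""
    combined = []
    for start in range(len(segments)):             # current word segment, in text order
        for quantity in range(2, threshold + 1):   # reference selection quantities
            if start + quantity > len(segments):
                break  # too few following segments; ignore this and larger quantities
            combined.append(separator.join(segments[start:start + quantity]))
    return combined


segments = ["Zhanjiang", "people", "in", "Guangzhou"]
print(combined_segments(segments, 4))
```

With a threshold of 4 this yields the six combinations of the example, in the same order: three starting at "Zhanjiang", two starting at "people", one starting at "in", and none for "Guangzhou".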

In the above embodiment, the current word segment is selected one by one starting from the starting word segment among the multiple word segments; for each reference selection quantity, the word segments that continuously follow the current word segment in the target text are selected; and a combined word segment is formed from the current word segment and the correspondingly selected word segments. The combined word segments obtained in this way have order characteristics and also preserve continuity, which avoids the semantic damage caused by arbitrary combination. The combined word segments can therefore represent the target text more accurately, and the feature vector obtained from the combined word segments and the word segments is more accurate.

In one embodiment, step S108 includes: obtaining the statistical feature value corresponding to each word segment and each combined word segment; determining the word in the dictionary that matches each word segment and each combined word segment; adding each statistical feature value as a vector element at the position in a vector template corresponding to the matched word, and setting to a default value the vector elements at positions in the vector template corresponding to dictionary words that match no word segment or combined word segment, to obtain the feature vector of the target text. The positions in the vector template correspond one-to-one to the words in the dictionary.

A statistical feature value is a value, determined by statistical means, that characterizes a word segment or combined word segment. It may be the term frequency (TF), the inverse document frequency (IDF), the term frequency-inverse document frequency (TF-IDF), or the like. Term frequency is the frequency with which a word occurs in the text it belongs to. Inverse document frequency is a measure of the general importance of a word; the inverse document frequency of a word can be obtained by dividing the total number of documents by the number of documents containing the word and then taking the logarithm of the quotient. TF-IDF is the product of the term frequency and the inverse document frequency.
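The three statistics can be sketched directly from these definitions. The function names and the toy documents are hypothetical. Using raw counts for term frequency and the natural logarithm are assumptions, since the text fixes neither a normalization nor a logarithm base.

```python
import math
from collections import Counter


def term_frequency(segments):
    """Occurrences of each segment in one text (raw counts, an assumption)."""
    return Counter(segments)


def inverse_document_frequency(word, documents):
    """log(total documents / documents containing the word).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for document in documents if word in document)
    return math.log(len(documents) / containing)


def tf_idf(word, segments, documents):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(segments)[word] * inverse_document_frequency(word, documents)


# Hypothetical corpus of three already-segmented texts.
documents = [["Zhanjiang", "people"], ["Guangzhou", "people"], ["Shanghai"]]
print(tf_idf("people", documents[0], documents))
```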

It can be understood that the dictionary collects words without duplicates, and that the dictionary is obtained by updating an initial dictionary according to word segments and combined word segments.

Note that the computer device may perform feature analysis on each word segment and each combined word segment to obtain the corresponding statistical feature values, or may directly obtain existing statistical feature values corresponding to each word segment and each combined word segment.

Specifically, the computer device may match each word segment and each combined word segment against the words in the pre-stored dictionary to obtain the dictionary word that each of them matches. The computer device may determine the position in the vector template that corresponds to each matched word, add each statistical feature value as a vector element at the position corresponding to its matched word, determine the dictionary words that match no word segment or combined word segment, and set the vector elements at the positions corresponding to those unmatched words to a default value, thereby obtaining the feature vector of the target text. The positions in the vector template correspond one-to-one to the words in the dictionary. In one embodiment, the default value may be 0.

In one embodiment, obtaining the statistical feature value corresponding to each word segment and each combined word segment includes: calculating the term frequency of each word segment and each combined word segment in the target text. Adding each statistical feature value as a vector element at the position in the vector template corresponding to the matched word, and setting to a default value the vector elements at positions corresponding to dictionary words that match no word segment or combined word segment, to obtain the feature vector of the target text, includes: adding each term frequency as a vector element at the position in the vector template corresponding to the matched word, and setting to a default value the vector elements at positions corresponding to dictionary words that match no word segment or combined word segment, to obtain the feature vector of the target text.

Fig. 2 is a schematic diagram of the principle of feature vector generation in one embodiment. Referring to Fig. 2, the words in the dictionary are {Zhanjiang, people, in, Guangzhou, United States, Shanghai, Zhanjiang people, Zhanjiang people in, Zhanjiang people in Guangzhou, people in, people in Guangzhou, in Guangzhou}. Suppose the word segments and combined word segments of the target text obtained in the manner described above are "Zhanjiang", "people", "in", "Guangzhou", "Zhanjiang people", "Zhanjiang people in", "Zhanjiang people in Guangzhou", "people in", "people in Guangzhou" and "in Guangzhou". Taking the term frequency of each word segment and combined word segment as its statistical feature value, and setting to 0 the vector elements at the positions in the vector template corresponding to the unmatched dictionary words "United States" and "Shanghai", the feature vector of the target text is (1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1).
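The Fig. 2 example can be reproduced with a short sketch. The function name `feature_vector` is hypothetical. The dictionary is modelled as an ordered list whose index plays the role of the position in the vector template, and term frequency is taken as the raw count.

```python
from collections import Counter


def feature_vector(dictionary, items, default=0):
    """One element per dictionary word: its count in `items`, else the default."""
    frequencies = Counter(items)
    return [frequencies[word] if word in frequencies else default
            for word in dictionary]


dictionary = [
    "Zhanjiang", "people", "in", "Guangzhou", "United States", "Shanghai",
    "Zhanjiang people", "Zhanjiang people in", "Zhanjiang people in Guangzhou",
    "people in", "people in Guangzhou", "in Guangzhou",
]
items = [
    "Zhanjiang", "people", "in", "Guangzhou",
    "Zhanjiang people", "Zhanjiang people in", "Zhanjiang people in Guangzhou",
    "people in", "people in Guangzhou", "in Guangzhou",
]
print(feature_vector(dictionary, items))
```

Items missing from the dictionary are silently ignored in this sketch. The description instead assumes the dictionary has already been updated to contain every word segment and combined word segment.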

In the above embodiment, each word segment and each combined word segment is matched against the pre-stored dictionary; each statistical feature value is added as a vector element at the position in the vector template corresponding to the matched word; and the vector elements at the positions corresponding to unmatched dictionary words are set to a default value, giving the feature vector of the target text. The resulting feature vector contains both the vector features obtained by word segmentation and the order characteristics between the word segments, and can represent the semantic information of the target text more accurately.

In one embodiment, the method further includes: inputting the feature vector of the target text into a classification model, which outputs a corresponding classification label; and marking the target text with the classification label.

The classification model is a model obtained in advance by machine learning training on the feature vectors of sample texts and their corresponding classification labels. It outputs a classification label according to the feature vector of an input text, so as to determine the category of the object that the text identifies. For example, it outputs a classification label for an input social group name, so as to determine the category of the social group identified by that name.

Specifically, the computer device may input the feature vector of the target text into the classification model, which outputs the corresponding classification label. The computer device may then mark the target text with that classification label.

In the above embodiment, the classification label of the target text is obtained from a feature vector that expresses the target text more accurately, which improves the accuracy of classifying the target text.

In one embodiment, the method includes a step of training the classification model, which specifically includes: obtaining sample texts and corresponding classification labels; generating feature vectors corresponding to the sample texts; and performing machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels to obtain the classification model.

Specifically, the computer device may obtain sample texts and corresponding classification labels, perform feature analysis on the sample texts to generate the corresponding feature vectors, and perform machine learning training according to those feature vectors and the corresponding classification labels to obtain the classification model. The classification model outputs a classification label according to the feature vector of an input text, so as to determine the category of the object that the text identifies.

In one embodiment, the computer device may perform machine learning training based on a multinomial Bayesian classification algorithm (multinomial Bayesian classifier) or a neural network algorithm (neural networks), using the feature vectors corresponding to the sample texts and the corresponding classification labels, to obtain the classification model.
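As an illustration of the multinomial Bayesian option, the sketch below trains a small multinomial naive Bayes classifier on count vectors. It is a minimal sketch and not the implementation described in the source. The function names, the Laplace smoothing parameter `alpha`, the three-word dictionary and the training data are all assumptions.

```python
import math
from collections import defaultdict


def train(vectors, labels, alpha=1.0):
    """Fit class log-priors and per-feature log-likelihoods (Laplace smoothing)."""
    n_features = len(vectors[0])
    totals = defaultdict(lambda: [0.0] * n_features)
    class_counts = defaultdict(int)
    for vector, label in zip(vectors, labels):
        class_counts[label] += 1
        for i, value in enumerate(vector):
            totals[label][i] += value
    model = {}
    for label, feature_totals in totals.items():
        denominator = sum(feature_totals) + alpha * n_features
        model[label] = (
            math.log(class_counts[label] / len(labels)),
            [math.log((total + alpha) / denominator) for total in feature_totals],
        )
    return model


def classify(model, vector):
    """Return the label with the highest posterior log-score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(v * ll for v, ll in zip(vector, log_likelihood))
    return max(model, key=score)


# Hypothetical count vectors over the dictionary ["project", "family", "dinner"].
vectors = [[2, 0, 0], [1, 0, 1], [0, 2, 1], [0, 1, 2]]
labels = ["work", "work", "family", "family"]
model = train(vectors, labels)
print(classify(model, [1, 0, 0]), classify(model, [0, 1, 1]))
```

Working in log space avoids underflow when feature vectors are long, which is the usual reason to prefer it over multiplying probabilities directly.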

In one embodiment, the target text is a social group name and the corresponding classification label is a group purpose label. The method further includes: performing statistical analysis on the group purpose labels in a social network platform; screening the group purpose labels according to the result of the statistical analysis; and, in the social groups corresponding to the screened group purpose labels, recommending information corresponding to those group purpose labels.

A group purpose label is a label that characterizes the purpose of a social group. Group purpose labels may include labels such as work, family, marketing, classmates, education, reading or shopping. A social network platform is a platform for social interaction over a network. It may include an instant messaging platform (for example, WeChat, an application released by Tencent that provides instant messaging services for smart terminals) and a social information sharing platform. A social information sharing platform is a platform that realizes social interaction through the sharing of information, for example a microblog, blog, forum or post bar.

Specifically, the computer device may perform statistical analysis on the group purpose labels in the social network platform and screen the group purpose labels according to the result. In one embodiment, the computer device may count the number of social groups corresponding to each group purpose label and screen the group purpose labels according to these counts, for example by selecting the group purpose labels whose social group counts rank within a preset number of top positions. It can be understood that the number of social groups corresponding to a group purpose label is the number of social groups identified by that label. If a social group has a corresponding group purpose label, the purpose of that social group is the purpose characterized by the label.
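The count-and-rank screening can be sketched with a counter. The function name `screen_labels`, the group names and the label assignments are hypothetical illustration data.

```python
from collections import Counter


def screen_labels(label_by_group, top_k):
    """Return the top_k group purpose labels ranked by number of social groups."""
    counts = Counter(label_by_group.values())
    return [label for label, _ in counts.most_common(top_k)]


# Hypothetical mapping from social group name to its group purpose label.
label_by_group = {
    "group_1": "work", "group_2": "work", "group_3": "work",
    "group_4": "reading", "group_5": "reading",
    "group_6": "shopping",
}
print(screen_labels(label_by_group, top_k=2))
```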

The computer device may obtain information corresponding to each screened group purpose label. The obtained information matches the group purpose characterized by the label and serves that purpose. The computer device may then recommend the obtained information in the social groups corresponding to the screened group purpose labels.

For example, if the screened group purpose labels include a reading label, articles corresponding to the reading label can be obtained and recommended in the social groups corresponding to the reading label. As another example, if the screened group purpose labels include a shopping label, resource promotion information (for example, advertising information) corresponding to the shopping label can be obtained and recommended in the social groups corresponding to the shopping label.

In the above embodiment, the group purpose label is obtained from a feature vector that expresses the social group name more accurately, so the label represents more accurately the purpose of the social group identified by that name, and the group purpose labels screened by statistical analysis of these labels are in turn more accurate. Recommending, in the social groups corresponding to the screened group purpose labels, the information corresponding to those labels therefore improves the accuracy of information recommendation.

In one embodiment, the target text is a media content name, and the corresponding classification label is a media content type label. The method further includes: obtaining the media content type label corresponding to a target user identifier; querying the media content names corresponding to the obtained media content type label; and pushing the queried media content names and the corresponding media content according to the target user identifier.

Here, a media content name is the name given to a piece of media content. Media content is information content that can be transmitted and disseminated. A media content type label is a label that characterizes the type of media content. Media content types may include sports, entertainment, culture, politics, military affairs and the like. The target user identifier is the identifier of the user to whom media content push information is to be pushed.

It can be understood that the media content type label corresponding to the target user identifier characterizes the media content type in which the user denoted by that identifier is interested.

Specifically, the computer equipment stores a correspondence between user identifiers and media content type labels. The computer equipment can obtain the target user identifier and, according to this correspondence, obtain the media content type label corresponding to the target user identifier. The computer equipment can also look up the media content type label corresponding to the target user identifier in a database or on another device.

The computer equipment pre-stores a correspondence between media content names and media content type labels. In one embodiment, each media content name in the computer equipment is marked with its media content type label. According to this correspondence, the computer equipment can query the media content names corresponding to the obtained media content type label.

The computer equipment can obtain the media content corresponding to the queried media content names. It can be understood that the computer equipment may pre-store a correspondence between media content names and media content, and obtain the media content corresponding to the queried media content names according to this correspondence. The computer equipment can also obtain the media content corresponding to the queried media content names from a database or another device.

The computer equipment can push the queried media content names and the corresponding media content according to the target user identifier. In one embodiment, the computer equipment can push the media content push information to the terminal corresponding to the target user identifier.
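By way of illustration only, the chain of correspondences described above (user identifier to media content type label, label to media content names, name to media content) can be sketched with in-memory mappings. The mapping names, the sample identifiers and the function build_push_list are hypothetical and introduced only for this sketch; the application does not specify how the correspondences are stored.

```python
from typing import Dict, List, Tuple

# Hypothetical correspondences; a real system might keep these in a database.
USER_TO_TYPE_LABEL: Dict[str, str] = {"user_42": "sports"}
TYPE_LABEL_TO_NAMES: Dict[str, List[str]] = {
    "sports": ["Cup final recap"],
    "culture": ["Museum night"],
}
NAME_TO_CONTENT: Dict[str, str] = {
    "Cup final recap": "<body of the recap>",
    "Museum night": "<body of the museum piece>",
}


def build_push_list(user_id: str) -> List[Tuple[str, str]]:
    """Return (media content name, media content) pairs to push to one user."""
    # Step 1: user identifier -> media content type label.
    type_label = USER_TO_TYPE_LABEL.get(user_id)
    if type_label is None:
        return []
    # Step 2: label -> media content names; step 3: name -> media content.
    names = TYPE_LABEL_TO_NAMES.get(type_label, [])
    return [(name, NAME_TO_CONTENT[name]) for name in names if name in NAME_TO_CONTENT]


print(build_push_list("user_42"))
```

An identifier with no stored label simply yields an empty push list in this sketch; how such a case is handled in practice is not stated in the application.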

In the above embodiment, the media content type label is obtained from a feature vector that expresses the media content name more accurately, so the label more accurately represents the type of the media content that the name denotes. Obtaining media content names according to the media content type label corresponding to the target user identifier therefore yields names that are more accurate and better meet the user's needs. Pushing these media content names and the corresponding media content according to the target user identifier in turn improves the accuracy of the media content pushed to the target user.

Fig. 3 is a schematic diagram of the principle of the text processing method in one embodiment. Referring to Fig. 3, the computer equipment performs word segmentation on the target text and obtains word segments 1 to 3. With a preset quantity threshold of 3, the quantities 2 and 3 serve as reference selection quantities. Choosing continuous word segments from the 3 word segments according to reference selection quantity 2 yields word segments 1 and 2, and word segments 2 and 3. Word segments 1 and 2 give portmanteau word segment A, and word segments 2 and 3 give portmanteau word segment B. Choosing continuous word segments according to reference selection quantity 3 yields word segments 1, 2 and 3, which give portmanteau word segment C. The computer equipment can match word segments 1 to 3 and portmanteau word segments A to C against the words in the dictionary and determine the positions in the vector template of the matched dictionary words. The computer equipment can then obtain the statistical feature values of word segments 1 to 3 and portmanteau word segments A to C and fill each value into the vector template at the position of the corresponding matched word.
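The selection of continuous word segments described with reference to Fig. 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name build_combined_segments, the joiner parameter and the placeholder segments "w1" to "w3" are introduced here and do not appear in the application.

```python
from typing import List


def build_combined_segments(segments: List[str], max_count: int, joiner: str = "") -> List[str]:
    """Merge runs of 2..max_count adjacent word segments, keeping their order in the text."""
    combined: List[str] = []
    # Reference selection quantities: integers greater than 1 and at most the threshold.
    for size in range(2, max_count + 1):
        # Take each word segment in turn as the current segment and merge it
        # with the segments that follow it, as long as a full run of `size` exists.
        for start in range(len(segments) - size + 1):
            combined.append(joiner.join(segments[start:start + size]))
    return combined


# Three word segments and a preset quantity threshold of 3, as in Fig. 3.
print(build_combined_segments(["w1", "w2", "w3"], max_count=3, joiner="+"))
# ['w1+w2', 'w2+w3', 'w1+w2+w3']
```

The three results play the roles of portmanteau word segments A, B and C. The "+" joiner is used only to make the output readable; for text without spaces between words an empty joiner may be more natural.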

Fig. 4 and Fig. 5 are schematic diagrams of the effect of classification using the text processing method of the present application in one embodiment. Fig. 4 is a cross table of the probabilities that texts belong to each classification, obtained by performing classification on text vectors generated with a traditional word segmentation method. Taking the first row as an example, "classification 1" at the start of the row indicates the classification to which the text actually belongs, and the percentages in the row are the probabilities of belonging to each classification obtained when classification is performed on the feature vector generated by traditional word segmentation. For example, the probability of belonging to "classification 1" in the first column is 93.43%, and the probability of belonging to "classification 2" in the second column is 0.13%. Fig. 5 is the corresponding cross table obtained when classification is performed on the feature vector produced by the text processing method in the embodiments of the present application. Again taking the first row as an example, "classification 1" indicates the classification to which the text actually belongs, and the percentages in the row are the probabilities of belonging to each classification. For example, the probability of belonging to "classification 1" in the first column is 94.05%, and the probability of belonging to "classification 2" in the second column is 0.12%. The probabilities of matching the correct classification in Fig. 5 are mostly higher than those in Fig. 4. For example, for text that actually belongs to "classification 1", the probability of belonging to "classification 1" is 94.05% in Fig. 5, which is higher than the 93.43% in Fig. 4. That is, classification based on the feature vector obtained by the text processing method provided in the embodiments of the present application is more accurate than classification based on the text vector generated by a traditional word segmentation method.
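For readers who wish to build a cross table of this kind, one possible computation is sketched below. It assumes that each row is normalized over the texts that actually belong to that classification, which the application does not state explicitly. The function name crosstab_percent and the toy labels are hypothetical and are not the data behind Fig. 4 or Fig. 5.

```python
from collections import Counter
from typing import Dict, List


def crosstab_percent(actual: List[str], predicted: List[str], classes: List[str]) -> Dict[str, Dict[str, float]]:
    """Row-normalized cross table: rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    table: Dict[str, Dict[str, float]] = {}
    for row in classes:
        row_total = sum(counts[(row, col)] for col in classes)
        table[row] = {
            col: (100.0 * counts[(row, col)] / row_total if row_total else 0.0)
            for col in classes
        }
    return table


# Toy data made up for this sketch.
actual = ["c1", "c1", "c1", "c1", "c2", "c2"]
predicted = ["c1", "c1", "c1", "c2", "c2", "c2"]
for row, cells in crosstab_percent(actual, predicted, ["c1", "c2"]).items():
    print(row, cells)
```

The diagonal cells of such a table correspond to the "matched to the correct classification" percentages compared between the two figures.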

As shown in Fig. 6, in one embodiment, another text processing method is provided, which specifically includes the following steps:

S602, obtain a social group name; perform word segmentation on the social group name to obtain multiple word segments.

S604, obtain a preset quantity threshold, where the preset quantity threshold is the maximum number of word segments that constitute one portmanteau word segment.

S606, successively choose integers in the range greater than 1 and less than or equal to the preset quantity threshold as reference selection quantities.

S608, starting from the first word segment among the multiple word segments, choose each word segment in turn as the current word segment; for each reference selection quantity, choose, starting from the current word segment, word segments that are continuous in the target text.

S610, combine the current word segment and the correspondingly chosen word segments into a portmanteau word segment.

S612, obtain the statistical feature value corresponding to each word segment and each portmanteau word segment; determine the word in the dictionary that matches each word segment and each portmanteau word segment.

S614, add each statistical feature value, as a vector element, to the vector template at the position corresponding to the matched word, and set the vector elements at the positions in the vector template that correspond to dictionary words not matched by any word segment or portmanteau word segment to a default value, thereby obtaining the feature vector of the target text.

Here, the positions in the vector template correspond one-to-one to the words in the dictionary.

S616, input the feature vector of the social group name into the classification model and output the corresponding group purpose label; mark the social group name with this group purpose label.

In one embodiment, the method further includes a classification model training step, which specifically includes: obtaining sample texts and corresponding classification labels; generating the feature vector corresponding to each sample text; and performing machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels to obtain the classification model.

S618, perform statistical analysis on the group purpose labels in the social network platform.

S620, screen the group purpose labels according to the result of the statistical analysis, and recommend, in the social groups characterized by the social group names corresponding to the selected group purpose labels, information corresponding to the selected group purpose labels.
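Steps S618 and S620 can be sketched as follows. The form of the statistical analysis is not fixed by this passage, so the sketch assumes, purely for illustration, that labels are counted across groups and the top_k most frequent ones are selected. The names screen_labels, recommend and top_k and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def screen_labels(group_labels: Dict[str, str], top_k: int) -> List[str]:
    """Count how many groups carry each purpose label and keep the most frequent ones."""
    counts = Counter(group_labels.values())
    return [label for label, _ in counts.most_common(top_k)]


def recommend(group_labels: Dict[str, str], label_info: Dict[str, str], top_k: int) -> Dict[str, str]:
    """Map each group whose label was selected to the information for that label."""
    selected = set(screen_labels(group_labels, top_k))
    return {
        group: label_info[label]
        for group, label in group_labels.items()
        if label in selected and label in label_info
    }


# Made-up groups and labels for this sketch.
groups = {"g1": "reading", "g2": "reading", "g3": "shopping",
          "g4": "reading", "g5": "shopping", "g6": "gaming"}
info = {"reading": "article", "shopping": "promotion", "gaming": "game news"}
print(recommend(groups, info, top_k=2))
```

Other screening criteria, such as a minimum share of groups or growth over time, would fit the same two-stage shape of analysis followed by recommendation.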

In the above text processing method, after word segmentation is performed on the target text to obtain word segments, different combinations are taken from the multiple word segments, and the word segments of each combination are merged in their order in the target text to obtain portmanteau word segments. A portmanteau word segment can embody the order features among the word segments in the target text. Because the feature vector of the target text is generated according to each word segment and each portmanteau word segment, it includes both the vector features obtained by word segmentation and the order features among the word segments, and can therefore represent the semantic information of the target text more accurately.

As shown in Fig. 7, in one embodiment, a text processing apparatus 700 is provided. The apparatus includes a word segmentation module 702, a combination module 704 and a vector generation module 706, wherein:

The word segmentation module 702 is configured to obtain a target text and perform word segmentation on the target text to obtain multiple word segments.

The combination module 704 is configured to take, from the multiple word segments, word segments that number no more than a preset quantity threshold and are continuous in the target text, to constitute portmanteau word segments; the preset quantity threshold is the maximum number of word segments that constitute one portmanteau word segment.

The vector generation module 706 is configured to generate the feature vector of the target text according to each word segment and each portmanteau word segment.

In one embodiment, the combination module 704 is further configured to obtain the preset quantity threshold; successively choose integers in the range greater than 1 and less than or equal to the preset quantity threshold as reference selection quantities; and, from the multiple word segments, take word segments that are continuous in the target text according to each reference selection quantity, to constitute portmanteau word segments.

In one embodiment, the combination module 704 is further configured to choose each word segment in turn as the current word segment, starting from the first word segment among the multiple word segments; for each reference selection quantity, choose, starting from the current word segment, word segments that are continuous in the target text; and combine the current word segment and the correspondingly chosen word segments into a portmanteau word segment.

In one embodiment, the vector generation module 706 is further configured to obtain the statistical feature value corresponding to each word segment and each portmanteau word segment; determine the word in the dictionary that matches each word segment and each portmanteau word segment; and add each statistical feature value, as a vector element, to the vector template at the position corresponding to the matched word, and set the vector elements at the positions in the vector template that correspond to dictionary words not matched by any word segment or portmanteau word segment to a default value, thereby obtaining the feature vector of the target text. The positions in the vector template correspond one-to-one to the words in the dictionary.
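The filling of the vector template can be sketched as follows. This passage does not fix which statistical feature value is used, so the sketch assumes a raw occurrence count; the function name build_feature_vector, the default value of 0.0 and the sample dictionary are likewise assumptions made only for illustration.

```python
from collections import Counter
from typing import Dict, List


def build_feature_vector(units: List[str], dictionary: List[str], default: float = 0.0) -> List[float]:
    """Fill a vector template that has one position per dictionary word."""
    position: Dict[str, int] = {word: index for index, word in enumerate(dictionary)}
    # Every position starts at the default value, so unmatched words keep it.
    vector = [default] * len(dictionary)
    # Assumed statistic: how often each word segment or portmanteau segment occurs.
    for unit, count in Counter(units).items():
        if unit in position:
            vector[position[unit]] = float(count)
    return vector


dictionary = ["book", "club", "book club", "shop"]
units = ["book", "club", "book club"]  # word segments plus one portmanteau word segment
print(build_feature_vector(units, dictionary))  # [1.0, 1.0, 1.0, 0.0]
```

Because "book club" has its own position, the vector distinguishes a text where the two words are adjacent from one where they merely both occur, which is the order information the portmanteau word segments are meant to carry.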

As shown in Fig. 8, in one embodiment, the apparatus 700 further includes:

A classification module 708, configured to input the feature vector of the target text into the classification model and output the corresponding classification label, and to mark the target text with the classification label.

In one embodiment, the classification module 708 is further configured to obtain sample texts and corresponding classification labels; generate the feature vector corresponding to each sample text; and perform machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels to obtain the classification model.
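The training step can be sketched as follows. The application does not name a learning algorithm, so a simple multiclass perceptron is substituted here purely for illustration; the function names train and predict, the epoch count and the two toy samples are assumptions of this sketch.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]


def predict(weights: Weights, vector: Sequence[float]) -> str:
    """Return the label whose weight vector gives the highest score."""
    return max(weights, key=lambda label: sum(w * v for w, v in zip(weights[label], vector)))


def train(samples: List[List[float]], labels: List[str], epochs: int = 10) -> Weights:
    """Multiclass perceptron: on a mistake, move weights toward the true label."""
    weights: Weights = {label: [0.0] * len(samples[0]) for label in sorted(set(labels))}
    for _ in range(epochs):
        for vector, label in zip(samples, labels):
            guess = predict(weights, vector)
            if guess != label:
                for i, value in enumerate(vector):
                    weights[label][i] += value
                    weights[guess][i] -= value
    return weights


# Toy feature vectors (sample texts already vectorized) and their classification labels.
sample_vectors = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
sample_labels = ["shopping", "reading"]
model = train(sample_vectors, sample_labels)
print(predict(model, [1.0, 0.0, 0.0]))  # reading
```

In practice any supervised classifier that accepts fixed-length vectors could take the place of the perceptron; the only requirement taken from the text is that training pairs each sample feature vector with its classification label.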

As shown in Fig. 9, in one embodiment, the target text is a social group name, and the corresponding classification label is a group purpose label. The apparatus 700 further includes:

An information recommendation module 710, configured to perform statistical analysis on the group purpose labels in the social network platform; screen the group purpose labels according to the result of the statistical analysis; and recommend, in the social groups corresponding to the selected group purpose labels, information corresponding to the selected group purpose labels.

In one embodiment, the target text is a media content name, and the corresponding classification label is a media content type label. The information recommendation module 710 is further configured to obtain the media content type label corresponding to a target user identifier; query the media content names corresponding to the obtained media content type label; and push the queried media content names and the corresponding media content according to the target user identifier.

Fig. 10 is a schematic diagram of the internal structure of computer equipment in one embodiment. The computer equipment includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer equipment can store an operating system and a computer program. When executed, the computer program can cause the processor to perform a text processing method. The processor of the computer equipment provides computing and control capabilities and supports the operation of the entire computer equipment. The internal memory can also store a computer program which, when executed by the processor, can cause the processor to perform a text processing method. The network interface of the computer equipment is used for network communication.

Those skilled in the art will understand that the structure shown in Fig. 10 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer equipment to which the solution is applied. Specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

Fig. 11 is a schematic diagram of the internal structure of computer equipment in one embodiment. The computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer equipment can store an operating system and a computer program. When executed, the computer program can cause the processor to perform a text processing method. The processor of the computer equipment provides computing and control capabilities and supports the operation of the entire computer equipment. The internal memory can also store a computer program which, when executed by the processor, can cause the processor to perform a text processing method. The network interface of the computer equipment is used for network communication. The display screen of the computer equipment may be a liquid crystal display, an electronic ink display or the like. The input device of the computer equipment may be a touch layer covering the display screen, a key, trackball or trackpad arranged on the terminal housing, or an external keyboard, trackpad, mouse or the like. The computer equipment may be a personal computer, a mobile terminal or an in-vehicle device; the mobile terminal includes at least one of a mobile phone, a tablet computer, a personal digital assistant, a wearable device and the like.

Those skilled in the art will understand that the structure shown in Fig. 11 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer equipment to which the solution is applied. Specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, the text processing apparatus provided by the present application can be implemented in the form of a computer program that can run on the computer equipment shown in Fig. 10 or Fig. 11. The non-volatile storage medium of the computer equipment can store the program modules that make up the text processing apparatus, for example, the word segmentation module 702, the combination module 704 and the vector generation module 706 shown in Fig. 7. The computer program composed of these program modules causes the computer equipment to perform the steps of the text processing method of the embodiments of the present application described in this specification. For example, the computer equipment can, through the word segmentation module 702 of the text processing apparatus 700 shown in Fig. 7, obtain a target text and perform word segmentation on the target text to obtain multiple word segments, and, through the combination module 704, take from the multiple word segments word segments that number no more than a preset quantity threshold and are continuous in the target text, to constitute portmanteau word segments; the preset quantity threshold is the maximum number of word segments that constitute one portmanteau word segment. The computer equipment can, through the vector generation module 706, generate the feature vector of the target text according to each word segment and each portmanteau word segment.

In one embodiment, computer equipment is provided, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps: obtaining a target text; performing word segmentation on the target text to obtain multiple word segments; taking, from the multiple word segments, word segments that number no more than a preset quantity threshold and are continuous in the target text, to constitute portmanteau word segments, the preset quantity threshold being the maximum number of word segments that constitute one portmanteau word segment; and generating the feature vector of the target text according to each word segment and each portmanteau word segment.

In one embodiment, taking, from the multiple word segments, word segments that number no more than the preset quantity threshold and are continuous in the target text, to constitute portmanteau word segments, includes: obtaining the preset quantity threshold; successively choosing integers in the range greater than 1 and less than or equal to the preset quantity threshold as reference selection quantities; and, from the multiple word segments, taking word segments that are continuous in the target text according to each reference selection quantity, to constitute portmanteau word segments.

In one embodiment, generating the feature vector of the target text according to each word segment and each portmanteau word segment includes: obtaining the statistical feature value corresponding to each word segment and each portmanteau word segment; determining the word in the dictionary that matches each word segment and each portmanteau word segment; and adding each statistical feature value, as a vector element, to the vector template at the position corresponding to the matched word, and setting the vector elements at the positions in the vector template that correspond to dictionary words not matched by any word segment or portmanteau word segment to a default value, thereby obtaining the feature vector of the target text. The positions in the vector template correspond one-to-one to the words in the dictionary.

In one embodiment, the computer program further causes the processor to perform the following steps: inputting the feature vector of the target text into the classification model and outputting the corresponding classification label; and marking the target text with the classification label.

In one embodiment, the computer program further causes the processor to perform the following steps: obtaining sample texts and corresponding classification labels; generating the feature vector corresponding to each sample text; and performing machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels to obtain the classification model.

In one embodiment, the target text is a social group name, and the corresponding classification label is a group purpose label. The computer program further causes the processor to perform the following steps: performing statistical analysis on the group purpose labels in the social network platform; screening the group purpose labels according to the result of the statistical analysis; and recommending, in the social groups corresponding to the selected group purpose labels, information corresponding to the selected group purpose labels.

In one embodiment, the target text is a media content name, and the corresponding classification label is a media content type label. The computer program further causes the processor to perform the following steps: obtaining the media content type label corresponding to a target user identifier; querying the media content names corresponding to the obtained media content type label; and pushing the queried media content names and the corresponding media content according to the target user identifier.

In one embodiment, a storage medium storing a computer program is provided. When executed by a processor, the computer program causes the processor to perform the following steps: obtaining a target text; performing word segmentation on the target text to obtain multiple word segments; taking, from the multiple word segments, word segments that number no more than a preset quantity threshold and are continuous in the target text, to constitute portmanteau word segments, the preset quantity threshold being the maximum number of word segments that constitute one portmanteau word segment; and generating the feature vector of the target text according to each word segment and each portmanteau word segment.

It should be understood that the steps in the embodiments of the present application are not necessarily executed in the order indicated by the step numbers. Unless expressly stated otherwise herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

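1. A text processing method, the method comprising:

Obtaining a target text;

Performing word segmentation on the target text to obtain multiple word segments;

Taking, from the multiple word segments, word segments that number no more than a preset quantity threshold and are continuous in the target text, to constitute portmanteau word segments; the preset quantity threshold being the maximum number of word segments that constitute one portmanteau word segment;

Generating a feature vector of the target text according to each word segment and each portmanteau word segment.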

2. The method according to claim 1, characterized in that taking, from the multiple word segments, word segments that number no more than the preset quantity threshold and are continuous in the target text, to constitute portmanteau word segments, comprises:

Obtaining the preset quantity threshold;

Successively choosing integers in the range greater than 1 and less than or equal to the preset quantity threshold as reference selection quantities;

Taking, from the multiple word segments, word segments that are continuous in the target text according to each reference selection quantity, to constitute portmanteau word segments.

3. The method according to claim 2, characterized in that taking, from the multiple word segments, word segments that are continuous in the target text according to each reference selection quantity, to constitute portmanteau word segments, comprises:

Choosing each word segment in turn as the current word segment, starting from the first word segment among the multiple word segments;

For each reference selection quantity, choosing, starting from the current word segment, word segments that are continuous in the target text;

Combining the current word segment and the correspondingly chosen word segments into a portmanteau word segment.

4. The method according to claim 1, characterized in that generating the feature vector of the target text according to each word segment and each portmanteau word segment comprises:

Obtaining the statistical feature value corresponding to each word segment and each portmanteau word segment;

Determining the word in a dictionary that matches each word segment and each portmanteau word segment;

Adding each statistical feature value, as a vector element, to a vector template at the position corresponding to the matched word, and setting the vector elements at the positions in the vector template that correspond to dictionary words not matched by any word segment or portmanteau word segment to a default value, to obtain the feature vector of the target text; the positions in the vector template corresponding one-to-one to the words in the dictionary.

5. The method according to any one of claims 1 to 4, characterized by further comprising:

Inputting the feature vector of the target text into a classification model and outputting the corresponding classification label;

Marking the target text with the classification label.

6. The method according to claim 5, characterized by further comprising:

Obtaining sample texts and corresponding classification labels;

Generating the feature vector corresponding to each sample text;

Performing machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels, to obtain the classification model.

7. The method according to claim 5, characterized in that the target text is a social group name, and the corresponding classification label is a group purpose label;

The method further comprises:

Performing statistical analysis on the group purpose labels in the social network platform;

Screening the group purpose labels according to the result of the statistical analysis;

Recommending, in the social groups corresponding to the selected group purpose labels, information corresponding to the selected group purpose labels.

8. The method according to claim 5, characterized in that the target text is a media content name, and the corresponding classification label is a media content type label;

The method further comprises:

Obtaining the media content type label corresponding to a target user identifier;

Querying the media content names corresponding to the obtained media content type label;

Pushing the queried media content names and the corresponding media content according to the target user identifier.

9. A text processing apparatus, characterized in that the apparatus comprises:

A word segmentation module, configured to obtain a target text and perform word segmentation on the target text to obtain multiple word segments;

A combination module, configured to take, from the multiple word segments, word segments that number no more than a preset quantity threshold and are continuous in the target text, to constitute portmanteau word segments; the preset quantity threshold being the maximum number of word segments that constitute one portmanteau word segment;

A vector generation module, configured to generate the feature vector of the target text according to each word segment and each portmanteau word segment.

10. The apparatus according to claim 9, characterized in that the combination module is further configured to obtain the preset quantity threshold; successively choose integers in the range greater than 1 and less than or equal to the preset quantity threshold as reference selection quantities; and take, from the multiple word segments, word segments that are continuous in the target text according to each reference selection quantity, to constitute portmanteau word segments.

11. The apparatus according to claim 10, characterized in that the combination module is further configured to choose each word segment in turn as the current word segment, starting from the first word segment among the multiple word segments; for each reference selection quantity, choose, starting from the current word segment, word segments that are continuous in the target text; and combine the current word segment and the correspondingly chosen word segments into a portmanteau word segment.

12. The apparatus according to any one of claims 9 to 11, characterized in that the apparatus further comprises:

A classification module, configured to input the feature vector of the target text into a classification model and output the corresponding classification label, and to mark the target text with the classification label.

13. The apparatus according to claim 12, characterized in that the target text is a social group name, and the corresponding classification label is a group purpose label;

The apparatus further comprises:

An information recommendation module, configured to perform statistical analysis on the group purpose labels in the social network platform; screen the group purpose labels according to the result of the statistical analysis; and recommend, in the social groups corresponding to the selected group purpose labels, information corresponding to the selected group purpose labels.

14. Computer equipment, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.

15. A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.