CN110020420A - Text handling method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN110020420A (CN201810023358.4A)
- Authority
- CN
- China
- Prior art keywords
- word segment
- word
- target text
- segment
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application relates to a text processing method, apparatus, computer device, and storage medium. The method comprises: obtaining a target text; performing word segmentation on the target text to obtain multiple word segments; taking, from the multiple word segments, groups of word segments that are consecutive in the target text and whose count is less than or equal to a preset quantity threshold, each group forming a combined word segment, where the preset quantity threshold is the maximum number of word segments that may form one combined word segment; and generating a feature vector of the target text from each word segment and each combined word segment. The scheme of this application can represent the semantic information of the target text more accurately.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a text processing method, apparatus, computer device, and storage medium.
Background
With the development of science and technology, the demands placed on intelligent text processing keep rising.
In conventional methods, after a text is segmented into words, the semantic information of the text is obtained directly from the segmentation result, and subsequent processing is based on that semantic information. However, word segmentation often loses part of the meaning of the text, so the semantic information obtained directly from the segmentation result in conventional methods is not accurate enough.
Summary of the invention
In view of this, to address the problem that the semantic information of a text obtained directly from the segmentation result in conventional methods is not accurate enough, it is necessary to provide a text processing method, apparatus, computer device, and storage medium.
A text processing method, the method comprising:
obtaining a target text;
performing word segmentation on the target text to obtain multiple word segments;
taking, from the multiple word segments, word segments that are consecutive in the target text and number no more than a preset quantity threshold, to form combined word segments, the preset quantity threshold being the maximum number of word segments that form one combined word segment; and
generating a feature vector of the target text from each word segment and each combined word segment.
A text processing apparatus, the apparatus comprising:
a word segmentation module, configured to obtain a target text and perform word segmentation on the target text to obtain multiple word segments;
a combination module, configured to take, from the multiple word segments, word segments that are consecutive in the target text and number no more than a preset quantity threshold, to form combined word segments, the preset quantity threshold being the maximum number of word segments that form one combined word segment; and
a vector generation module, configured to generate a feature vector of the target text from each word segment and each combined word segment.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining a target text;
performing word segmentation on the target text to obtain multiple word segments;
taking, from the multiple word segments, word segments that are consecutive in the target text and number no more than a preset quantity threshold, to form combined word segments, the preset quantity threshold being the maximum number of word segments that form one combined word segment; and
generating a feature vector of the target text from each word segment and each combined word segment.
A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining a target text;
performing word segmentation on the target text to obtain multiple word segments;
taking, from the multiple word segments, word segments that are consecutive in the target text and number no more than a preset quantity threshold, to form combined word segments, the preset quantity threshold being the maximum number of word segments that form one combined word segment; and
generating a feature vector of the target text from each word segment and each combined word segment.
With the above text processing method, apparatus, computer device, and storage medium, after the target text is segmented into word segments, word segments that are consecutive in the target text and number no more than the preset quantity threshold are taken from the multiple word segments to form combined word segments. A combined word segment reflects the order of the word segments in the target text. The feature vector of the target text is generated from each word segment and each combined word segment, so it contains both the vector features obtained from word segmentation and the order features between word segments, and can therefore represent the semantic information of the target text more accurately.
Brief description of the drawings
Fig. 1 is a flow diagram of the text processing method in one embodiment;
Fig. 2 is a schematic diagram of feature vector generation in one embodiment;
Fig. 3 is a schematic diagram of the text processing method in one embodiment;
Fig. 4 and Fig. 5 are diagrams showing the effect of classification using the text processing method of this application in one embodiment;
Fig. 6 is a flow diagram of the text processing method in another embodiment;
Fig. 7 is a block diagram of the text processing apparatus in one embodiment;
Fig. 8 is a block diagram of the text processing apparatus in another embodiment;
Fig. 9 is a block diagram of the text processing apparatus in another embodiment;
Fig. 10 is a schematic diagram of the internal structure of the computer device in one embodiment;
Fig. 11 is a schematic diagram of the internal structure of the computer device in one embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Fig. 1 is a flow diagram of the text processing method in one embodiment. This embodiment is illustrated mainly by applying the text processing method to a computer device, which may be a terminal or a server. Referring to Fig. 1, the method specifically includes the following steps:
S102: obtain a target text.
The target text is the text to be represented by a feature vector. In one embodiment, the target text may be a short text, that is, a text of limited length, usually within 100 words.
In one embodiment, the target text may include short texts such as a media content name, a social group name on a social network platform, or a personalized signature or status message published on a social network platform. A social network platform is a platform for socializing over a network. It may include an instant messaging platform (for example, WeChat, an application released by Tencent that provides instant messaging services for smart terminals).
It will be appreciated that the target text may also be a long text, such as an article or media content.
S104: perform word segmentation on the target text to obtain multiple word segments.
Word segmentation divides the text content of the target text into multiple word segments.
In one embodiment, the computer device may perform contextual semantic analysis on the target text and segment the target text according to the result of that analysis.
In one embodiment, the computer device may match the words in the target text against a preset dictionary and segment the target text according to the matching result.
For example, if the target text is "Zhanjiang people in Guangzhou", segmenting it yields the word segments "Zhanjiang", "people", "in", and "Guangzhou".
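The dictionary-matching embodiment does not say which matching strategy is used. As a minimal sketch, assuming forward maximum matching (a common dictionary-based strategy, swapped in here for illustration), segmentation could look like the following. The function name `segment_by_dictionary`, the `max_len` parameter, and the toy dictionary are hypothetical, and the romanized, unspaced string stands in for unspaced Chinese text.

```python
def segment_by_dictionary(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word.

    This is an assumed strategy; the source only says that the text is
    matched against a preset dictionary.
    """
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                segments.append(candidate)
                i += length
                break
    return segments


# Hypothetical toy dictionary and an unspaced target text.
toy_dictionary = {"zhanjiang", "people", "in", "guangzhou"}
toy_segments = segment_by_dictionary(
    "zhanjiangpeopleinguangzhou", toy_dictionary, max_len=9
)
print(toy_segments)
```

Characters that match no dictionary entry fall through as single-character segments, which keeps the loop from stalling on unknown input.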
S106: from the multiple word segments, take word segments that are consecutive in the target text and number no more than a preset quantity threshold, to form combined word segments.
The preset quantity threshold is the maximum number of word segments that form one combined word segment.
Word segments that are consecutive in the target text are word segments adjacent to one another in the target text. For example, if the target text is "Zhanjiang people in Guangzhou", the word segments "Zhanjiang" and "people" are consecutive in the target text, whereas "Zhanjiang" and "in" are not, because "people" lies between them.
Specifically, the computer device may take, from the multiple word segments, groups of word segments that satisfy two conditions: the number of word segments in the group is less than or equal to the preset quantity threshold, and the word segments are consecutive in the target text.
In one embodiment, after segmenting the target text into multiple word segments, the computer device may also label each word segment with its part of speech and, according to the part-of-speech labels, select key word segments from the word segments obtained by segmentation. For example, auxiliary words contribute little to expressing the meaning of the target text, so word segments carrying an auxiliary-word label can be removed from the target text, leaving the key word segments. The computer device may then take, from the key word segments, word segments that are consecutive in the target text and number no more than the preset quantity threshold.
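The part-of-speech filtering step can be sketched as follows. This is only an illustration under stated assumptions: the tag names (`"aux"`, `"noun"`), the function `keep_key_segments`, and the sample input are hypothetical, and the tagging itself is taken as already done by some tagger that the source does not name.

```python
def keep_key_segments(tagged_segments, dropped_tags=frozenset({"aux"})):
    """Return the word segments whose part-of-speech tag is not in dropped_tags.

    tagged_segments is a list of (word, tag) pairs in text order, so the
    result preserves the original order of the remaining segments.
    """
    return [word for word, tag in tagged_segments if tag not in dropped_tags]


# Hypothetical tagged input: one auxiliary word between two nouns.
tagged = [("book", "noun"), ("of", "aux"), ("poems", "noun")]
key_segments = keep_key_segments(tagged)
print(key_segments)
```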
The computer device may splice the taken word segments in the order in which they appear in the target text to obtain a combined word segment.
For example, for the combination {"Zhanjiang", "people"}, "Zhanjiang" precedes "people" in the target text "Zhanjiang people in Guangzhou", so splicing the word segments in that order yields the combined word segment {Zhanjiang people}.
S108: generate the feature vector of the target text from each word segment and each combined word segment.
Specifically, the computer device may obtain a pre-stored dictionary, match each word segment and each combined word segment against it, and generate the feature vector of the target text according to the matching result. The dictionary collects words without duplicates. It will be appreciated that the pre-stored dictionary is obtained by updating an initial dictionary with word segments and combined word segments.
With the above text processing method, after the target text is segmented into word segments, word segments that are consecutive in the target text and number no more than the preset quantity threshold are taken from the multiple word segments to form combined word segments. A combined word segment reflects the order of the word segments in the target text. The feature vector of the target text is generated from each word segment and each combined word segment, so it contains both the vector features obtained from word segmentation and the order features between word segments, and can represent the semantic information of the target text more accurately.
In one embodiment, step S106 includes: obtaining the preset quantity threshold; successively choosing each integer in the range greater than 1 and less than or equal to the preset quantity threshold as a reference selection quantity; and, for each reference selection quantity, taking that many consecutive word segments of the target text from the multiple word segments to form combined word segments.
The preset quantity threshold is the maximum number of word segments that form one combined word segment.
Specifically, the computer device may successively choose each integer in the range greater than 1 and less than or equal to the preset quantity threshold as a reference selection quantity. For each reference selection quantity, the computer device may take consecutive word segments of the target text from the multiple word segments to form combined word segments.
It will be appreciated that taking consecutive word segments according to a reference selection quantity means that the number of word segments taken equals the reference selection quantity and that the word segments taken are consecutive in the target text.
For example, if the preset quantity threshold is 4, successively choosing integers in the range greater than 1 and less than or equal to 4 gives the reference selection quantities 2, 3, and 4. The computer device may then take 2 consecutive word segments, 3 consecutive word segments, and 4 consecutive word segments from the multiple word segments.
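The range of reference selection quantities and the windows taken for one such quantity can be sketched as below. The helper names `reference_counts` and `windows_of_length` are hypothetical, and the English glosses stand in for the word segments of the running example.

```python
def reference_counts(threshold):
    """Integers greater than 1 and less than or equal to the threshold."""
    return list(range(2, threshold + 1))


def windows_of_length(segments, n):
    """All runs of n word segments that are consecutive in the text."""
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]


example_segments = ["Zhanjiang", "people", "in", "Guangzhou"]
print(reference_counts(4))
for n in reference_counts(4):
    print(n, windows_of_length(example_segments, n))
```

With four word segments, a reference selection quantity of 2 gives three windows, 3 gives two, and 4 gives one.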
In the above embodiment, each integer in the range greater than 1 and less than or equal to the preset quantity threshold is successively chosen as a reference selection quantity, and consecutive word segments of the target text are taken from the multiple word segments for each reference selection quantity. This yields a more complete set of combined word segments carrying order features, so that the resulting feature vector of the target text contains both the vector features obtained from word segmentation and the order features between word segments, and represents the semantic information of the target text more accurately.
In one embodiment, taking consecutive word segments of the target text from the multiple word segments for each reference selection quantity to form combined word segments includes: choosing a current word segment one at a time, starting from the first of the multiple word segments; for each reference selection quantity, choosing the word segments that consecutively follow the current word segment in the target text; and forming a combined word segment from the current word segment and the word segments chosen for it.
The starting word segment is the word segment located first in the target text. Note that choosing the current word segment one at a time means choosing it in ascending order of the positions of the word segments in the target text.
Specifically, the computer device may choose the current word segment one at a time from the multiple word segments, beginning with the starting word segment; for each reference selection quantity, choose the word segments that consecutively follow the current word segment in the target text; and form a combined word segment from the current word segment and the word segments chosen for it. It will be appreciated that the number of word segments chosen plus one (for the current word segment) equals the reference selection quantity.
It will be appreciated that after choosing consecutive word segments for one current word segment, the computer device may take the next word segment, in ascending order of position in the target text, as the current word segment and repeat the steps above, iterating in this way until a selection stop condition is met.
The selection stop condition is the condition under which the selection of word segments stops. It may be that the number of combined word segments reaches a preset threshold, that a preset number of word segments have each been processed as the current word segment, or that the iteration above has run to completion.
Note that when a later word segment serves as the current word segment, the current word segment together with the word segments after it may be too few to satisfy some of the reference selection quantities. In that case, those reference selection quantities are skipped and no corresponding processing is performed.
For example, suppose the target text is "Zhanjiang people in Guangzhou", the word segments are "Zhanjiang", "people", "in", and "Guangzhou", and the preset quantity threshold is 4, so the reference selection quantities are 2, 3, and 4. Because the current word segment plus the adjacent word segments chosen after it must total the reference selection quantity, for reference selection quantities 2, 3, and 4 the computer device chooses the 1, 2, and 3 word segments that follow "Zhanjiang", giving {"Zhanjiang", "people"}, {"Zhanjiang", "people", "in"}, and {"Zhanjiang", "people", "in", "Guangzhou"}. The computer device then takes the next word segment, "people", as the current word segment. Only 2 word segments remain after "people", so "people" plus the remaining 2 word segments cannot satisfy the reference selection quantity 4, and that quantity is skipped. The computer device chooses the 1 and 2 word segments that follow "people", giving {"people", "in"} and {"people", "in", "Guangzhou"}. Next, the computer device takes "in" as the current word segment and chooses the 1 word segment that follows it, "Guangzhou", giving {"in", "Guangzhou"}. Because "Guangzhou" is the last word segment, no word segment follows it when it serves as the current word segment, so no processing is performed.
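The walk-through above can be sketched as a loop over current word segments with an inner loop over reference selection quantities. The function name `combined_segments` and the `joiner` parameter are hypothetical. Splicing with a space is an assumption made for the English glosses; unspaced text would presumably be spliced with an empty string.

```python
def combined_segments(segments, threshold, joiner=" "):
    """Build combined (portmanteau) word segments in the order of the worked example.

    For each current word segment, in text order, try each reference
    selection quantity n = 2..threshold and splice the n consecutive
    segments starting at the current one. Quantities that run past the
    end of the text are skipped.
    """
    combos = []
    for start in range(len(segments)):
        for n in range(2, threshold + 1):
            if start + n > len(segments):
                break  # this and all larger quantities cannot be satisfied
            combos.append(joiner.join(segments[start:start + n]))
    return combos


words = ["Zhanjiang", "people", "in", "Guangzhou"]
for combo in combined_segments(words, threshold=4):
    print(combo)
```

The six printed combinations match the six sets listed in the example, and the last word segment contributes none.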
In the above embodiment, the current word segment is chosen one at a time starting from the first of the multiple word segments; for each reference selection quantity, the word segments that consecutively follow the current word segment in the target text are chosen; and the current word segment and the word segments chosen for it form a combined word segment. The resulting combined word segments carry order features and preserve continuity, which avoids the damage to meaning that arbitrary combination would cause. The combined word segments therefore represent the target text more accurately, and the feature vector obtained from the combined word segments and the word segments is more accurate as well.
In one embodiment, step S108 includes: obtaining the statistical feature value of each word segment and each combined word segment; determining the word in the dictionary that each word segment and each combined word segment matches; and adding each statistical feature value, as a vector element, at the position in a vector template that corresponds to the matched word, while setting to a default value the vector elements at positions corresponding to dictionary words that matched no word segment or combined word segment, to obtain the feature vector of the target text. The positions in the vector template correspond one-to-one with the words in the dictionary.
A statistical feature value is a value, determined statistically, that characterizes a word segment or combined word segment. It may be the term frequency (TF), the inverse document frequency (IDF), or the term frequency-inverse document frequency (TF-IDF). The term frequency is how often a word occurs in the text it belongs to. The inverse document frequency measures the general importance of a word; for a given word, it is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient. TF-IDF is the product of the term frequency and the inverse document frequency.
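The three statistics described above can be sketched as follows. Two points are assumptions rather than statements from the source: term frequency is computed as a raw occurrence count (the source does not say whether it is normalized), and the logarithm is the natural logarithm (the base is not specified). The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_frequency(term, terms):
    """Raw count of term in one text's list of terms (assumed, not normalized)."""
    return Counter(terms)[term]


def inverse_document_frequency(term, documents):
    """log(total documents / documents containing the term), natural log assumed."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(len(documents) / containing)


def tf_idf(term, terms, documents):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(term, terms) * inverse_document_frequency(term, documents)


# Hypothetical toy corpus: each document is a list of terms.
corpus = [["a", "b"], ["a", "c"], ["b", "d"], ["e"]]
print(tf_idf("a", ["a", "b", "a"], corpus))
```

Here "a" occurs twice in the text and in 2 of the 4 documents, so the value is 2 × ln 2.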
It will be appreciated that the dictionary collects words without duplicates, and that it is obtained by updating an initial dictionary with word segments and combined word segments.
Note that the computer device may analyze the features of each word segment and each combined word segment to obtain the corresponding statistical feature values, or it may directly obtain statistical feature values that already exist for each word segment and each combined word segment.
Specifically, the computer device may match each word segment and each combined word segment against the words in the pre-stored dictionary to obtain the dictionary word that each one matches. The computer device may determine the position in the vector template that corresponds to each matched word and add each statistical feature value, as a vector element, at the position corresponding to its matched word. It may also determine the dictionary words that matched no word segment or combined word segment and set the vector elements at the positions corresponding to those words to a default value, thereby obtaining the feature vector of the target text. The positions in the vector template correspond one-to-one with the words in the dictionary. In one embodiment, the default value may be 0.
In one embodiment, obtaining the statistical feature value of each word segment and each combined word segment includes: calculating the term frequency of each word segment and each combined word segment in the target text. In that case, obtaining the feature vector of the target text includes: adding each term frequency, as a vector element, at the position in the vector template that corresponds to the matched word, and setting to the default value the vector elements at positions corresponding to dictionary words that matched no word segment or combined word segment.
Fig. 2 is a schematic diagram of feature vector generation in one embodiment. Referring to Fig. 2, the words in the dictionary are {Zhanjiang, people, in, Guangzhou, United States, Shanghai, Zhanjiang people, Zhanjiang people in, Zhanjiang people in Guangzhou, people in, people in Guangzhou, in Guangzhou}. Suppose the word segments and combined word segments obtained for the target text in the manner described above are "Zhanjiang", "people", "in", "Guangzhou", "Zhanjiang people", "Zhanjiang people in", "Zhanjiang people in Guangzhou", "people in", "people in Guangzhou", and "in Guangzhou". Taking the term frequency of each word segment and combined word segment as the statistical feature value, and setting to 0 the vector elements at the positions corresponding to the unmatched dictionary words "United States" and "Shanghai", the feature vector of the target text is (1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1).
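The Fig. 2 example can be reproduced with a short sketch, assuming the dictionary is an ordered list whose index is the vector position, the statistical feature value is the raw occurrence count, and the default value is 0. The function name `feature_vector` and the English glosses for the dictionary entries are illustrative.

```python
from collections import Counter


def feature_vector(terms, dictionary, default=0):
    """One element per dictionary word, in dictionary order.

    Matched words get their occurrence count among terms; words that
    match no word segment or combined word segment get the default value.
    """
    counts = Counter(terms)
    return [counts[word] if word in counts else default for word in dictionary]


fig2_dictionary = [
    "Zhanjiang", "people", "in", "Guangzhou", "United States", "Shanghai",
    "Zhanjiang people", "Zhanjiang people in", "Zhanjiang people in Guangzhou",
    "people in", "people in Guangzhou", "in Guangzhou",
]
fig2_terms = [
    "Zhanjiang", "people", "in", "Guangzhou",
    "Zhanjiang people", "Zhanjiang people in", "Zhanjiang people in Guangzhou",
    "people in", "people in Guangzhou", "in Guangzhou",
]
print(feature_vector(fig2_terms, fig2_dictionary))
# → [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
```

Because every position is tied to a fixed dictionary word, vectors for different texts are directly comparable element by element.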
In the above embodiment, each word segment and each combined word segment is matched against the pre-stored dictionary; each statistical feature value is added, as a vector element, at the position in the vector template that corresponds to the matched word; and the vector elements at positions corresponding to unmatched dictionary words are set to the default value, giving the feature vector of the target text. The resulting feature vector contains both the vector features obtained from word segmentation and the order features between word segments, and can represent the semantic information of the target text more accurately.
In one embodiment, the method further includes: inputting the feature vector of the target text into a classification model, which outputs a corresponding classification label; and labeling the target text with the classification label.
The classification model is obtained in advance by machine learning training on the feature vectors of sample texts and their corresponding classification labels. Given the feature vector of an input text, the classification model outputs the corresponding classification label, which determines the category of the object that the text identifies. For example, given an input social group name, it outputs the corresponding classification label, which determines the category of the social group identified by that name.
Specifically, the computer device may input the feature vector of the target text into the classification model, which outputs the corresponding classification label. The computer device may then label the target text with that classification label.
In the above embodiment, the classification label of the target text is obtained from a feature vector that expresses the target text more accurately, which improves the accuracy of classifying the target text.
In one embodiment, the method includes a classification model training step, which specifically includes: obtaining sample texts and their corresponding classification labels; generating the feature vector corresponding to each sample text; and performing machine learning training on the feature vectors of the sample texts and their corresponding classification labels to obtain the classification model.
Specifically, the computer device may obtain sample texts and their corresponding classification labels. The computer device may analyze the features of each sample text to generate the corresponding feature vector, and then perform machine learning training on the feature vectors of the sample texts and their corresponding classification labels to obtain the classification model. Given the feature vector of an input text, the classification model outputs the corresponding classification label, which determines the category of the object that the text identifies.
In one embodiment, the computer device may perform the machine learning training on the feature vectors of the sample texts and their corresponding classification labels using a multinomial Bayesian classification algorithm (multinomial Bayesian classifier) or a neural network algorithm (neural networks) to obtain the classification model.
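As an illustration of the multinomial Bayesian option, the following is a minimal pure-Python sketch of a multinomial naive Bayes classifier over count feature vectors. The source does not give smoothing, feature layout, or training details, so the Laplace smoothing parameter `alpha`, the function names, and the toy data are all assumptions; this is not presented as the patent's implementation.

```python
import math
from collections import defaultdict


def train_multinomial_nb(vectors, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from count vectors."""
    n_features = len(vectors[0])
    totals = defaultdict(lambda: [0.0] * n_features)  # per-class feature sums
    doc_counts = defaultdict(int)                     # per-class sample counts
    for vec, label in zip(vectors, labels):
        doc_counts[label] += 1
        for j, value in enumerate(vec):
            totals[label][j] += value
    log_priors = {c: math.log(n / len(labels)) for c, n in doc_counts.items()}
    log_likelihoods = {}
    for c, counts in totals.items():
        denom = sum(counts) + alpha * n_features  # Laplace smoothing (assumed)
        log_likelihoods[c] = [math.log((v + alpha) / denom) for v in counts]
    return log_priors, log_likelihoods


def predict(model, vec):
    """Return the label with the highest posterior log score."""
    log_priors, log_likelihoods = model

    def score(c):
        return log_priors[c] + sum(x * w for x, w in zip(vec, log_likelihoods[c]))

    return max(log_priors, key=score)


# Hypothetical training data: three dictionary positions, two labels.
sample_vectors = [[2, 0, 0], [1, 0, 1], [0, 2, 0], [0, 1, 1]]
sample_labels = ["work", "work", "family", "family"]
nb_model = train_multinomial_nb(sample_vectors, sample_labels)
print(predict(nb_model, [1, 0, 0]), predict(nb_model, [0, 3, 0]))
```

In practice a library classifier would normally replace this hand-written version; the sketch only shows how count vectors like those from step S108 feed such a model.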
In one embodiment, the target text is a social group name and the corresponding classification label is a group purpose label. The method further includes: performing statistical analysis on the group purpose labels in a social network platform; screening the group purpose labels according to the result of the statistical analysis; and recommending, in the social groups corresponding to the screened group purpose labels, information corresponding to those labels.
A group purpose label is a label that characterizes the purpose of a social group. Group purpose labels may include labels such as work, family, marketing, classmates, education, reading, or shopping. A social network platform is a platform for socializing over a network. It may include an instant messaging platform (for example, WeChat, an application released by Tencent that provides instant messaging services for smart terminals) and a social information sharing platform. A social information sharing platform is a platform where socializing takes place through sharing information, for example a microblog, blog, forum, or discussion board.
Specifically, the computer device may perform statistical analysis on the group purpose labels in the social network platform and screen the group purpose labels according to the result. In one embodiment, the computer device may count the number of social groups corresponding to each group purpose label and screen the labels by that number; for example, it may select the group purpose labels whose social group counts rank within a preset number of top positions. It can be understood that the number of social groups corresponding to a group purpose label is the number of social groups marked with that label. If a social group has a corresponding group purpose label, the purpose of that social group is the purpose characterized by the label.
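The counting and screening described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `top_group_purpose_labels`, the group identifiers and the tie-breaking rule (alphabetical order among equal counts) are all assumptions made for the example.

```python
from collections import Counter

def top_group_purpose_labels(group_labels, top_n):
    """Return the top_n labels ranked by how many social groups carry them."""
    counts = Counter(group_labels.values())
    # Sort by count (descending), then by label name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:top_n]]

# Hypothetical data: social group id -> group purpose label.
group_labels = {
    "g1": "work", "g2": "reading", "g3": "work",
    "g4": "shopping", "g5": "reading", "g6": "work",
}
selected = top_group_purpose_labels(group_labels, top_n=2)
print(selected)
```

Here `top_n` plays the role of the preset number of top positions.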
The computer device may obtain corresponding information for each selected group purpose label. The obtained information matches the group purpose characterized by the label and serves that purpose. The computer device may then recommend the obtained information in the social groups corresponding to the selected group purpose label.
For example, if the selected group purpose labels include a reading label, articles corresponding to the reading label may be obtained and recommended in the social groups corresponding to the reading label. As another example, if the selected group purpose labels include a shopping label, resource promotion information (for example, advertising information) corresponding to the shopping label may be obtained and recommended in the social groups corresponding to the shopping label.
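A minimal sketch of this recommendation step, assuming the group labels and the per-label information are available as in-memory mappings. The names `recommend_by_label`, `info_by_label` and the sample identifiers are hypothetical and do not come from the source.

```python
def recommend_by_label(group_labels, selected_labels, info_by_label):
    """Map each group carrying a selected label to the information for that label."""
    recommendations = {}
    for group_id, label in group_labels.items():
        if label in selected_labels and label in info_by_label:
            recommendations[group_id] = info_by_label[label]
    return recommendations

# Hypothetical data: group id -> label, and label -> information to recommend.
group_labels = {"g1": "reading", "g2": "shopping", "g3": "family"}
info_by_label = {"reading": ["article-1"], "shopping": ["ad-1"]}

recommendations = recommend_by_label(group_labels, {"reading", "shopping"}, info_by_label)
print(recommendations)
```

Groups whose label was not selected (here `g3`) receive nothing.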
In the above embodiment, the group purpose label is obtained from a feature vector that expresses the social group name more accurately, so the label more accurately represents the purpose of the social group denoted by that name. The group purpose labels selected by statistical analysis are therefore more accurate as well, and recommending the corresponding information in the social groups corresponding to the selected labels improves the accuracy of information recommendation.
In one embodiment, the target text is a media content name, and the corresponding classification label is a media content type label. The method further includes: obtaining the media content type label corresponding to a target user identifier; querying the media content names corresponding to the obtained media content type label; and pushing the queried media content names and the corresponding media content according to the target user identifier.
Here, a media content name is the name given to a piece of media content. Media content is information content that can be transmitted and disseminated. A media content type label is a label that characterizes the type of media content. Media content types may include sports, entertainment, culture, politics, military affairs and the like. The target user identifier is the identifier of the user to whom media content push information is to be pushed.
It can be understood that the media content type label corresponding to the target user identifier characterizes the type of media content in which the user denoted by that identifier is interested.
Specifically, the computer device stores a correspondence between user identifiers and media content type labels. The computer device may obtain the target user identifier and, according to this correspondence, obtain the media content type label corresponding to the target user identifier. The computer device may also look up the media content type label corresponding to the target user identifier in a database or on another device.
The computer device stores in advance a correspondence between media content names and media content type labels. In one embodiment, each media content name in the computer device is marked with its media content type label. According to this correspondence, the computer device may query the media content names corresponding to the obtained media content type label.
The computer device may obtain the media content corresponding to the queried media content names. It can be understood that the computer device may store in advance a correspondence between media content names and media content and, according to this correspondence, obtain the media content corresponding to the queried media content names. The computer device may also obtain that media content from a database or from another device.
The computer device may push the queried media content names and the corresponding media content according to the target user identifier. In one embodiment, the computer device may push the media content push information to the terminal corresponding to the target user identifier.
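The three lookups described above (user identifier to type labels, type label to content names, content name to content) can be sketched with plain dictionaries. This is only an illustration under the assumption that all correspondences are held in memory; `build_push_list` and every identifier in the sample data are hypothetical.

```python
def build_push_list(user_id, user_to_types, name_to_type, name_to_content):
    """Collect (media content name, media content) pairs matching the user's type labels."""
    wanted_types = user_to_types.get(user_id, set())
    pushes = []
    for name, type_label in name_to_type.items():
        # Keep only names whose type label the user is interested in
        # and for which the media content itself can be found.
        if type_label in wanted_types and name in name_to_content:
            pushes.append((name, name_to_content[name]))
    return pushes

# Hypothetical correspondences stored on the computer device.
user_to_types = {"u1": {"sports"}}
name_to_type = {
    "Cup final recap": "sports",
    "Film awards night": "entertainment",
    "League preview": "sports",
}
name_to_content = {
    "Cup final recap": "content-001",
    "Film awards night": "content-003",
    "League preview": "content-002",
}

pushes = build_push_list("u1", user_to_types, name_to_type, name_to_content)
print(pushes)
```

In the embodiment above, the type label attached to each name would come from the classification model rather than from a hand-written mapping.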
In the above embodiment, the media content type label is obtained from a feature vector that expresses the media content name more accurately, so the label more accurately represents the type of the media content denoted by that name. The media content names obtained according to the media content type label corresponding to the target user identifier are therefore more accurate and better meet the user's needs. In turn, pushing these media content names and the corresponding media content according to the target user identifier improves the accuracy of the media content pushed to the target user.
Fig. 3 is a schematic diagram of the principle of the text processing method in one embodiment. Referring to Fig. 3, the computer device performs word segmentation on the target text and obtains word segments 1 to 3. The preset quantity threshold is 3, so the quantities 2 and 3 serve as reference selection quantities. Selecting continuous word segments from the 3 word segments according to reference selection quantity 2 yields word segments 1 and 2 and word segments 2 and 3; word segments 1 and 2 form portmanteau word segment A, and word segments 2 and 3 form portmanteau word segment B. Selecting continuous word segments according to reference selection quantity 3 yields word segments 1, 2 and 3, which form portmanteau word segment C. The computer device may match word segments 1 to 3 and portmanteau word segments A to C against the words in the dictionary and determine the positions of the matched words in the vector template. The computer device may then obtain the statistical feature values of word segments 1 to 3 and portmanteau word segments A to C and fill each value into the vector template at the position of the corresponding matched word.
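The combination step of Fig. 3 can be reproduced in a few lines. The sketch below is an illustration only; the function name `portmanteau_segments` and the placeholder segments `w1` to `w3` are assumptions, and each portmanteau word segment is represented as a tuple of its constituent word segments.

```python
def portmanteau_segments(segments, max_count):
    """Build every run of 2..max_count consecutive word segments, in text order."""
    combos = []
    for n in range(2, max_count + 1):              # reference selection quantities
        for start in range(len(segments) - n + 1):  # each possible starting segment
            combos.append(tuple(segments[start:start + n]))
    return combos

segments = ["w1", "w2", "w3"]   # word segments 1 to 3
combos = portmanteau_segments(segments, max_count=3)
print(combos)
# A = ("w1", "w2"), B = ("w2", "w3"), C = ("w1", "w2", "w3")
```

With three word segments and a threshold of 3, exactly the three combinations A, B and C of Fig. 3 are produced.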
Fig. 4 and Fig. 5 are diagrams of the classification effect obtained with the text processing method of this application in one embodiment. Referring to Fig. 4, Fig. 4 is a cross table of the probabilities that a text belongs to each classification when classification is performed on text vectors generated by a traditional word segmentation method. Taking the first row as an example: "classification 1" in the first row indicates the classification to which the text actually belongs, and the percentages in the first row are the probabilities, obtained by classifying the feature vectors generated by traditional word segmentation, that the text belongs to each classification. For example, the probability of belonging to "classification 1" in the first column is 93.43%, the probability of belonging to "classification 2" in the second column is 0.13%, and so on. Fig. 5 is the corresponding cross table when classification is performed on feature vectors obtained by the text processing method of the embodiments of this application. Again taking the first row as an example: "classification 1" in the first row indicates the classification to which the text actually belongs, and the percentages in the first row are the probabilities, obtained by classifying the feature vectors produced by the text processing method of the embodiments of this application, that the text belongs to each classification. For example, the probability of belonging to "classification 1" in the first column is 94.05%, the probability of belonging to "classification 2" in the second column is 0.12%, and so on. The probabilities of matching the correct classification in Fig. 5 are mostly higher than those in Fig. 4. For example, for texts that actually belong to "classification 1", the probability of being assigned to "classification 1" is 94.05% in Fig. 5, higher than the 93.43% in Fig. 4. That is, classification based on feature vectors obtained by the text processing method provided in the embodiments of this application is more accurate than classification based on text vectors generated by a traditional word segmentation method.
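A cross table of this kind can be computed from pairs of actual and predicted classifications by normalizing each row to percentages. The sketch below shows one plausible way to do so; the function name `cross_table` and the sample pairs are hypothetical and are not the data behind Fig. 4 or Fig. 5.

```python
from collections import Counter, defaultdict

def cross_table(pairs):
    """Return {actual: {predicted: percentage}} with every row summing to 100."""
    counts = defaultdict(Counter)
    for actual, predicted in pairs:
        counts[actual][predicted] += 1
    table = {}
    for actual, row in counts.items():
        total = sum(row.values())
        table[actual] = {pred: 100.0 * n / total for pred, n in row.items()}
    return table

# Hypothetical (actual classification, predicted classification) pairs.
pairs = [("c1", "c1")] * 3 + [("c1", "c2")] + [("c2", "c2")] * 2
table = cross_table(pairs)
print(table)
```

The diagonal entries (for example `table["c1"]["c1"]`) correspond to the probability of matching the correct classification that the two figures compare.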
As shown in Fig. 6, in one embodiment another text processing method is provided. The method specifically includes the following steps:
S602, obtain a social group name; perform word segmentation on the social group name to obtain multiple word segments.
S604, obtain a preset quantity threshold; the preset quantity threshold is the maximum number of word segments that form one portmanteau word segment.
S606, within the range greater than 1 and less than or equal to the preset quantity threshold, select integers in turn as reference selection quantities.
S608, starting from the first word segment among the multiple word segments, select a current word segment one by one; for each reference selection quantity, select word segments that are continuous in the target text starting from the current word segment.
S610, form a portmanteau word segment from the current word segment and the word segments selected accordingly.
S612, obtain the statistical feature value corresponding to each word segment and each portmanteau word segment; determine the word in the dictionary that matches each word segment and each portmanteau word segment.
S614, add each statistical feature value, as a vector element, at the position in the vector template corresponding to the matched word, and set the vector elements at the positions corresponding to dictionary words that are not matched by any word segment or portmanteau word segment to a default value, obtaining the feature vector of the target text.
Here, the positions in the vector template correspond one-to-one to the words in the dictionary.
S616, input the feature vector of the social group name into the classification model and output the corresponding group purpose label; mark the social group name with this group purpose label.
In one embodiment, the method further includes a classification model training step, which specifically includes: obtaining sample texts and the corresponding classification labels; generating the feature vectors corresponding to the sample texts; and performing machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels to obtain the classification model.
S618, perform statistical analysis on the group purpose labels in the social network platform.
S620, screen the group purpose labels according to the result of the statistical analysis and, in the social groups denoted by the social group names corresponding to the selected group purpose labels, recommend information corresponding to the selected group purpose labels.
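Steps S604 to S614 can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the statistical feature value is taken to be a simple occurrence count (the steps above do not fix a particular statistic), dictionary entries are represented as tuples of word segments, and the names `feature_vector`, `dictionary` and the sample segments are hypothetical.

```python
from collections import Counter

def feature_vector(segments, max_count, dictionary, default=0.0):
    """Fill a vector template whose positions correspond one-to-one to dictionary entries."""
    # Single word segments plus all runs of 2..max_count consecutive segments.
    features = [(s,) for s in segments]
    for n in range(2, max_count + 1):
        for start in range(len(segments) - n + 1):
            features.append(tuple(segments[start:start + n]))
    counts = Counter(features)  # assumed statistical feature value: occurrence count
    position = {entry: i for i, entry in enumerate(dictionary)}
    vector = [default] * len(dictionary)  # unmatched positions keep the default value
    for feature, count in counts.items():
        if feature in position:
            vector[position[feature]] = float(count)
    return vector

# Hypothetical dictionary and segmented social group name.
dictionary = [
    ("family",),
    ("work",),
    ("reading", "group"),
    ("family", "reading", "group"),
    ("group", "reading"),
]
vector = feature_vector(["family", "reading", "group"], 3, dictionary)
print(vector)
```

Note that `("group", "reading")` stays at the default value because only combinations in the original text order are generated.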
In the above text processing method, after word segmentation is performed on the target text to obtain word segments, different combinations are taken from the multiple word segments, and the word segments of each combination are merged in the order in which they appear in the target text to obtain a portmanteau word segment. A portmanteau word segment therefore reflects the order of the word segments in the target text. The feature vector of the target text is generated from the word segments and the portmanteau word segments, so it contains both the vector features obtained by word segmentation and the order features between the word segments, and can represent the semantic information of the target text more accurately.
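The value of the order features can be seen with two texts that contain the same word segments in a different order. The sketch below is an illustration only; `segment_features` and the example segments are hypothetical, and a threshold of 2 is assumed for brevity.

```python
def segment_features(segments):
    """Return single word segments plus pairs of consecutive word segments."""
    singles = {(s,) for s in segments}
    pairs = set(zip(segments, segments[1:]))
    return singles | pairs

text_a = ["dog", "bites", "man"]
text_b = ["man", "bites", "dog"]

# The single word segments alone cannot tell the two texts apart ...
same_words = sorted(text_a) == sorted(text_b)
# ... but the combined (order-preserving) features can.
same_features = segment_features(text_a) == segment_features(text_b)
print(same_words, same_features)
```

A vector built only from the single word segments would be identical for both texts, whereas the combined features differ.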
As shown in Fig. 7, in one embodiment a text processing device 700 is provided. The device includes a word segmentation module 702, a combination module 704 and a vector generation module 706, wherein:
The word segmentation module 702 is configured to obtain a target text and perform word segmentation on the target text to obtain multiple word segments.
The combination module 704 is configured to take, from the multiple word segments, word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to form portmanteau word segments; the preset quantity threshold is the maximum number of word segments that form one portmanteau word segment.
The vector generation module 706 is configured to generate the feature vector of the target text according to each word segment and each portmanteau word segment.
In one embodiment, the combination module 704 is further configured to obtain the preset quantity threshold; select integers in turn, within the range greater than 1 and less than or equal to the preset quantity threshold, as reference selection quantities; and take, from the multiple word segments, word segments that are continuous in the target text according to each reference selection quantity, to form portmanteau word segments.
In one embodiment, the combination module 704 is further configured to select a current word segment one by one, starting from the first word segment among the multiple word segments; select, for each reference selection quantity, word segments that are continuous in the target text starting from the current word segment; and form a portmanteau word segment from the current word segment and the word segments selected accordingly.
In one embodiment, the vector generation module 706 is further configured to obtain the statistical feature value corresponding to each word segment and each portmanteau word segment; determine the word in the dictionary that matches each word segment and each portmanteau word segment; add each statistical feature value, as a vector element, at the position in the vector template corresponding to the matched word; and set the vector elements at the positions corresponding to dictionary words that are not matched by any word segment or portmanteau word segment to a default value, obtaining the feature vector of the target text. The positions in the vector template correspond one-to-one to the words in the dictionary.
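One way to picture how the three modules cooperate is a small class with one method per module. This is a sketch under stated assumptions, not the device itself: whitespace splitting stands in for a real word segmenter, the statistical feature value is assumed to be an occurrence count, and the class and method names are hypothetical.

```python
from collections import Counter

class TextProcessingDevice:
    """Illustrative counterpart of modules 702, 704 and 706."""

    def __init__(self, dictionary, max_count, segmenter=str.split):
        self.dictionary = list(dictionary)  # entries are tuples of word segments
        self.max_count = max_count          # preset quantity threshold
        self.segmenter = segmenter          # assumption: whitespace stand-in

    def segment(self, text):
        # Word segmentation module 702.
        return self.segmenter(text)

    def combine(self, segments):
        # Combination module 704: runs of 2..max_count consecutive segments.
        return [
            tuple(segments[start:start + n])
            for n in range(2, self.max_count + 1)
            for start in range(len(segments) - n + 1)
        ]

    def vectorize(self, text):
        # Vector generation module 706: one position per dictionary entry.
        segments = self.segment(text)
        counts = Counter([(s,) for s in segments] + self.combine(segments))
        return [float(counts.get(entry, 0)) for entry in self.dictionary]

device = TextProcessingDevice([("book",), ("book", "club"), ("chess",)], max_count=2)
result = device.vectorize("book club")
print(result)
```

Passing the segmenter in as a parameter keeps the sketch independent of any particular segmentation library.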
As shown in Fig. 8, in one embodiment the device 700 further includes:
A classification module 708, configured to input the feature vector of the target text into the classification model, output the corresponding classification label, and mark the target text with the classification label.
In one embodiment, the classification module 708 is further configured to obtain sample texts and the corresponding classification labels; generate the feature vectors corresponding to the sample texts; and perform machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels to obtain the classification model.
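The description names a multinomial Bayesian classifier as one possible training algorithm. The sketch below is a small pure-Python multinomial naive Bayes with Laplace smoothing, intended only to illustrate training on feature vectors and labels; the function names, the smoothing constant `alpha` and the sample vectors are assumptions, and a production system would more likely rely on an established library.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(vectors, labels, alpha=1.0):
    """Fit log-priors and smoothed per-feature log-likelihoods for each label."""
    n_features = len(vectors[0])
    class_counts = Counter(labels)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for vector, label in zip(vectors, labels):
        sums = feature_sums[label]
        for i, value in enumerate(vector):
            sums[i] += value
    model = {}
    for label, count in class_counts.items():
        sums = feature_sums[label]
        total = sum(sums) + alpha * n_features  # Laplace-smoothed denominator
        model[label] = (
            math.log(count / len(labels)),
            [math.log((s + alpha) / total) for s in sums],
        )
    return model

def predict(model, vector):
    """Return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(v * ll for v, ll in zip(vector, log_likelihood))
    return max(model, key=score)

# Hypothetical feature vectors (one position per dictionary entry) and labels.
vectors = [[1, 0, 0, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 0]]
labels = ["family", "family", "work", "work"]
model = train_multinomial_nb(vectors, labels)
predicted = predict(model, [0, 1, 1, 0])
print(predicted)
```

The vectors passed to `train_multinomial_nb` would be the feature vectors generated from the sample texts, including the positions that correspond to portmanteau word segments.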
As shown in Fig. 9, in one embodiment the target text is a social group name, and the corresponding classification label is a group purpose label. The device 700 further includes:
An information recommendation module 710, configured to perform statistical analysis on the group purpose labels in the social network platform; screen the group purpose labels according to the result of the statistical analysis; and, in the social groups corresponding to the selected group purpose labels, recommend information corresponding to the selected group purpose labels.
In one embodiment, the target text is a media content name, and the corresponding classification label is a media content type label. The information recommendation module 710 is further configured to obtain the media content type label corresponding to a target user identifier; query the media content names corresponding to the obtained media content type label; and push the queried media content names and the corresponding media content according to the target user identifier.
Fig. 10 is a schematic diagram of the internal structure of a computer device in one embodiment. The computer device includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device may store an operating system and a computer program. When executed, the computer program may cause the processor to perform a text processing method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The internal memory may also store a computer program which, when executed by the processor, may cause the processor to perform a text processing method. The network interface of the computer device is used for network communication.
Those skilled in the art will understand that the structure shown in Fig. 10 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
Fig. 11 is a schematic diagram of the internal structure of a computer device in one embodiment. The computer device includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device may store an operating system and a computer program. When executed, the computer program may cause the processor to perform a text processing method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The internal memory may also store a computer program which, when executed by the processor, may cause the processor to perform a text processing method. The network interface of the computer device is used for network communication. The display screen of the computer device may be a liquid crystal display, an electronic ink display or the like. The input device of the computer device may be a touch layer covering the display screen; a key, trackball or trackpad provided on the terminal housing; or an external keyboard, trackpad, mouse or the like. The computer device may be a personal computer, a mobile terminal or an in-vehicle device, where the mobile terminal includes at least one of a mobile phone, a tablet computer, a personal digital assistant, a wearable device and the like.
Those skilled in the art will understand that the structure shown in Fig. 11 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the text processing device provided by this application may be implemented in the form of a computer program that can run on the computer device shown in Fig. 10 or Fig. 11. The non-volatile storage medium of the computer device may store the program modules that make up the text processing device, for example the word segmentation module 702, the combination module 704 and the vector generation module 706 shown in Fig. 7. The computer program composed of these program modules causes the computer device to perform the steps of the text processing method of the embodiments of this application described in this specification. For example, the computer device may, through the word segmentation module 702 of the text processing device 700 shown in Fig. 7, obtain a target text and perform word segmentation on it to obtain multiple word segments, and, through the combination module 704, take from the multiple word segments word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to form portmanteau word segments; the preset quantity threshold is the maximum number of word segments that form one portmanteau word segment. The computer device may, through the vector generation module 706, generate the feature vector of the target text according to each word segment and each portmanteau word segment.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps: obtaining a target text; performing word segmentation on the target text to obtain multiple word segments; taking, from the multiple word segments, word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to form portmanteau word segments, the preset quantity threshold being the maximum number of word segments that form one portmanteau word segment; and generating the feature vector of the target text according to each word segment and each portmanteau word segment.
In one embodiment, taking, from the multiple word segments, word segments that are continuous in the target text and whose number is less than or equal to the preset quantity threshold to form portmanteau word segments includes: obtaining the preset quantity threshold; selecting integers in turn, within the range greater than 1 and less than or equal to the preset quantity threshold, as reference selection quantities; and taking, from the multiple word segments, word segments that are continuous in the target text according to each reference selection quantity, to form portmanteau word segments.
In one embodiment, taking, from the multiple word segments, word segments that are continuous in the target text according to each reference selection quantity to form portmanteau word segments includes: selecting a current word segment one by one, starting from the first word segment among the multiple word segments; selecting, for each reference selection quantity, word segments that are continuous in the target text starting from the current word segment; and forming a portmanteau word segment from the current word segment and the word segments selected accordingly.
In one embodiment, generating the feature vector of the target text according to each word segment and each portmanteau word segment includes: obtaining the statistical feature value corresponding to each word segment and each portmanteau word segment; determining the word in the dictionary that matches each word segment and each portmanteau word segment; adding each statistical feature value, as a vector element, at the position in the vector template corresponding to the matched word; and setting the vector elements at the positions corresponding to dictionary words that are not matched by any word segment or portmanteau word segment to a default value, obtaining the feature vector of the target text. The positions in the vector template correspond one-to-one to the words in the dictionary.
In one embodiment, the computer program further causes the processor to perform the following steps: inputting the feature vector of the target text into the classification model and outputting the corresponding classification label; and marking the target text with the classification label.
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining sample texts and the corresponding classification labels; generating the feature vectors corresponding to the sample texts; and performing machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels to obtain the classification model.
In one embodiment, the target text is a social group name, and the corresponding classification label is a group purpose label. The computer program further causes the processor to perform the following steps: performing statistical analysis on the group purpose labels in the social network platform; screening the group purpose labels according to the result of the statistical analysis; and, in the social groups corresponding to the selected group purpose labels, recommending information corresponding to the selected group purpose labels.
In one embodiment, the target text is a media content name, and the corresponding classification label is a media content type label. The computer program further causes the processor to perform the following steps: obtaining the media content type label corresponding to a target user identifier; querying the media content names corresponding to the obtained media content type label; and pushing the queried media content names and the corresponding media content according to the target user identifier.
In one embodiment, a storage medium storing a computer program is provided. When executed by a processor, the computer program causes the processor to perform the following steps: obtaining a target text; performing word segmentation on the target text to obtain multiple word segments; taking, from the multiple word segments, word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to form portmanteau word segments, the preset quantity threshold being the maximum number of word segments that form one portmanteau word segment; and generating the feature vector of the target text according to each word segment and each portmanteau word segment.
In one embodiment, taking, from the multiple word segments, word segments that are continuous in the target text and whose number is less than or equal to the preset quantity threshold to form portmanteau word segments includes: obtaining the preset quantity threshold; selecting integers in turn, within the range greater than 1 and less than or equal to the preset quantity threshold, as reference selection quantities; and taking, from the multiple word segments, word segments that are continuous in the target text according to each reference selection quantity, to form portmanteau word segments.
In one embodiment, taking, from the multiple word segments, word segments that are continuous in the target text according to each reference selection quantity to form portmanteau word segments includes: selecting a current word segment one by one, starting from the first word segment among the multiple word segments; selecting, for each reference selection quantity, word segments that are continuous in the target text starting from the current word segment; and forming a portmanteau word segment from the current word segment and the word segments selected accordingly.
In one embodiment, generating the feature vector of the target text according to each word segment and each portmanteau word segment includes: obtaining the statistical feature value corresponding to each word segment and each portmanteau word segment; determining the word in the dictionary that matches each word segment and each portmanteau word segment; adding each statistical feature value, as a vector element, at the position in the vector template corresponding to the matched word; and setting the vector elements at the positions corresponding to dictionary words that are not matched by any word segment or portmanteau word segment to a default value, obtaining the feature vector of the target text. The positions in the vector template correspond one-to-one to the words in the dictionary.
In one embodiment, the computer program further causes the processor to perform the following steps: inputting the feature vector of the target text into the classification model and outputting the corresponding classification label; and marking the target text with the classification label.
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining sample texts and the corresponding classification labels; generating the feature vectors corresponding to the sample texts; and performing machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels to obtain the classification model.
In one embodiment, the target text is a social group name, and the corresponding classification label is a group purpose label. The computer program further causes the processor to perform the following steps: performing statistical analysis on the group purpose labels in the social network platform; screening the group purpose labels according to the result of the statistical analysis; and, in the social groups corresponding to the selected group purpose labels, recommending information corresponding to the selected group purpose labels.
In one embodiment, the target text is a media content name, and the corresponding classification label is a media content type label. The computer program further causes the processor to perform the following steps: obtaining the media content type label corresponding to a target user identifier; querying the media content names corresponding to the obtained media content type label; and pushing the queried media content names and the corresponding media content according to the target user identifier.
It should be understood that the steps in the embodiments of this application are not necessarily executed in the order indicated by the step numbers. Unless expressly stated otherwise herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed sequentially, and they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present invention. Their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and all of these fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.
Claims (15)
1. A text processing method, the method comprising:
obtaining a target text;
performing word segmentation on the target text to obtain a plurality of word segments;
taking, from the plurality of word segments, word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to constitute portmanteau word segments, the preset quantity threshold being the maximum number of word segments that constitute one portmanteau word segment; and
generating a feature vector of the target text according to each word segment and each portmanteau word segment.
2. The method according to claim 1, wherein taking, from the plurality of word segments, word segments that are continuous in the target text and whose number is less than or equal to the preset quantity threshold, to constitute portmanteau word segments, comprises:
obtaining the preset quantity threshold;
successively choosing each integer in the range greater than 1 and less than or equal to the preset quantity threshold as a reference selection quantity; and
taking, from the plurality of word segments and for each reference selection quantity, word segments that are continuous in the target text, to constitute portmanteau word segments.
3. The method according to claim 2, wherein taking, from the plurality of word segments and for each reference selection quantity, word segments that are continuous in the target text, to constitute portmanteau word segments, comprises:
choosing a current word segment one by one, starting from the initial word segment of the plurality of word segments;
choosing, for each reference selection quantity and starting from the current word segment, word segments that are continuous in the target text; and
constituting a portmanteau word segment from the current word segment and the correspondingly chosen word segments.
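The selection procedure of claims 2 and 3 can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `build_portmanteau_segments`, the `joiner` parameter and the sample segments are assumptions made for the example, and the sketch assumes that the reference selection quantity counts the current word segment itself.

```python
from typing import List


def build_portmanteau_segments(segments: List[str], max_count: int,
                               joiner: str = "") -> List[str]:
    """Join runs of 2..max_count consecutive word segments.

    `max_count` plays the role of the preset quantity threshold.
    `joiner` is an assumption of this sketch: "" suits unspaced text
    such as Chinese, " " suits space-delimited text.
    """
    combined: List[str] = []
    # Choose the current word segment one by one, from the initial segment.
    for start in range(len(segments)):
        # Each integer greater than 1 and at most the threshold is used
        # in turn as the reference selection quantity.
        for quantity in range(2, max_count + 1):
            if start + quantity > len(segments):
                break  # not enough consecutive segments remain
            combined.append(joiner.join(segments[start:start + quantity]))
    return combined


print(build_portmanteau_segments(["machine", "learning", "study", "group"], 3, " "))
# ['machine learning', 'machine learning study', 'learning study',
#  'learning study group', 'study group']
```

Because only consecutive segments are joined, the result preserves local word order, which is the information a bag of single word segments loses.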
4. The method according to claim 1, wherein generating the feature vector of the target text according to each word segment and each portmanteau word segment comprises:
obtaining the statistical feature value corresponding to each word segment and each portmanteau word segment;
determining the word in a dictionary that each word segment and each portmanteau word segment matches; and
adding each statistical feature value, as a vector element, to the position in a vector template that corresponds to the matched word, and setting to a default value the vector elements at the positions in the vector template that correspond to words in the dictionary matched by no word segment or portmanteau word segment, to obtain the feature vector of the target text, wherein the positions in the vector template correspond one-to-one to the words in the dictionary.
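The vector construction of claim 4 can be sketched as below. The claim does not fix which statistical feature value is used, so this sketch assumes a plain occurrence count; the function name `build_feature_vector`, the default of 0.0 and the sample dictionary are likewise illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List


def build_feature_vector(terms: List[str], dictionary: List[str],
                         default: float = 0.0) -> List[float]:
    """Map word segments and portmanteau word segments onto a dictionary.

    `terms` holds both the word segments and the portmanteau word segments.
    The statistical feature value is assumed here to be the occurrence count.
    """
    # One position in the vector template per dictionary word.
    position: Dict[str, int] = {word: i for i, word in enumerate(dictionary)}
    vector = [default] * len(dictionary)
    for term, count in Counter(terms).items():
        if term in position:  # the term matches a dictionary word
            vector[position[term]] = float(count)
    return vector


dictionary = ["study", "study group", "game", "machine learning"]
terms = ["machine", "learning", "study", "group",
         "machine learning", "learning study", "study group"]
print(build_feature_vector(terms, dictionary))  # [1.0, 1.0, 0.0, 1.0]
```

Terms that are absent from the dictionary are simply ignored, and dictionary words that match nothing keep the default value, so every text yields a vector of the same length.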
5. The method according to any one of claims 1 to 4, further comprising:
inputting the feature vector of the target text into a classification model, and outputting a corresponding classification label; and
labeling the target text with the classification label.
6. The method according to claim 5, further comprising:
obtaining sample texts and corresponding classification labels;
generating feature vectors corresponding to the sample texts; and
performing machine learning training according to the feature vectors corresponding to the sample texts and the corresponding classification labels, to obtain the classification model.
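Claims 5 and 6 leave the learning algorithm open. Purely as an illustration, the sketch below trains a nearest-centroid classifier on labeled feature vectors and uses it to label a new vector; the choice of algorithm, the function names `train_centroids` and `classify`, and the sample data are all assumptions of this example and are not taken from the patent.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def train_centroids(vectors: Sequence[Sequence[float]],
                    labels: Sequence[str]) -> Dict[str, List[float]]:
    """Compute one mean feature vector (centroid) per classification label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(model: Dict[str, List[float]], vec: Sequence[float]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid: List[float]) -> float:
        return sum((c - v) ** 2 for c, v in zip(centroid, vec))
    return min(model, key=lambda label: sq_dist(model[label]))


# Hypothetical sample feature vectors and their classification labels.
sample_vectors = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
sample_labels = ["study", "study", "gaming", "gaming"]
model = train_centroids(sample_vectors, sample_labels)
print(classify(model, [1, 0, 0]))  # study
```

Any classifier that accepts fixed-length numeric vectors could take the place of the centroid model here; the claims require only that training uses the sample feature vectors together with their labels.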
7. The method according to claim 5, wherein the target text is a social group name and the corresponding classification label is a group purpose label;
the method further comprising:
performing statistical analysis on the group purpose labels in a social network platform;
screening the group purpose labels according to the result of the statistical analysis; and
recommending, in the social groups corresponding to the selected group purpose labels, information corresponding to the selected group purpose labels.
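One way to read claim 7 is sketched below. The claim does not say which statistic or screening rule is used, so this sketch assumes that labels are screened by how many groups carry them; the function name `recommend_by_label`, the `min_groups` parameter and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict


def recommend_by_label(group_labels: Dict[str, str],
                       info_by_label: Dict[str, str],
                       min_groups: int) -> Dict[str, str]:
    """Pick information to recommend in each social group.

    group_labels:  group identifier -> group purpose label
    info_by_label: group purpose label -> information to recommend
    min_groups:    assumed screening rule, keep labels carried by at
                   least this many groups
    """
    # Statistical analysis: count how many groups carry each label.
    counts = Counter(group_labels.values())
    # Screening: keep only the labels that pass the assumed threshold.
    selected = {label for label, n in counts.items() if n >= min_groups}
    # Recommend the matching information in the groups with a selected label.
    return {group: info_by_label[label]
            for group, label in group_labels.items()
            if label in selected and label in info_by_label}


groups = {"g1": "study", "g2": "study", "g3": "gaming"}
info = {"study": "course notice", "gaming": "game notice"}
print(recommend_by_label(groups, info, min_groups=2))
# {'g1': 'course notice', 'g2': 'course notice'}
```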
8. The method according to claim 5, wherein the target text is a media content name and the corresponding classification label is a media content type label;
the method further comprising:
obtaining the media content type label corresponding to a target user identifier;
querying the media content names corresponding to the obtained media content type label; and
pushing, according to the target user identifier, the queried media content names and the corresponding media content.
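The lookup in claim 8 can be sketched with two in-memory mappings. How user labels and media labels are actually stored is not stated in the claim, so the dictionaries, the function name `media_to_push` and the sample data below are assumptions for illustration only.

```python
from typing import Dict, List


def media_to_push(user_id: str,
                  label_by_user: Dict[str, str],
                  label_by_media_name: Dict[str, str]) -> List[str]:
    """Return the media content names whose type label matches the user's label."""
    # Obtain the media content type label for the target user identifier.
    label = label_by_user.get(user_id)
    if label is None:
        return []  # nothing is known about this user
    # Query the media content names that carry the same type label.
    return [name for name, media_label in label_by_media_name.items()
            if media_label == label]


label_by_user = {"u1": "comedy"}
label_by_media_name = {"Funny Show": "comedy", "Night News": "news",
                       "Laugh Hour": "comedy"}
print(media_to_push("u1", label_by_user, label_by_media_name))
# ['Funny Show', 'Laugh Hour']
```

In this reading, the media content type labels would come from classifying each media content name as in claim 5, and the returned names identify the content to push to the user.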
9. A text processing device, wherein the device comprises:
a word segmentation module, configured to obtain a target text and perform word segmentation on the target text to obtain a plurality of word segments;
a composite module, configured to take, from the plurality of word segments, word segments that are continuous in the target text and whose number is less than or equal to a preset quantity threshold, to constitute portmanteau word segments, the preset quantity threshold being the maximum number of word segments that constitute one portmanteau word segment; and
a vector generation module, configured to generate a feature vector of the target text according to each word segment and each portmanteau word segment.
10. The device according to claim 9, wherein the composite module is further configured to: obtain the preset quantity threshold; successively choose each integer in the range greater than 1 and less than or equal to the preset quantity threshold as a reference selection quantity; and take, from the plurality of word segments and for each reference selection quantity, word segments that are continuous in the target text, to constitute portmanteau word segments.
11. The device according to claim 10, wherein the composite module is further configured to: choose a current word segment one by one, starting from the initial word segment of the plurality of word segments; choose, for each reference selection quantity and starting from the current word segment, word segments that are continuous in the target text; and constitute a portmanteau word segment from the current word segment and the correspondingly chosen word segments.
12. The device according to any one of claims 9 to 11, wherein the device further comprises:
a categorization module, configured to input the feature vector of the target text into a classification model, output a corresponding classification label, and label the target text with the classification label.
13. The device according to claim 12, wherein the target text is a social group name and the corresponding classification label is a group purpose label;
the device further comprising:
an information recommendation module, configured to perform statistical analysis on the group purpose labels in a social network platform; screen the group purpose labels according to the result of the statistical analysis; and recommend, in the social groups corresponding to the selected group purpose labels, information corresponding to the selected group purpose labels.
14. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.
15. A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810023358.4A CN110020420B (en) | 2018-01-10 | 2018-01-10 | Text processing method, device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810023358.4A CN110020420B (en) | 2018-01-10 | 2018-01-10 | Text processing method, device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110020420A CN110020420A (en) | 2019-07-16 |
CN110020420B CN110020420B (en) | 2023-07-21 |
Family
ID=67188115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810023358.4A Active CN110020420B (en) | 2018-01-10 | 2018-01-10 | Text processing method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020420B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110413738A (en) * | 2019-07-31 | 2019-11-05 | 腾讯科技(深圳)有限公司 | A kind of information processing method, device, server and storage medium |
CN110704621A (en) * | 2019-09-25 | 2020-01-17 | 北京大米科技有限公司 | Text processing method and device, storage medium and electronic equipment |
CN110781296A (en) * | 2019-09-16 | 2020-02-11 | 中国平安人寿保险股份有限公司 | Data classification method based on deep learning and related equipment thereof |
CN111142728A (en) * | 2019-12-26 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Vehicle-mounted environment intelligent text processing method and device, electronic equipment and storage medium |
CN111311455A (en) * | 2020-01-17 | 2020-06-19 | 广东德诚科教有限公司 | Examination information matching method and device, computer equipment and storage medium |
CN111626055A (en) * | 2020-05-25 | 2020-09-04 | 泰康保险集团股份有限公司 | Text processing method and device, computer storage medium and electronic equipment |
CN112559732A (en) * | 2019-09-25 | 2021-03-26 | 阿里巴巴集团控股有限公司 | Text processing method, device and system |
Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002342324A (en) * | 2001-05-16 | 2002-11-29 | Nippon Telegr & Teleph Corp <Ntt> | Method and device for dividing text, text dividing program and storage medium with the program stored therein |
CN103116588A (en) * | 2011-11-17 | 2013-05-22 | 腾讯科技(深圳)有限公司 | Method and system for personalized recommendation |
CN104008171A (en) * | 2014-06-03 | 2014-08-27 | 中国科学院计算技术研究所 | Legal database establishing method and legal retrieving service method |
CN104572688A (en) * | 2013-10-17 | 2015-04-29 | 腾讯科技(深圳)有限公司 | Information push method and device |
CN104615608A (en) * | 2014-04-28 | 2015-05-13 | 腾讯科技(深圳)有限公司 | Data mining processing system and method |
WO2015149533A1 (en) * | 2014-03-31 | 2015-10-08 | 北京奇虎科技有限公司 | Method and device for word segmentation processing on basis of webpage content classification |
CN105488077A (en) * | 2014-10-10 | 2016-04-13 | 腾讯科技(深圳)有限公司 | Content tag generation method and apparatus |
CN105653553A (en) * | 2014-11-14 | 2016-06-08 | 腾讯科技(深圳)有限公司 | Term weight generation method and device |
CN105791543A (en) * | 2016-02-23 | 2016-07-20 | 北京奇虎科技有限公司 | Method, device, client and system for cleaning short messages |
CN105808524A (en) * | 2016-03-11 | 2016-07-27 | 江苏畅远信息科技有限公司 | Patent document abstract-based automatic patent classification method |
CN106095996A (en) * | 2016-06-22 | 2016-11-09 | 量子云未来(北京)信息科技有限公司 | Method for text classification |
CN106445974A (en) * | 2015-08-12 | 2017-02-22 | 腾讯科技(深圳)有限公司 | Data recommendation method and apparatus |
CN106502989A (en) * | 2016-10-31 | 2017-03-15 | 东软集团股份有限公司 | Sentiment analysis method and device |
CN106528642A (en) * | 2016-10-13 | 2017-03-22 | 广东广业开元科技有限公司 | TF-IDF feature extraction based short text classification method |
CN106557463A (en) * | 2016-10-31 | 2017-04-05 | 东软集团股份有限公司 | Sentiment analysis method and device |
CN106897428A (en) * | 2017-02-27 | 2017-06-27 | 腾讯科技(深圳)有限公司 | Text classification feature extracting method, file classification method and device |
CN107085581A (en) * | 2016-02-16 | 2017-08-22 | 腾讯科技(深圳)有限公司 | Short text classification method and device |
CN107105030A (en) * | 2017-04-20 | 2017-08-29 | 腾讯科技(深圳)有限公司 | Promotional content method for pushing and device |
2018-01-10: CN application CN201810023358.4A, patent CN110020420B, status Active
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002342324A (en) * | 2001-05-16 | 2002-11-29 | Nippon Telegr & Teleph Corp <Ntt> | Method and device for dividing text, text dividing program and storage medium with the program stored therein |
CN103116588A (en) * | 2011-11-17 | 2013-05-22 | 腾讯科技(深圳)有限公司 | Method and system for personalized recommendation |
CN104572688A (en) * | 2013-10-17 | 2015-04-29 | 腾讯科技(深圳)有限公司 | Information push method and device |
WO2015149533A1 (en) * | 2014-03-31 | 2015-10-08 | 北京奇虎科技有限公司 | Method and device for word segmentation processing on basis of webpage content classification |
CN104615608A (en) * | 2014-04-28 | 2015-05-13 | 腾讯科技(深圳)有限公司 | Data mining processing system and method |
CN104008171A (en) * | 2014-06-03 | 2014-08-27 | 中国科学院计算技术研究所 | Legal database establishing method and legal retrieving service method |
CN105488077A (en) * | 2014-10-10 | 2016-04-13 | 腾讯科技(深圳)有限公司 | Content tag generation method and apparatus |
CN105653553A (en) * | 2014-11-14 | 2016-06-08 | 腾讯科技(深圳)有限公司 | Term weight generation method and device |
CN106445974A (en) * | 2015-08-12 | 2017-02-22 | 腾讯科技(深圳)有限公司 | Data recommendation method and apparatus |
CN107085581A (en) * | 2016-02-16 | 2017-08-22 | 腾讯科技(深圳)有限公司 | Short text classification method and device |
CN105791543A (en) * | 2016-02-23 | 2016-07-20 | 北京奇虎科技有限公司 | Method, device, client and system for cleaning short messages |
CN105808524A (en) * | 2016-03-11 | 2016-07-27 | 江苏畅远信息科技有限公司 | Patent document abstract-based automatic patent classification method |
CN106095996A (en) * | 2016-06-22 | 2016-11-09 | 量子云未来(北京)信息科技有限公司 | Method for text classification |
CN106528642A (en) * | 2016-10-13 | 2017-03-22 | 广东广业开元科技有限公司 | TF-IDF feature extraction based short text classification method |
CN106502989A (en) * | 2016-10-31 | 2017-03-15 | 东软集团股份有限公司 | Sentiment analysis method and device |
CN106557463A (en) * | 2016-10-31 | 2017-04-05 | 东软集团股份有限公司 | Sentiment analysis method and device |
CN106897428A (en) * | 2017-02-27 | 2017-06-27 | 腾讯科技(深圳)有限公司 | Text classification feature extracting method, file classification method and device |
CN107105030A (en) * | 2017-04-20 | 2017-08-29 | 腾讯科技(深圳)有限公司 | Promotional content method for pushing and device |
Non-Patent Citations (10)
Title |
---|
CXMSCB: "机器学习之朴素贝叶斯模型及代码示例", Retrieved from the Internet <URL:https://blog.csdn.net/cxmscb/article/details/69267326> * |
WIY_DAWN: "语句的向量表示方法——单词向量组合", 《CSDN博客》 * |
WIY_DAWN: "语句的向量表示方法——单词向量组合", 《CSDN博客》, 5 May 2017 (2017-05-05) * |
向阳等: "Agent驱动的中文本体智能构建研究", 《计算机工程与应用》 * |
向阳等: "Agent驱动的中文本体智能构建研究", 《计算机工程与应用》, vol. 45, no. 10, 1 April 2009 (2009-04-01), pages 133 - 137 * |
廖浩 等: "基于词语关联的文本特征词提取方法", 计算机应用, vol. 27, no. 12, pages 3009 - 3012 * |
王聪: "以"精准化"提升"重合度"——浅谈新媒体的广告传播及对双边市场的影响", 《新闻研究导刊》 * |
王聪: "以"精准化"提升"重合度"——浅谈新媒体的广告传播及对双边市场的影响", 《新闻研究导刊》, 10 December 2015 (2015-12-10) * |
陈鸿 等: "基于上下文特征分类的评论长句切分方法", 计算机工程, vol. 41, no. 9, pages 233 - 237 * |
黄旭: "基于机器学习的汉语短文本分类方法研究与实现", 中国优秀硕士学位论文全文数据库 信息科技辑, no. 02, pages 138 - 4413 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110413738A (en) * | 2019-07-31 | 2019-11-05 | 腾讯科技(深圳)有限公司 | A kind of information processing method, device, server and storage medium |
CN110781296A (en) * | 2019-09-16 | 2020-02-11 | 中国平安人寿保险股份有限公司 | Data classification method based on deep learning and related equipment thereof |
CN110704621A (en) * | 2019-09-25 | 2020-01-17 | 北京大米科技有限公司 | Text processing method and device, storage medium and electronic equipment |
CN112559732A (en) * | 2019-09-25 | 2021-03-26 | 阿里巴巴集团控股有限公司 | Text processing method, device and system |
CN111142728A (en) * | 2019-12-26 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Vehicle-mounted environment intelligent text processing method and device, electronic equipment and storage medium |
CN111142728B (en) * | 2019-12-26 | 2022-06-03 | 腾讯科技(深圳)有限公司 | Vehicle-mounted environment intelligent text processing method and device, electronic equipment and storage medium |
CN111311455A (en) * | 2020-01-17 | 2020-06-19 | 广东德诚科教有限公司 | Examination information matching method and device, computer equipment and storage medium |
CN111311455B (en) * | 2020-01-17 | 2024-02-06 | 广东德诚科教有限公司 | Examination information matching method, examination information matching device, computer equipment and storage medium |
CN111626055A (en) * | 2020-05-25 | 2020-09-04 | 泰康保险集团股份有限公司 | Text processing method and device, computer storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110020420B (en) | 2023-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110020420A (en) | Text handling method, device, computer equipment and storage medium | |
CN110544155B (en) | User credit score acquisition method, acquisition device, server and storage medium | |
CN110263235B (en) | Information push object updating method and device and computer equipment | |
CN109783730A (en) | Products Show method, apparatus, computer equipment and storage medium | |
CN109872162B (en) | Wind control classification and identification method and system for processing user complaint information | |
CN108595440B (en) | Short text content classification method and system | |
CN107341145A (en) | A kind of user feeling analysis method based on deep learning | |
CN112380859A (en) | Public opinion information recommendation method and device, electronic equipment and computer storage medium | |
CN107807958B (en) | Personalized article list recommendation method, electronic equipment and storage medium | |
Alahmadi et al. | Twitter-based recommender system to address cold-start: A genetic algorithm based trust modelling and probabilistic sentiment analysis | |
CN112215008A (en) | Entity recognition method and device based on semantic understanding, computer equipment and medium | |
CN108228808A (en) | Determine the method, apparatus of focus incident and storage medium and electronic equipment | |
CN107077640A (en) | Analyzed via experience ownership, it is qualification and intake unstructured data sources system and processing | |
CN109615504A (en) | Products Show method, apparatus, electronic equipment and computer readable storage medium | |
CN110046251A (en) | Community content methods of risk assessment and device | |
CN115392237B (en) | Emotion analysis model training method, device, equipment and storage medium | |
CN113360768A (en) | Product recommendation method, device and equipment based on user portrait and storage medium | |
CN111104590A (en) | Information recommendation method, device, medium and electronic equipment | |
CN108021713B (en) | Document clustering method and device | |
CN112765966B (en) | Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment | |
CN113656699A (en) | User feature vector determination method, related device and medium | |
CN110287270B (en) | Entity relationship mining method and equipment | |
CN116842478A (en) | User attribute prediction method based on twitter content | |
CN110377819A (en) | Arbitrator's recommended method, device and computer equipment based on big data | |
Fang et al. | Emaildetective: An email authorship identification and verification model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||