CN104615681A - Text selecting method and device - Google Patents


Info

Publication number
CN104615681A
Authority
CN
China
Prior art keywords
text
candidate
quality feature
error rate
hash value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510030778.1A
Other languages
Chinese (zh)
Other versions
CN104615681B (en)
Inventor
王炜
田旭
李媛媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Shenma Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shenma Mobile Information Technology Co Ltd
Priority to CN201510030778.1A
Publication of CN104615681A
Application granted
Publication of CN104615681B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries

Abstract

The embodiment of the invention discloses a text selecting method and device. The method comprises: calculating the error rate of the selected quality features in each candidate text, and calculating the tolerance of the selected quality features in each candidate text according to that error rate; determining the text quality of each candidate text according to the tolerances of its selected quality features; and selecting the candidate text with the highest text quality and providing it to the user. In this scheme, instead of ranking the candidate texts in the order in which they were crawled, the text quality of the candidate texts is calculated first and the candidate text with the highest text quality is then provided to the user. The user obtains the best candidate text directly, without browsing multiple candidate texts, which improves the user experience.

Description

Text selection method and device
Technical Field
The present invention relates to the field of network technology, and in particular to a text selection method and device.
Background
With the rapid development of Internet technology, the texts available on the Internet, such as novels, papers, and film reviews, are increasingly abundant. When a user needs to search for a text, the user can enter a keyword (query) into the search engine of a client. Because several websites usually provide the text the user needs, the server, after receiving the keyword, crawls the text corresponding to the keyword from each website as candidate texts, sorts the candidate texts in the order in which they were crawled, and returns them to the search engine for the user to choose from.
In the above method, the candidate texts are sorted in crawl order before being returned to the user, and their quality is not considered. As a result, poor-quality candidate texts may be ranked first and better ones later, so the user has to browse several candidate texts to find the best one. The above method therefore cannot select the candidate text with the highest text quality for the user, which seriously affects the user experience.
Summary of the Invention
Embodiments of the present invention provide a text selection method and device to solve the problem in the prior art that the candidate text with the highest text quality cannot be selected and provided to the user, which seriously affects the user experience.
According to an embodiment of the present invention, a text selection method is provided, comprising:
calculating the error rate of each selected quality feature in each candidate text, and calculating the tolerance of the corresponding selected quality feature in each candidate text according to that error rate;
determining the text quality of each candidate text according to the tolerances of the selected quality features in that candidate text; and
selecting the candidate text with the highest text quality and providing it to the user.
Specifically, calculating the error rate of the selected quality features in each candidate text comprises:
for each candidate text, performing:
counting the quantity of each selected quality feature in the current candidate text;
dividing the counted quantity of each selected quality feature by the number of characters in the current candidate text to obtain the actual proportion of each selected quality feature in the current candidate text; and
calculating the error rate of the corresponding selected quality feature of the current candidate text according to the actual proportion of each selected quality feature in the current candidate text.
Specifically, calculating the error rate of the corresponding selected quality feature of the current candidate text according to the actual proportion of each selected quality feature comprises:
calculating the error rate of a selected quality feature by the following formula:
error rate of the selected quality feature = maximum threshold of the error rate of the selected quality feature * (1 - actual proportion of the selected quality feature / standard proportion of the selected quality feature).
Specifically, calculating the tolerance of the corresponding selected quality feature in each candidate text according to the error rate of that feature comprises:
calculating the tolerance of a selected quality feature by the following formula:
tolerance of the selected quality feature = 1 - (error rate of the selected quality feature) ^ (1 / (error rate of the selected quality feature + 1)).
Specifically, determining the text quality of each candidate text according to the tolerances of its selected quality features comprises:
determining the weight of each selected quality feature; and
computing the weighted sum of the tolerances of the selected quality features in each candidate text, using the weight of each selected quality feature, to obtain the text quality of that candidate text.
Specifically, selecting the candidate text with the highest text quality and providing it to the user comprises:
calculating the hash value of each candidate text;
determining an advantage group according to the hash values of the candidate texts; and
selecting the candidate text with the highest text quality from the advantage group and providing it to the user.
Specifically, calculating the hash value of each candidate text comprises:
for each candidate text, performing:
splitting the current candidate text at set symbols to obtain the sentences contained in the current candidate text;
calculating the hash value of each sentence contained in the current candidate text; and
combining the hash values of the sentences contained in the current candidate text to obtain the hash value of the current candidate text.
Specifically, determining the advantage group according to the hash values of the candidate texts comprises:
combining all candidate texts in pairs to obtain candidate text pairs;
calculating the Hamming distance between the hash values of each candidate text pair;
selecting the two candidate texts of the candidate text pair with the smallest Hamming distance as reference texts; and
adding to the advantage group the reference texts and the candidate texts whose Hamming distance to the reference texts is less than a set threshold.
According to an embodiment of the present invention, a text selection device is also provided, comprising:
a computing unit, configured to calculate the error rate of each selected quality feature in each candidate text, and to calculate the tolerance of the corresponding selected quality feature in each candidate text according to that error rate;
a determining unit, configured to determine the text quality of each candidate text according to the tolerances of the selected quality features in that candidate text; and
a selection unit, configured to select the candidate text with the highest text quality and provide it to the user.
Specifically, the computing unit comprises a counting subunit, an actual proportion calculation subunit, and an error rate calculation subunit, wherein:
the counting subunit is configured to perform, for each candidate text: counting the quantity of each selected quality feature in the current candidate text;
the actual proportion calculation subunit is configured to divide the counted quantity of each selected quality feature by the number of characters in the current candidate text to obtain the actual proportion of each selected quality feature in the current candidate text; and
the error rate calculation subunit is configured to calculate the error rate of the corresponding selected quality feature of the current candidate text according to the actual proportion of each selected quality feature in the current candidate text.
Specifically, the error rate calculation subunit is configured to:
calculate the error rate of a selected quality feature by the following formula:
error rate of the selected quality feature = maximum threshold of the error rate of the selected quality feature * (1 - actual proportion of the selected quality feature / standard proportion of the selected quality feature).
Optionally, the computing unit further comprises a tolerance calculation subunit, configured to:
calculate the tolerance of the corresponding selected quality feature in each candidate text according to the error rate of that feature.
Specifically, the tolerance calculation subunit is configured to:
calculate the tolerance of a selected quality feature by the following formula:
tolerance of the selected quality feature = 1 - (error rate of the selected quality feature) ^ (1 / (error rate of the selected quality feature + 1)).
Specifically, the determining unit comprises a weight determination subunit and a text quality calculation subunit, wherein:
the weight determination subunit is configured to determine the weight of each selected quality feature; and
the text quality calculation subunit is configured to compute the weighted sum of the tolerances of the selected quality features in each candidate text, using the weight of each selected quality feature, to obtain the text quality of that candidate text.
Specifically, the selection unit comprises a hash value calculation subunit, an advantage group determination subunit, and a text selection subunit, wherein:
the hash value calculation subunit is configured to calculate the hash value of each candidate text;
the advantage group determination subunit is configured to determine the advantage group according to the hash values of the candidate texts; and
the text selection subunit is configured to select the candidate text with the highest text quality from the advantage group and provide it to the user.
Optionally, the hash value calculation subunit further comprises a segmentation subunit, a sentence hash value calculation subunit, and a text hash value calculation subunit, wherein:
the segmentation subunit is configured to perform, for each candidate text: splitting the current candidate text at set symbols to obtain the sentences contained in the current candidate text;
the sentence hash value calculation subunit is configured to calculate the hash value of each sentence contained in the current candidate text; and
the text hash value calculation subunit is configured to combine the hash values of the sentences contained in the current candidate text to obtain the hash value of the current candidate text.
Optionally, the advantage group determination subunit further comprises a combination subunit, a Hamming distance calculation subunit, a reference text selection subunit, and an adding subunit, wherein:
the combination subunit is configured to combine all candidate texts in pairs to obtain candidate text pairs;
the Hamming distance calculation subunit is configured to calculate the Hamming distance between the hash values of each candidate text pair;
the reference text selection subunit is configured to select the two candidate texts of the candidate text pair with the smallest Hamming distance as reference texts; and
the adding subunit is configured to add to the advantage group the reference texts and the candidate texts whose Hamming distance to the reference texts is less than a set threshold.
Embodiments of the present invention provide a text selection method and device that calculate the error rate of each selected quality feature in each candidate text, calculate the tolerance of the corresponding selected quality feature according to that error rate, determine the text quality of each candidate text according to the tolerances of its selected quality features, and select the candidate text with the highest text quality for the user. In this scheme, the candidate texts are not returned to the user in crawl order. Instead, the text quality of the candidate texts is calculated first and the candidate text with the highest text quality is then provided to the user, so the user obtains the best candidate text directly without browsing multiple candidate texts, which improves the user experience.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a text selection method in an embodiment of the present invention;
Fig. 2 is a flowchart of an implementation of S13 in an embodiment of the present invention;
Fig. 3 is a schematic diagram of a practical application scenario in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a text selection device in an embodiment of the present invention.
Detailed Description
To address the problem in the prior art that the candidate text with the highest text quality cannot be selected and provided to the user, which seriously affects the user experience, an embodiment of the present invention provides a text selection method. The flow of the method is shown in Fig. 1. The method may be executed by, but is not limited to, a server; the description below takes a server as the executing entity. The specific steps are as follows:
S11: calculate the error rate of each selected quality feature in each candidate text, and calculate the tolerance of the corresponding selected quality feature in each candidate text according to that error rate.
Candidate texts usually contain features such as punctuation, paragraphing, common symbols, non-pinyin, non-garbled characters, and non-asterisk characters, so some or all of these features are chosen as the selected quality features of the candidate texts.
Each selected quality feature of a candidate text has an optimal value, but in any given candidate text the selected quality features are not necessarily all at their optimal values, so the error rate can be used to express the quality of a selected quality feature.
Different selected quality features of a candidate text affect the user's reading experience differently, that is, the degree to which the user can put up with them differs. The degree to which the user tolerates a selected quality feature is defined as the tolerance of that feature. For example, users can accept a small amount of "pinyin in place of characters", in which case the tolerance is high; users cannot accept a text in which pinyin replaces characters throughout, in which case the tolerance is low.
S12: determine the text quality of each candidate text according to the tolerances of the selected quality features in that candidate text.
The text quality of a candidate text can be obtained from the tolerances of its selected quality features. The text quality characterizes how good the candidate text is.
S13: select the candidate text with the highest text quality and provide it to the user.
The candidate text with the highest text quality can be provided to the user. Alternatively, the candidate texts can be sorted by text quality from high to low and then provided to the user, which makes it easier for the user to choose.
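The two options in S13 can be sketched as follows. This is a minimal illustration, assuming the text quality of each candidate has already been computed; the names `rank_candidates`, `best_candidate`, and the mapping from a candidate identifier to its quality score are hypothetical and not part of the original method.

```python
from typing import Dict, List


def rank_candidates(qualities: Dict[str, float]) -> List[str]:
    """Return candidate identifiers sorted by text quality, highest first."""
    return sorted(qualities, key=qualities.get, reverse=True)


def best_candidate(qualities: Dict[str, float]) -> str:
    """Return the identifier of the candidate with the highest text quality."""
    return max(qualities, key=qualities.get)
```

Either the single best identifier or the whole ranked list could then be returned to the client.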
In this scheme, the candidate texts are not returned to the user in crawl order. Instead, the text quality of the candidate texts is calculated first and the candidate text with the highest text quality is then provided to the user, so the user obtains the best candidate text directly without browsing multiple candidate texts, which improves the user experience.
In the present invention, when a user needs to search for a text, the user can enter a keyword (query) into the search engine of the client. After receiving the keyword, the server can obtain at least one candidate text corresponding to the keyword. For example, a user who wants to find the novel "Red Sorghum" can enter the keyword "Red Sorghum" into the client's search engine, and the server can obtain at least one candidate text corresponding to "Red Sorghum". For each candidate text of "Red Sorghum", the error rate and tolerance of the selected quality features can be calculated.
However, to speed up the user's search and improve the server's response time, S11 to S13 can be performed in advance on the server side, so that the quality of the candidate texts of each novel is precomputed. When the user searches for a keyword on the client, the highest-quality candidate text among the precomputed candidate texts of the novel matching the keyword can be provided to the user directly. There is then no need to calculate from scratch the error rate and tolerance of the selected quality features in each candidate text and then determine the text quality of each candidate text from those tolerances.
Only when the server has not precomputed the quality of the candidate texts of the novel the user is searching for does it perform S11 to S13 from scratch in response to the user's search request. That is, it obtains at least one candidate text corresponding to the keyword entered by the user, calculates the error rate and tolerance of the selected quality features in each candidate text, determines the text quality of each candidate text from those tolerances, and selects the candidate text with the highest text quality for the user.
This reduces both the user's waiting time and the server's computational load.
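The precompute-then-fall-back behaviour described above can be sketched as a lookup that consults precomputed results first. This is only an illustration: `make_lookup`, `fetch_candidates`, and `select_best` are hypothetical names, and the two callables stand in for crawling the candidate texts and for running S11 to S13 on them.

```python
from typing import Callable, Dict, List


def make_lookup(
    precomputed: Dict[str, str],
    fetch_candidates: Callable[[str], List[str]],
    select_best: Callable[[List[str]], str],
) -> Callable[[str], str]:
    """Build a keyword lookup that prefers precomputed best texts."""

    def lookup(keyword: str) -> str:
        # Fast path: the best candidate text was computed in advance.
        if keyword in precomputed:
            return precomputed[keyword]
        # Slow path: run the full selection in response to this request.
        return select_best(fetch_candidates(keyword))

    return lookup
```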
Specifically, calculating the error rate of the selected quality features in each candidate text in S11 comprises:
for each candidate text, performing:
counting the quantity of each selected quality feature in the current candidate text;
dividing the counted quantity of each selected quality feature by the number of characters in the current candidate text to obtain the actual proportion of each selected quality feature in the current candidate text; and
calculating the error rate of the corresponding selected quality feature of the current candidate text according to the actual proportion of each selected quality feature in the current candidate text.
The candidate text whose error rates currently need to be calculated is taken as the current candidate text. First, the quantity of each selected quality feature in the current candidate text is counted. For example, if the selected quality features are full stops, commas, and paragraph breaks, the numbers of full stops, commas, and paragraph breaks are counted separately. Next, the number of characters in the current candidate text is determined and the actual proportion of each selected quality feature is calculated: the number of full stops divided by the number of characters gives the actual proportion of full stops, the number of commas divided by the number of characters gives the actual proportion of commas, and the number of paragraph breaks divided by the number of characters gives the actual proportion of paragraph breaks. Each actual proportion is a decimal between 0 and 1. Finally, the error rate of the corresponding selected quality feature of the current candidate text is calculated from the actual proportion of each selected quality feature.
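The counting and division steps above can be sketched as follows. The feature table `FEATURE_CHARS` is a hypothetical illustration: it assumes each selected quality feature can be recognized by a small set of characters and that a paragraph break is a newline, which the source does not specify.

```python
from typing import Dict

# Hypothetical mapping from feature name to the characters that count toward it.
FEATURE_CHARS: Dict[str, str] = {
    "full_stop": "。.",
    "comma": "，,",
    "paragraph_break": "\n",
}


def actual_proportions(text: str) -> Dict[str, float]:
    """Count each selected quality feature and divide by the character count."""
    total = len(text)
    if total == 0:
        # An empty text contains no features at all.
        return {name: 0.0 for name in FEATURE_CHARS}
    return {
        name: sum(text.count(ch) for ch in chars) / total
        for name, chars in FEATURE_CHARS.items()
    }
```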
The error rate of a selected quality feature can be calculated by the following formula: error rate of the selected quality feature = maximum threshold of the error rate of the selected quality feature * (1 - actual proportion of the selected quality feature / standard proportion of the selected quality feature).
Here, the maximum threshold of the error rate of a selected quality feature is the value the feature takes in the worst case. Its specific value is determined by the overall impact of the different selected quality features on the user's reading experience and can be obtained by manual evaluation, for example by ranking how well users can tolerate the worst case of each selected quality feature. Take the two cases of no punctuation at all and a large amount of pinyin: a text with no punctuation at all is unacceptable, whereas a text with a large amount of pinyin can be accepted, so the maximum threshold of the error rate for punctuation should be larger than that for pinyin. The standard proportion of a selected quality feature is obtained by statistical analysis of good-quality texts corresponding to the keyword entered by the user. When the actual proportion of a selected quality feature is greater than or equal to its standard proportion, the error rate of that feature is defined as 0; when the actual proportion is close to 0, the error rate of that feature is defined as the maximum threshold of its error rate.
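The error rate formula, including the rule that a feature at or above its standard proportion has an error rate of 0, can be sketched as below. The function name and the positivity check on the standard proportion are assumptions made for this illustration.

```python
def error_rate(actual: float, standard: float, max_threshold: float) -> float:
    """Error rate = max_threshold * (1 - actual / standard), floored at 0."""
    if standard <= 0:
        raise ValueError("standard proportion must be positive")
    if actual >= standard:
        # The feature is at least as frequent as in good-quality texts.
        return 0.0
    return max_threshold * (1.0 - actual / standard)
```

With a standard proportion of 0.05 and a maximum threshold of 0.8, an actual proportion of 0 gives the maximum error rate 0.8, and an actual proportion of 0.05 or more gives 0.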
Specifically, in S11 the tolerance of a selected quality feature is calculated from the error rate of that feature in each candidate text by the following formula:
tolerance of the selected quality feature = 1 - (error rate of the selected quality feature) ^ (1 / (error rate of the selected quality feature + 1)).
The larger the tolerance, the better the quality of the candidate text and the better the user's reading experience; the smaller the tolerance, the poorer the quality of the candidate text and the poorer the user's reading experience.
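A direct transcription of the tolerance formula is sketched below. It assumes the error rate lies in [0, 1], that is, that the maximum thresholds are chosen to be at most 1; the source does not state this range, so the check is an assumption of this illustration.

```python
def tolerance(err: float) -> float:
    """Tolerance = 1 - err ** (1 / (err + 1)), for an error rate in [0, 1]."""
    if not 0.0 <= err <= 1.0:
        raise ValueError("error rate is assumed to lie in [0, 1]")
    return 1.0 - err ** (1.0 / (err + 1.0))
```

Under that assumption the tolerance falls from 1 at an error rate of 0 to 0 at an error rate of 1, so a lower error rate always yields a higher tolerance.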
Specifically, in S12 the text quality of each candidate text is determined from the tolerances of its selected quality features as follows: determine the weight of each selected quality feature, then compute the weighted sum of the tolerances of the selected quality features in each candidate text, using those weights, to obtain the text quality of that candidate text.
The weight of each selected quality feature can be determined in several ways. For example, the weights can be set manually, and the text quality of each candidate text is then the weighted sum of the tolerances of its selected quality features under the manually set weights. Alternatively, the weights can be obtained by machine learning, and the text quality of each candidate text is then the weighted sum of the tolerances of its selected quality features under the learned weights.
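The weighted sum can be sketched in a few lines. The dictionaries keyed by feature name are an assumed representation, and the example weights in the note below are invented for illustration only.

```python
from typing import Dict


def text_quality(tolerances: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the per-feature tolerances of one candidate text."""
    return sum(weights[name] * tol for name, tol in tolerances.items())
```

For instance, tolerances of 0.5 for punctuation and 1.0 for pinyin with weights 0.4 and 0.6 give a text quality of 0.4 * 0.5 + 0.6 * 1.0 = 0.8.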
The process of obtaining the weight of each selected quality feature by machine learning is as follows. First, at least one good-quality text (positive sample) and at least one poor-quality text (negative sample) are manually labeled to form the training set. Then each pre-generated weight combination is used to compute a weighted sum for each sample in the training set, which gives the text quality of each sample: a positive score for each positive sample and a negative score for each negative sample. Finally, the weight combination for which a positive score falls below a negative score the fewest times over the training samples is taken as the optimal weight combination.
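The selection of the optimal weight combination can be sketched as an exhaustive search over the pre-generated combinations. This is one reading of the description: it counts every (positive sample, negative sample) pair in which the positive score is lower, which is an assumption about how the occurrences are counted. All names are hypothetical.

```python
from typing import Dict, List

Sample = Dict[str, float]   # feature name -> tolerance
Weights = Dict[str, float]  # feature name -> weight


def score(sample: Sample, weights: Weights) -> float:
    """Weighted sum of a sample's tolerances."""
    return sum(weights[name] * sample[name] for name in weights)


def inversions(weights: Weights, positives: List[Sample], negatives: List[Sample]) -> int:
    """Number of pairs where a positive sample scores below a negative one."""
    return sum(
        1
        for p in positives
        for n in negatives
        if score(p, weights) < score(n, weights)
    )


def best_weights(
    candidates: List[Weights], positives: List[Sample], negatives: List[Sample]
) -> Weights:
    """Pick the weight combination with the fewest inversions."""
    return min(candidates, key=lambda w: inversions(w, positives, negatives))
```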
In addition, the text quality of a candidate text can also be determined as follows: when the tolerance of every selected quality feature of the current candidate text lies within a set range, the tolerances of all selected quality features of the current candidate text are multiplied together to obtain its text quality. The set range can be chosen as needed and is preferably (0, 1). That is, if the tolerance of every selected quality feature of the current candidate text lies in (0, 1), the tolerances of all its selected quality features can be multiplied together to obtain its text quality.
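The product-based alternative can be sketched as follows. Returning `None` when a tolerance falls outside the set range is a choice made for this illustration; the source does not say what happens in that case.

```python
from math import prod
from typing import Iterable, Optional


def product_quality(
    tolerances: Iterable[float], low: float = 0.0, high: float = 1.0
) -> Optional[float]:
    """Multiply the tolerances if every one lies strictly inside (low, high)."""
    values = list(tolerances)
    if not all(low < t < high for t in values):
        # Outside the set range: this sketch declines to compute a product.
        return None
    return prod(values)
```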
Specifically, selecting the candidate text with the highest text quality in S13 can consist of directly selecting the candidate text with the highest text quality among all candidate texts and providing it to the user.
Alternatively, selecting the candidate text with the highest text quality in S13 can consist of first determining an advantage group among the candidate texts and then selecting the candidate text with the highest text quality from the advantage group and providing it to the user. As shown in Fig. 2, this comprises:
S21: calculate the hash value of each candidate text.
To calculate the hash value of each candidate text, perform the following for each candidate text: split the current candidate text at set symbols to obtain the sentences it contains; calculate the hash value of each sentence; and combine the hash values of the sentences to obtain the hash value of the current candidate text.
The hash values of the candidate texts can be calculated one at a time, with the candidate text being processed taken as the current candidate text. The current candidate text usually contains Chinese characters, English letters, punctuation marks, and so on. Non-Chinese symbols such as English letters and punctuation can be chosen as the set symbols, and the current candidate text is split at the set symbols to obtain the sentences it contains. Preferably, the set symbols are removed.
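The splitting step can be sketched with a regular expression. This illustration assumes that "Chinese" means the CJK Unified Ideographs block U+4E00 to U+9FFF and treats every other character as a set symbol, which is a simplification; the separators are discarded, in line with the preferred mode.

```python
import re
from typing import List

# Any run of characters outside the basic CJK ideograph block is a separator.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def split_sentences(text: str) -> List[str]:
    """Split a text at non-Chinese characters and drop the separators."""
    return [part for part in _NON_CHINESE.split(text) if part]
```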
Next, the hash value of each sentence contained in the current candidate text is calculated; an existing algorithm can be used for this.
Finally, the hash values of the sentences contained in the current candidate text are combined to obtain the hash value of the current candidate text. Because the hash value of each sentence is a 64-bit binary number, a weight vector of 64 elements is generated first. The weight vector is then updated according to the hash value of each sentence: if the n-th bit of the sentence's hash value is 1, 1 is added to the n-th element of the weight vector; otherwise 1 is subtracted. Finally, the hash value of the current candidate text is derived from the weight vector: if the n-th element of the weight vector is greater than or equal to 0, the n-th bit of the hash value of the current candidate text is 1; otherwise it is 0.
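The bitwise voting described above can be sketched as follows. The source only says that an existing algorithm produces the 64-bit sentence hash; using `hashlib.blake2b` with an 8-byte digest is an assumption of this illustration, as are the function names.

```python
import hashlib
from typing import Iterable

BITS = 64


def sentence_hash(sentence: str) -> int:
    """64-bit hash of one sentence (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(sentence.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def combine_hashes(hashes: Iterable[int]) -> int:
    """Combine sentence hashes into one text hash by per-bit voting."""
    weights = [0] * BITS
    for h in hashes:
        for n in range(BITS):
            # A set bit votes +1 for position n, a clear bit votes -1.
            weights[n] += 1 if (h >> n) & 1 else -1
    result = 0
    for n, w in enumerate(weights):
        if w >= 0:
            result |= 1 << n
    return result


def text_hash(sentences: Iterable[str]) -> int:
    """Hash value of a candidate text given its sentences."""
    return combine_hashes(sentence_hash(s) for s in sentences)
```

Texts that share most of their sentences end up with hash values that differ in few bits, which is what the Hamming distance comparison in S22 relies on.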
S22: Determine the advantage group according to the hash value of each candidate text.
Because candidate texts vary in quality, the candidate texts can first be screened to obtain an advantage group, and the candidate text with the highest text quality is then selected from the advantage group. The advantage group is determined as follows: combine all candidate texts in pairs to obtain candidate text pairs; calculate the Hamming distance between the hash values of each candidate text pair; select the two candidate texts of the pair with the smallest Hamming distance as the reference texts; add the reference texts, together with every candidate text whose Hamming distance to the reference texts is less than a set threshold, to the advantage group.
The Hamming distance between hash values indicates how similar the two candidate texts of a pair are: the smaller the Hamming distance, the more similar the two texts, and the larger the distance, the less similar. The candidate text pair with the smallest Hamming distance can therefore be taken as the reference texts, and the candidate texts whose Hamming distance to the reference texts is less than the set threshold are added to the advantage group together with the reference texts.
The set threshold can be chosen according to actual needs.
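The determination of the advantage group can be sketched as follows. The text does not say whether a candidate must be close to both reference texts or to only one of them; this sketch assumes that being within the threshold of either reference text is enough. The function names and the index-based return value are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def advantage_group(hashes: Sequence[int], threshold: int) -> List[int]:
    """Return the indices of the candidate texts in the advantage group."""
    if len(hashes) < 2:
        return list(range(len(hashes)))
    # The pair with the smallest Hamming distance becomes the reference texts.
    ref_a, ref_b = min(
        combinations(range(len(hashes)), 2),
        key=lambda pair: hamming_distance(hashes[pair[0]], hashes[pair[1]]),
    )
    group = {ref_a, ref_b}
    for index, value in enumerate(hashes):
        if index in group:
            continue
        # Assumption: closeness to either reference text is sufficient.
        distance = min(
            hamming_distance(value, hashes[ref_a]),
            hamming_distance(value, hashes[ref_b]),
        )
        if distance < threshold:
            group.add(index)
    return sorted(group)
```

Comparing all pairs costs a number of distance computations that grows quadratically with the number of candidates, which is acceptable when only a handful of sources are crawled for one query.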
To improve the efficiency of text selection, the candidate texts in the advantage group can be filtered further according to the relationship between their text length and a first preset length and a second preset length: candidate texts in the advantage group whose text length is greater than the first preset length or less than the second preset length are filtered out. The detailed process is as follows. First, set a percentage threshold and calculate the average text length of the candidate texts in the advantage group; the first preset length can be set to average text length * (1 + percentage threshold) and the second preset length to average text length * (1 - percentage threshold), where the percentage threshold can be chosen according to actual needs. Then compare the text length of each candidate text in the advantage group with the first preset length and the second preset length, and filter out the candidate texts whose text length is greater than the first preset length or less than the second preset length.
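The length filter can be sketched as follows, assuming that text length is measured in characters; the function name `filter_by_length` is hypothetical.

```python
from typing import List


def filter_by_length(texts: List[str], percentage_threshold: float) -> List[str]:
    """Keep only the texts whose length is close to the average length."""
    if not texts:
        return []
    average = sum(len(text) for text in texts) / len(texts)
    first_preset = average * (1 + percentage_threshold)   # upper bound
    second_preset = average * (1 - percentage_threshold)  # lower bound
    # Drop texts longer than the first preset length or shorter than the second.
    return [text for text in texts if second_preset <= len(text) <= first_preset]
```

For example, with lengths 100, 110, 90 and 300 and a percentage threshold of 0.5, the average is 150, the bounds are 75 and 225, and only the text of length 300 is filtered out.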
S23: Select the candidate text with the highest text quality from the advantage group and provide it to the user.
Through S21-S23 above, the candidate text with the highest text quality can be selected and provided to the user.
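Step S23 reduces to taking a maximum over the advantage group. The sketch below assumes that the text quality of each candidate has already been computed and stored in a mapping keyed by a candidate identifier; the names `choose_best` and `quality` are hypothetical.

```python
from typing import Dict, Iterable


def choose_best(group: Iterable[str], quality: Dict[str, float]) -> str:
    """Return the identifier of the highest-quality candidate in the group."""
    return max(group, key=lambda candidate: quality[candidate])
```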
In the text selection method of this embodiment, candidate texts are not fed back to the user sorted in the order in which they were crawled. Instead, the text quality of each candidate text is calculated first, and the candidate text with the highest text quality is provided to the user. The user can obtain the best candidate text directly without browsing multiple candidate texts, which improves the user experience. Fig. 3 shows the method in a practical application scenario. Suppose the user enters the keyword "dominates greatly" in the search engine of a client running the Android operating system. The server sorts the candidate texts using the text selection method above and sends them to the client; the result received by the client is shown in the leftmost picture of Fig. 3. If the user selects the first candidate text, the client displays its content, as in the middle picture of Fig. 3. If the user wants to read another version, the source can be switched by choosing from the sources listed in the rightmost picture of Fig. 3. In this way, the user can easily find the best candidate text.
Based on the same inventive concept, an embodiment of the present invention provides a text selection device. The device can be arranged in a server and, as shown in Fig. 4, comprises:
A computing unit 41, configured to calculate the error rate of the selected quality features in each candidate text, and to calculate the tolerance of the corresponding selected quality features in each candidate text according to the error rate of the selected quality features in each candidate text;
A determining unit 42, configured to determine the text quality of the corresponding candidate text according to the tolerance of the selected quality features in each candidate text;
A selecting unit 43, configured to select the candidate text with the highest text quality and provide it to the user.
In this scheme, candidate texts are not fed back to the user sorted in the order in which they were crawled. Instead, the text quality of each candidate text is calculated first, and the candidate text with the highest text quality is provided to the user. The user can obtain the best candidate text directly without browsing multiple candidate texts, which improves the user experience.
Specifically, the computing unit 41 comprises a statistics subunit, an actual proportion calculation subunit and an error rate calculation subunit, wherein:
The statistics subunit is configured to perform, for each candidate text: counting the quantity of each selected quality feature in the current candidate text;
The actual proportion calculation subunit is configured to divide the counted quantity of each selected quality feature by the number of characters in the current candidate text, to obtain the actual proportion of each selected quality feature in the current candidate text;
The error rate calculation subunit is configured to calculate the error rate of the corresponding selected quality feature of the current candidate text according to the actual proportion of each selected quality feature in the current candidate text.
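The counting and division performed by the first two subunits can be sketched as follows. This section does not enumerate the selected quality features, so each feature is modeled here as a hypothetical per-character predicate; the function name and the example features are illustrative only.

```python
from typing import Callable, Dict


def actual_proportions(
    text: str, features: Dict[str, Callable[[str], bool]]
) -> Dict[str, float]:
    """Share of characters matching each feature (hypothetical predicates)."""
    total = len(text)
    if total == 0:
        # Assumption: an empty text has proportion 0 for every feature.
        return {name: 0.0 for name in features}
    return {
        name: sum(1 for ch in text if matches(ch)) / total
        for name, matches in features.items()
    }
```

For instance, `actual_proportions("a,b,cd", {"comma": lambda ch: ch == ","})` counts 2 commas among 6 characters and reports a proportion of one third for the hypothetical `comma` feature.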
Specifically, the error rate calculation subunit in the computing unit 41 is configured to:
Calculate the error rate of a selected quality feature by the following formula:
Error rate of the selected quality feature = maximum threshold of the error rate of the selected quality feature * (1 - actual proportion of the selected quality feature / standard proportion of the selected quality feature).
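The error rate formula can be written as a one-line function. This sketch reads the juxtaposition of the maximum threshold and the parenthesized term as a multiplication and assumes a non-zero standard proportion; the parameter names are hypothetical.

```python
def error_rate(actual: float, standard: float, max_threshold: float) -> float:
    """Error rate = max threshold * (1 - actual proportion / standard proportion)."""
    # Assumption: standard is non-zero; a zero value would raise ZeroDivisionError.
    return max_threshold * (1 - actual / standard)
```

With an actual proportion of 0.05, a standard proportion of 0.10 and a maximum threshold of 0.8, the error rate is 0.8 * (1 - 0.5) = 0.4. An actual proportion equal to the standard proportion gives an error rate of 0.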
Specifically, the computing unit 41 further comprises a tolerance calculation subunit, configured to:
Calculate the tolerance of the corresponding selected quality feature in each candidate text according to the error rate of the selected quality feature in each candidate text.
Specifically, the tolerance calculation subunit in the computing unit 41 is configured to:
Calculate the tolerance of a selected quality feature by the following formula:
Tolerance of the selected quality feature = 1 - (error rate of the selected quality feature) ^ (1 / (error rate of the selected quality feature + 1)).
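The tolerance formula can be sketched as follows. The text does not state how a negative error rate should be handled; because a negative base with a fractional exponent has no real value, this sketch rejects it, which is an assumption rather than part of the described method.

```python
def tolerance(error_rate: float) -> float:
    """Tolerance = 1 - error_rate ^ (1 / (error_rate + 1))."""
    if error_rate < 0:
        # Assumption: negative error rates are treated as invalid input.
        raise ValueError("error rate must be non-negative")
    return 1 - error_rate ** (1 / (error_rate + 1))
```

An error rate of 0 gives a tolerance of 1, an error rate of 1 gives a tolerance of 0, and the tolerance falls as the error rate rises between those two values.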
Specifically, the determining unit 42 comprises a weight determination subunit and a text quality calculation subunit, wherein:
The weight determination subunit is configured to determine the weight of each selected quality feature;
The text quality calculation subunit is configured to obtain the text quality of the corresponding candidate text by computing the weighted sum of the tolerances of the selected quality features in each candidate text, using the weight of each selected quality feature.
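The weighted sum can be sketched as follows. The feature names and weight values are hypothetical; this section does not say how the weights are chosen.

```python
from typing import Dict


def text_quality(tolerances: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the per-feature tolerances of one candidate text."""
    return sum(weights[name] * tolerances[name] for name in weights)
```

For example, tolerances of 1.0 and 0.5 with weights of 0.6 and 0.4 give a text quality of 0.6 * 1.0 + 0.4 * 0.5 = 0.8.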
Concrete, choose unit 43 and comprise cryptographic hash computation subunit, advantage group determination subelement and text selection subelement; Wherein,
Cryptographic hash computation subunit, for calculating the cryptographic hash of each candidate's text;
Advantage group determination subelement, for the cryptographic hash determination advantage group according to each candidate's text;
Text selection subelement, is supplied to user for the candidate's text choosing text quality the highest from advantage group.
Specifically, the hash value calculation subunit in the selecting unit 43 further comprises a splitting subunit, a sentence hash value calculation subunit and a text hash value calculation subunit, wherein:
The splitting subunit is configured to perform, for each candidate text: splitting the current candidate text at the set symbols to obtain the sentences that the current candidate text contains;
The sentence hash value calculation subunit is configured to calculate the hash value of each sentence that the current candidate text contains;
The text hash value calculation subunit is configured to combine the hash values of the sentences that the current candidate text contains, to obtain the hash value of the current candidate text.
Specifically, the advantage group determination subunit in the selecting unit 43 further comprises a combination subunit, a Hamming distance calculation subunit, a reference text selection subunit and an adding subunit, wherein:
The combination subunit is configured to combine all candidate texts in pairs to obtain candidate text pairs;
The Hamming distance calculation subunit is configured to calculate the Hamming distance between the hash values of each candidate text pair;
The reference text selection subunit is configured to select the two candidate texts of the candidate text pair with the smallest Hamming distance as the reference texts;
The adding subunit is configured to add the reference texts, together with the candidate texts whose Hamming distance to the reference texts is less than the set threshold, to the advantage group.
In the text selection device of this embodiment, candidate texts are not fed back to the user sorted in the order in which they were crawled. Instead, the text quality of each candidate text is calculated first, and the candidate text with the highest text quality is provided to the user. The user can obtain the best candidate text directly without browsing multiple candidate texts, which improves the user experience.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although optional embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they have learned the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the described embodiments and all changes and modifications that fall within the scope of the invention.
Obviously, those skilled in the art can make various changes and variations to the embodiments of the present invention without departing from the spirit and scope of those embodiments. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (17)

1. A text selection method, characterized by comprising:
Calculating the error rate of the selected quality features in each candidate text, and calculating the tolerance of the corresponding selected quality features in each candidate text according to the error rate of the selected quality features in each candidate text;
Determining the text quality of the corresponding candidate text according to the tolerance of the selected quality features in each candidate text;
Selecting the candidate text with the highest text quality and providing it to the user.
2. the method for claim 1, is characterized in that, calculates the error rate of selected qualitative character in each candidate's text, specifically comprises:
For each candidate's text, perform:
The quantity of each selected qualitative character of statistics current candidate text;
Respectively by the quantity of each selected qualitative character of statistics divided by the character quantity of described current candidate text, obtain the actual accounting rate of each selected qualitative character of described current candidate text;
The correspondence calculating described current candidate text according to the actual accounting rate of each selected qualitative character of described current candidate text selectes the error rate of qualitative character.
3. The method of claim 2, characterized in that calculating the error rate of the corresponding selected quality feature of said current candidate text according to the actual proportion of each selected quality feature in said current candidate text specifically comprises:
Calculating the error rate of a selected quality feature by the following formula:
Error rate of the selected quality feature = maximum threshold of the error rate of the selected quality feature * (1 - actual proportion of the selected quality feature / standard proportion of the selected quality feature).
4. The method of claim 2, characterized in that calculating the tolerance of the corresponding selected quality features in each candidate text according to the error rate of the selected quality features in each candidate text specifically comprises:
Calculating the tolerance of a selected quality feature by the following formula:
Tolerance of the selected quality feature = 1 - (error rate of the selected quality feature) ^ (1 / (error rate of the selected quality feature + 1)).
5. the method for claim 1, is characterized in that, determines the text quality of corresponding candidate's text, specifically comprise according to the tolerance of qualitative character selected in each candidate's text:
Determine the weight of each selected qualitative character;
The text quality of corresponding candidate's text is obtained according to the tolerance weighted sum of weight to qualitative character selected in each candidate's text of each selected qualitative character.
6. The method according to any one of claims 1-5, characterized in that selecting the candidate text with the highest text quality and providing it to the user specifically comprises:
Calculating the hash value of each candidate text;
Determining an advantage group according to the hash value of each candidate text;
Selecting the candidate text with the highest text quality from said advantage group and providing it to the user.
7. The method of claim 6, characterized in that calculating the hash value of each candidate text specifically comprises:
For each candidate text, performing:
Splitting said current candidate text at the set symbols to obtain the sentences that said current candidate text contains;
Calculating the hash value of each sentence that said current candidate text contains;
Combining the hash values of the sentences that said current candidate text contains, to obtain the hash value of said current candidate text.
8. The method of claim 6, characterized in that determining the advantage group according to the hash value of each candidate text specifically comprises:
Combining all candidate texts in pairs to obtain candidate text pairs;
Calculating the Hamming distance between the hash values of each candidate text pair;
Selecting the two candidate texts of the candidate text pair with the smallest Hamming distance as the reference texts;
Adding said reference texts, together with the candidate texts whose Hamming distance to said reference texts is less than a set threshold, to said advantage group.
9. A text selection device, characterized by comprising:
A computing unit, configured to calculate the error rate of the selected quality features in each candidate text, and to calculate the tolerance of the corresponding selected quality features in each candidate text according to the error rate of the selected quality features in each candidate text;
A determining unit, configured to determine the text quality of the corresponding candidate text according to the tolerance of the selected quality features in each candidate text;
A selecting unit, configured to select the candidate text with the highest text quality and provide it to the user.
10. The device of claim 9, characterized in that said computing unit specifically comprises a statistics subunit, an actual proportion calculation subunit and an error rate calculation subunit, wherein:
Said statistics subunit is configured to perform, for each candidate text: counting the quantity of each selected quality feature in the current candidate text;
Said actual proportion calculation subunit is configured to divide the counted quantity of each selected quality feature by the number of characters in said current candidate text, to obtain the actual proportion of each selected quality feature in said current candidate text;
Said error rate calculation subunit is configured to calculate the error rate of the corresponding selected quality feature of said current candidate text according to the actual proportion of each selected quality feature in said current candidate text.
11. The device of claim 10, characterized in that said error rate calculation subunit is specifically configured to:
Calculate the error rate of a selected quality feature by the following formula:
Error rate of the selected quality feature = maximum threshold of the error rate of the selected quality feature * (1 - actual proportion of the selected quality feature / standard proportion of the selected quality feature).
12. The device of claim 10, characterized in that said computing unit further comprises a tolerance calculation subunit, configured to:
Calculate the tolerance of the corresponding selected quality feature in each candidate text according to the error rate of the selected quality feature in each candidate text.
13. The device of claim 12, characterized in that said tolerance calculation subunit is specifically configured to:
Calculate the tolerance of a selected quality feature by the following formula:
Tolerance of the selected quality feature = 1 - (error rate of the selected quality feature) ^ (1 / (error rate of the selected quality feature + 1)).
14. The device of claim 9, characterized in that said determining unit specifically comprises a weight determination subunit and a text quality calculation subunit, wherein:
Said weight determination subunit is configured to determine the weight of each selected quality feature;
Said text quality calculation subunit is configured to obtain the text quality of the corresponding candidate text by computing the weighted sum of the tolerances of the selected quality features in each candidate text, using the weight of each selected quality feature.
15. The device according to any one of claims 9-14, characterized in that said selecting unit specifically comprises a hash value calculation subunit, an advantage group determination subunit and a text selection subunit, wherein:
Said hash value calculation subunit is configured to calculate the hash value of each candidate text;
Said advantage group determination subunit is configured to determine the advantage group according to the hash value of each candidate text;
Said text selection subunit is configured to select the candidate text with the highest text quality from said advantage group and provide it to the user.
16. The device of claim 15, characterized in that said hash value calculation subunit further comprises a splitting subunit, a sentence hash value calculation subunit and a text hash value calculation subunit, wherein:
Said splitting subunit is configured to perform, for each candidate text: splitting said current candidate text at the set symbols to obtain the sentences that said current candidate text contains;
Said sentence hash value calculation subunit is configured to calculate the hash value of each sentence that said current candidate text contains;
Said text hash value calculation subunit is configured to combine the hash values of the sentences that said current candidate text contains, to obtain the hash value of said current candidate text.
17. The device of claim 15, characterized in that said advantage group determination subunit specifically comprises a combination subunit, a Hamming distance calculation subunit, a reference text selection subunit and an adding subunit, wherein:
Said combination subunit is configured to combine all candidate texts in pairs to obtain candidate text pairs;
Said Hamming distance calculation subunit is configured to calculate the Hamming distance between the hash values of each candidate text pair;
Said reference text selection subunit is configured to select the two candidate texts of the candidate text pair with the smallest Hamming distance as the reference texts;
Said adding subunit is configured to add said reference texts, together with the candidate texts whose Hamming distance to said reference texts is less than a set threshold, to said advantage group.
CN201510030778.1A 2015-01-21 2015-01-21 Text selection method and device Active CN104615681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510030778.1A CN104615681B (en) 2015-01-21 2015-01-21 Text selection method and device

Publications (2)

Publication Number Publication Date
CN104615681A (en) 2015-05-13
CN104615681B (en) 2019-04-02

Family

ID=53150123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510030778.1A Active CN104615681B (en) 2015-01-21 2015-01-21 Text selection method and device

Country Status (1)

Country Link
CN (1) CN104615681B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107369058A (en) * 2016-05-13 2017-11-21 华为技术有限公司 A kind of correlation recommendation method and server
CN107992548A (en) * 2017-11-27 2018-05-04 网易传媒科技(北京)有限公司 Information processing method, system, medium and computing device
CN113503897A (en) * 2021-07-08 2021-10-15 广州小鹏自动驾驶科技有限公司 Parking map quality inspection method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831198A (en) * 2012-08-07 2012-12-19 人民搜索网络股份公司 Similar document identifying device and similar document identifying method based on document signature technology
CN103699521A (en) * 2012-09-27 2014-04-02 腾讯科技(深圳)有限公司 Text analysis method and device
CN103744964A (en) * 2014-01-06 2014-04-23 同济大学 Webpage classification method based on locality sensitive Hash function
CN104063399A (en) * 2013-03-22 2014-09-24 杭州金弩信息技术有限公司 Method and system for automatically identifying emotional probability borne by texts
CN104239753A (en) * 2014-07-03 2014-12-24 东华大学 Tamper detection method for text documents in cloud storage environment

Also Published As

Publication number Publication date
CN104615681B (en) 2019-04-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200811

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 12 layer self unit 01

Patentee before: GUANGZHOU SHENMA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.