CN107577667B

CN107577667B - Entity word processing method and device

Info

Publication number: CN107577667B
Application number: CN201710828725.3A
Authority: CN
Inventors: 王天畅
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2017-09-14
Filing date: 2017-09-14
Publication date: 2020-10-27
Anticipated expiration: 2037-09-14
Also published as: CN107577667A

Abstract

The invention relates to a method and a device for processing entity words, and belongs to the technical field of text processing. In the entity word processing method and device provided by the embodiments of the invention, the query fields in a plurality of first logs are used as candidate texts; at least one target text is determined among the candidate texts; the associated click link corresponding to each target text is determined; associated texts that are semantically similar to the target text and have the same search intention are determined according to the associated click link; and finally the entity words are determined according to the target text and the associated texts. Because the query fields of logs often contain new entity words, new entity words can be determined in this way. This solves the prior-art problem that new entity words cannot be determined, which leaves the word segmentation module with a low recognition rate for new entity words and reduces video search precision.

Description

Entity word processing method and device

Technical Field

The invention relates to the technical field of text processing, in particular to a method and a device for processing entity words.

Background

Currently, when searching for videos according to search content entered by a user, a word segmentation module is generally used to segment the search content, and each resulting participle is then used as a keyword to search a video library for related videos as the search result. In practice, the search content entered by the user typically includes entity words, which are usually fixed collocations, phrases, and the like. Whether the word segmentation module identifies entity words accurately determines the accuracy of the video search result. To enable the word segmentation module to recognize entity words, the entity words are usually determined first and then input into the word segmentation module, so that the module can recognize them accurately.

In the prior art, entity words are usually determined by training a conditional random field (CRF) model on a corpus.

However, determining entity words through model training, as in the prior art, generally requires a large-scale training corpus, and an entity word can be recognized only when it occurs in the training corpus a sufficient number of times. The training corpus often does not cover new entity words, and even when it does, they do not occur often enough to meet the training requirements. As a result, new entity words cannot be determined, and the precision of the video search results is reduced.

Disclosure of Invention

In view of the above, the present invention has been made to provide a method and apparatus for processing entity words that overcome or at least partially solve the above problems.

According to a first aspect of the present invention, there is provided a method for processing entity words, the method including:

for a plurality of first logs within a preset first time period, extracting the query field of each first log as a candidate text to obtain a plurality of candidate texts;

screening the candidate texts to obtain at least one target text;

taking the click link of each first log whose query field is the target text as an associated click link, to obtain at least one associated click link corresponding to the at least one target text; the associated click link is a link clicked when a user queries with the target text as the query content;

for a plurality of second logs within a preset second time period, determining the query field of each second log that contains the associated click link as an associated text of the target text, to obtain at least one associated text; the preset second time period comprises the preset first time period, and the second logs comprise the first logs;

and determining entity words according to the at least one target text and the at least one associated text.

Optionally, the step of filtering the candidate texts to obtain at least one target text includes:

removing candidate texts with the occurrence times smaller than a preset search time threshold value from the plurality of candidate texts to obtain at least one first candidate text;

performing word segmentation processing on each first candidate text in the at least one first candidate text, and counting the number of words segmented corresponding to each first candidate text;

removing the first candidate text with the corresponding word segmentation number not more than 1 to obtain at least one second candidate text;

and matching each second candidate text by using a preset format template, and taking the second candidate text which is not matched with the preset format template as a target text to obtain at least one target text.

Optionally, the step of determining an entity word according to the at least one target text and the at least one associated text includes:

for each of the at least one target text, performing the following:

performing word segmentation processing on the target text to obtain a plurality of corresponding target participles, and combining every two adjacent target participles in the plurality of target participles to obtain a plurality of target word pairs;

performing word segmentation processing on the plurality of associated texts corresponding to the target text to obtain a plurality of associated participles;

for each target word pair in the plurality of target word pairs, counting the frequency with which each target participle in the target word pair occurs among the plurality of associated participles;

calculating the entropy value of the target word pair according to the frequency of each target participle in the target word pair;

and determining the entity words according to the entropy values of a plurality of target word pairs corresponding to the target text.

Optionally, the step of calculating the entropy value of the target word pair according to the frequency of each target participle in the target word pair includes:

dividing the frequency of the first target participle by the sum of the frequency of the first target participle and the frequency of the second target participle to obtain a first entropy parameter;

dividing the frequency of the second target participle by the sum of the frequency of the first target participle and the frequency of the second target participle to obtain a second entropy parameter;

and substituting the first entropy parameter and the second entropy parameter into a preset entropy calculation formula to obtain the entropy value of the target word pair.
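The computation described above can be sketched as follows. This is only an illustrative sketch: the preset entropy calculation formula is not specified at this point in the text, so binary Shannon entropy with a base-2 logarithm is assumed, and the function name `pair_entropy` is hypothetical.

```python
import math


def pair_entropy(freq_first: int, freq_second: int) -> float:
    """Entropy value of a target word pair from the frequencies of its two participles.

    Assumption: the preset entropy formula is binary Shannon entropy (base 2).
    """
    total = freq_first + freq_second
    if total == 0:
        return 0.0  # neither participle occurs among the associated participles
    # First and second entropy parameters, as described in the text.
    p1 = freq_first / total
    p2 = freq_second / total
    # A parameter equal to 0 contributes nothing to the sum.
    return -sum(p * math.log2(p) for p in (p1, p2) if p > 0)
```

Under this assumed formula, two participles that occur equally often give the maximum value of 1, and the value falls as the two frequencies diverge.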

Optionally, the step of determining the entity word according to the entropy values of the target word pairs corresponding to the target text includes:

when the number of target word pairs with entropy values larger than a preset entropy threshold value in a plurality of target word pairs corresponding to the target text is equal to 1, determining the target word pairs with entropy values larger than the preset entropy threshold value as entity words;

when the number of target word pairs with entropy values larger than a preset entropy threshold value in a plurality of target word pairs corresponding to a target text is larger than 1, determining whether overlapped participles exist between the target word pairs with entropy values larger than the preset entropy threshold value;

if the target word pairs with the entropy values larger than the preset entropy threshold value have overlapped participles, combining the target word pairs with the overlapped participles into entity words;

and if no overlapped participles exist between the target word pairs with the entropy values larger than the preset entropy threshold, respectively determining the target word pairs with the entropy values larger than the preset entropy threshold as entity words.
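The selection and merging rules above can be sketched as follows. The sketch assumes that the target word pairs are listed in text order, so that two pairs overlap exactly when they are consecutive in the list, and that a chain of overlapping pairs is merged into a single entity word; the function name `merge_entity_words` and the `sep` parameter are hypothetical.

```python
def merge_entity_words(pairs, entropies, threshold, sep=""):
    """Turn target word pairs whose entropy exceeds `threshold` into entity words.

    pairs:     list of (left, right) adjacent participles, in text order
    entropies: entropy value of each pair, in the same order
    sep:       string placed between participles when joining (assumed empty)
    """
    kept = [i for i, h in enumerate(entropies) if h > threshold]
    groups = []  # each item: (index of last pair in the group, participles so far)
    for i in kept:
        left, right = pairs[i]
        if groups and groups[-1][0] == i - 1:
            # Pair i begins with the participle that ends pair i-1: extend the group.
            groups[-1] = (i, groups[-1][1] + [right])
        else:
            groups.append((i, [left, right]))
    return [sep.join(words) for _, words in groups]
```

With a single pair above the threshold the function returns that pair alone, which corresponds to the first case described above.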

According to a second aspect of the present invention, there is provided an entity word processing apparatus, the apparatus including:

the first extraction module is used for extracting a query field in a plurality of first logs in a preset first time period as candidate texts to obtain a plurality of candidate texts;

the screening module is used for screening the candidate texts to obtain at least one target text;

the second extraction module is used for taking the click link of the first log with the query field as the target text as an associated click link to obtain at least one associated click link corresponding to the at least one target text; the associated click link is a link clicked when a user queries by taking the target text as query content;

a first determining module, configured to determine, for multiple second logs within a preset second time period, a query field corresponding to the second log including the associated click link as an associated text of the target text, so as to obtain at least one associated text; the preset second time period comprises the preset first time period, and the second logs comprise the first logs;

and the second determining module is used for determining the entity words according to the at least one target text and the at least one associated text.

Optionally, the screening module includes:

the first removing submodule is used for removing the candidate texts of which the occurrence times are smaller than a preset search time threshold value from the plurality of candidate texts to obtain at least one first candidate text;

the statistic submodule is used for performing word segmentation processing on each first candidate text in the at least one first candidate text and counting the number of words segmented corresponding to each first candidate text;

the second removing submodule is used for removing the first candidate text of which the corresponding word segmentation number is not more than 1 to obtain at least one second candidate text;

and the matching sub-module is used for matching each second candidate text by using a preset format template, and taking the second candidate text which is not matched with the preset format template as a target text to obtain at least one target text.

Optionally, the second determining module includes:

the combining submodule is used for performing word segmentation processing on the target text to obtain a plurality of corresponding target participles, and combining every two adjacent target participles in the plurality of target participles to obtain a plurality of target word pairs;

the word segmentation sub-module is used for performing word segmentation processing on the plurality of associated texts corresponding to the target text to obtain a plurality of associated participles;

the statistic submodule is used for counting, for each target word pair in the plurality of target word pairs, the frequency with which each target participle in the target word pair occurs among the plurality of associated participles;

the calculation submodule is used for calculating the entropy value of the target word pair according to the frequency of each target participle in the target word pair;

and the determining submodule is used for determining the entity words according to the entropy values of a plurality of target word pairs corresponding to the target text.

Optionally, the calculation sub-module is configured to:

Optionally, the determining sub-module is configured to:

Compared with the prior art, the invention has the following advantages:

In the entity word processing method and device provided by the embodiments of the invention, the query fields in a plurality of first logs are used as candidate texts; at least one target text is determined among the candidate texts; the associated click link corresponding to each target text is determined; associated texts that are semantically similar to the target text and have the same search intention are determined according to the associated click link; and finally the entity words are determined according to the target text and the associated texts. Because the query fields of logs often contain new entity words, new entity words can be determined in this way. This solves the prior-art problem that new entity words cannot be determined, which leaves the word segmentation module with a low recognition rate for new entity words and reduces video search precision.

The foregoing description is only an overview of the technical solutions of the present invention. To make the technical means of the present invention clearer, and to make the above and other objects, features, and advantages of the present invention easier to understand, embodiments of the present invention are described below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 is a flowchart of an entity word processing method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of another entity word processing method according to a second embodiment of the present invention;

Fig. 3 is a block diagram of an entity word processing apparatus according to a third embodiment of the present invention;

Fig. 4 is a block diagram of another entity word processing apparatus according to a fourth embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Example one

Fig. 1 is a flowchart of an entity word processing method according to an embodiment of the present invention, and as shown in fig. 1, the method may include:

Step 101, extracting query fields in a plurality of first logs in a preset first time period as candidate texts to obtain a plurality of candidate texts.

In the embodiment of the present invention, the preset first time period may be selected by a developer according to actual requirements. Preferably, the preset first time period may be one day. For example, July 27 may be used as the preset first time period, and the click logs generated on July 27 may be taken as the first logs. The number of first logs is determined by the number of logs generated within the preset first time period; the specific number is not limited in the embodiment of the present invention.

Taking a video search as an example, suppose that on July 27, 300 people each perform one video search through the video platform and each click once on the search results. Since each search operation generates one search log and each click operation generates one click log, 600 logs are generated on July 27: 300 search logs and 300 corresponding click logs. The 300 click logs may then be taken as the first logs. The click logs among the 600 logs can be identified by the log type tag of each log, which is either a search log tag or a click log tag. For example, assuming that the search log tag is denoted as label 1 and the click log tag as label 2, the logs containing label 2 among the 600 logs may be determined to be the click logs.
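Selecting the click logs by their type tag could look like the sketch below. The dictionary layout and the field name `type_tag` are assumptions made for illustration; only the two tag values come from the example above.

```python
SEARCH_LOG_TAG = 1  # "label 1" in the example above
CLICK_LOG_TAG = 2   # "label 2" in the example above


def select_click_logs(logs):
    """Return the logs whose type tag marks them as click logs."""
    return [log for log in logs if log["type_tag"] == CLICK_LOG_TAG]
```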

Preferably, the click log may include at least a query field, a click link, and a click identification. The query field represents the search content of the user, the click link identifies the object the user clicked, and the click identification indicates the search log corresponding to the click log. For example, assuming that the user performs a video search with "today's weather is really good" as the search content, the search operation generates a search log. The search log may include a query field with the content "today's weather is really good" and a search identifier, assumed here to be aa. The video platform returns video 1, video 2, and video 3 as search results for the user's search operation. Each video corresponds to a click link; assume that the click links of the three videos are link 1, link 2, and link 3, respectively. When the user clicks a video, the generated click log contains the click link of that video. Assuming that the user clicks video 2 to watch it, the click log generated by the click operation may include a query field with the content "today's weather is really good", a click link with the content "link 2", and a click identification aa, which corresponds to the search identifier aa in the search log.
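The relationship between the two kinds of log can be sketched with two small record types. The class names, attribute names, and the helper `find_search_log` are hypothetical; the text only states which pieces of information each log may contain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchLog:
    query: str       # query field: the search content entered by the user
    search_id: str   # search identifier


@dataclass(frozen=True)
class ClickLog:
    query: str       # query field, consistent with the search content
    click_link: str  # link of the object the user clicked
    click_id: str    # click identification, equal to the search identifier


def find_search_log(click, search_logs):
    """Return the search log that the click log refers to, or None if there is none."""
    return next((s for s in search_logs if s.search_id == click.click_id), None)
```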

Since the new entity word is often included in the search content, and the query field in the click log is consistent with the search content, the query field in the first log can be extracted as a candidate text for subsequent entity word determination. It should be noted that, in practical applications, a plurality of search logs within a preset first time period may also be used as the first log, and a query field of the search logs may be used as a candidate text, which is not limited in the embodiment of the present invention.

Specifically, when this step is implemented, a plurality of first logs within the preset first time period may be obtained first. Logs are generally stored on corresponding servers; taking video search as an example, the embodiment of the present invention may acquire the click logs generated on July 27 that are stored on a video server as the first logs. The query field of each of the first logs is then extracted as a candidate text, yielding a plurality of candidate texts.
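A minimal sketch of this extraction, assuming each click log is a dictionary with hypothetical `query` and `timestamp` fields and that the preset first time period is a single calendar day. Duplicates are kept on purpose, because a later step counts how often each candidate text occurs.

```python
from datetime import date, datetime


def candidate_texts(click_logs, day):
    """Query fields of the click logs generated on `day`, duplicates included."""
    return [log["query"] for log in click_logs if log["timestamp"].date() == day]
```

For example, `candidate_texts(logs, date(2017, 7, 27))` would return the query fields of all click logs stamped on that day; the year is illustrative only, since the text gives just the month and day.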

Step 102, screening the candidate texts to obtain at least one target text.

In this step, the candidate texts may be screened: candidate texts with a low probability of containing a new entity word are removed, and target texts with a high probability of containing a new entity word are obtained.

Step 103, extracting the click link of the first log with the query field as the target text as an associated click link, and obtaining at least one associated click link corresponding to the at least one target text.

The associated click link is the link corresponding to the object the user clicked when searching with the target text as the search content. For example, assume that there are two target texts: target text 1, "today's weather is really good", and target text 2, "spring breeze is not as good as you". Assuming that the click link included in the first log whose query field is "today's weather is really good" is "link 2", and the click link included in the first log whose query field is "spring breeze is not as good as you" is "link 3", then "link 2" may be determined as the associated click link corresponding to target text 1 and "link 3" as the associated click link corresponding to target text 2, resulting in two associated click links.
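Collecting the associated click links can be sketched as a single pass over the first logs. The dictionary fields `query` and `click_link` and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def associated_click_links(first_logs, target_texts):
    """Map each target text to the set of links clicked for exactly that query."""
    targets = set(target_texts)
    links = defaultdict(set)
    for log in first_logs:
        if log["query"] in targets:
            links[log["query"]].add(log["click_link"])
    return dict(links)
```

A set is used per target text because the same query may lead to clicks on several different links.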

Step 104, determining, for a plurality of second logs within a preset second time period, the query field corresponding to each second log containing the associated click link as an associated text of the target text, to obtain at least one associated text.

The preset second time period may include the preset first time period, and the plurality of second logs may include the plurality of first logs. In the embodiment of the present invention, the preset second time period may be selected by a developer according to actual requirements. Preferably, the preset second time period may be one week. For example, the preset second time period may be from July 24 to July 30, and the click logs generated from July 24 to July 30 may be taken as the second logs. The number of second logs is determined by the number of logs generated within the preset second time period.

In practical applications, if a second log contains an associated click link, the object the user clicked when searching with the query field of that second log is the same as the object clicked when searching with the target text. The query field of a second log containing the associated click link and the target text may therefore be considered semantically similar and to have the same search intention. Suppose there are 100 second logs containing "link 2", of which 20 have the query field "today's weather is really not bad", 10 have the query field "today's weather is truly good", and 70 have the query field "today's weather is good". It may then be determined that target text 1 has three associated texts: "today's weather is really not bad", "today's weather is truly good", and "today's weather is good".

It should be noted that if the query field of every second log containing the associated click link is identical to the target text, no associated text of the target text can be determined from the second logs. In that case the target text may be considered to contain no entity word, and the subsequent steps of determining entity words may be skipped for that target text to save processing cost.
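The lookup of associated texts, including the case in which none is found, can be sketched as follows. The dictionary fields and the function name are hypothetical, and the query strings in the usage note are placeholders.

```python
def associated_texts(second_logs, target_text, links):
    """Query fields of second logs that clicked one of `links`.

    Queries identical to the target text are skipped; duplicates are kept so
    that later steps can count how often each associated text occurs.
    """
    return [
        log["query"]
        for log in second_logs
        if log["click_link"] in links and log["query"] != target_text
    ]
```

If the returned list is empty, the target text can be skipped, as described above. With 20, 10, and 70 second logs carrying three different query variants for the same link, the list has 100 entries covering three distinct associated texts.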

Step 105, determining entity words according to the at least one target text and the at least one associated text.

Since the target text and the associated texts have similar semantics and the same search intention, the entity words may be determined from the target text and the associated texts. For example, assuming that target text 1 contains a new entity word, the current word segmentation module cannot recognize it and may split it into two or more participles. Assume that the new entity word is split into two participles, t1 and t2. Since the associated texts of target text 1 have similar semantics and the same intention as target text 1, they have a high probability of containing both t1 and t2, and t1 and t2 usually occur together in the associated texts. In the embodiment of the invention, the occurrences of t1 and t2 in the associated texts can be counted, and the entity words can then be determined.

In summary, in the entity word processing method provided in the first embodiment of the present invention, the query fields in a plurality of first logs may be used as candidate texts; at least one target text is determined among the candidate texts; the associated click link corresponding to each target text is determined; associated texts that are semantically similar to the target text and have the same search intention are determined according to the associated click link; and finally the entity words are determined according to the target text and the associated texts. Because the query fields of logs often contain new entity words, new entity words can be determined in this way. This solves the prior-art problem that new entity words cannot be determined, which leaves the word segmentation module with a low recognition rate for new entity words and reduces video search precision.

Example two

Fig. 2 is a flowchart of another entity word processing method according to a second embodiment of the present invention, and as shown in fig. 2, the method may include:

Step 201, extracting query fields in a plurality of first logs in a preset first time period as candidate texts to obtain a plurality of candidate texts.

Specifically, the implementation manner of this step may refer to step 101 described above, and details of the embodiment of the present invention are not described herein.

Step 202, removing candidate texts with the occurrence frequency smaller than a preset search frequency threshold value from the plurality of candidate texts to obtain at least one first candidate text.

In practical applications, the more times a search content is searched, the higher the probability that it contains an entity word, and vice versa. Therefore, a preset search time threshold may be set in this step, and candidate texts whose number of occurrences is smaller than the threshold are removed. The preset search time threshold can be determined experimentally. For example, the candidate texts may be filtered with thresholds of different sizes, and an appropriate threshold may be selected with reference to the texts that remain after filtering. Since different users may use the same search content, the plurality of candidate texts may contain texts with identical content; that is, each candidate text may appear multiple times. For example, assume that there are 300 candidate texts consisting of text 1, text 2, text 3, and text 4, where text 1 occurs 50 times, text 2 occurs 90 times, text 3 occurs 10 times, and text 4 occurs 150 times, and that the preset search time threshold is 20. Text 3, which occurs fewer than 20 times, is removed, and the remaining text 1, text 2, and text 4 are used as first candidate texts, resulting in three first candidate texts.
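The occurrence filter can be sketched with a counter over the candidate texts. The function and parameter names are hypothetical; the numbers in the usage note are those of the example above.

```python
from collections import Counter


def first_candidates(candidate_texts, min_count):
    """Keep each distinct candidate text that occurs at least `min_count` times."""
    counts = Counter(candidate_texts)
    return [text for text, n in counts.items() if n >= min_count]
```

With 50 copies of text 1, 90 of text 2, 10 of text 3, and 150 of text 4, and a threshold of 20, only text 3 is dropped.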

Step 203, performing word segmentation processing on each first candidate text in the at least one first candidate text, and counting the number of words segmented corresponding to each first candidate text.

In this step, the first candidate text may be segmented by traversing a common word segmentation library, for example a common dictionary, word by word: each word in the library is matched against the first candidate text in order, and if the match succeeds, the current word is determined to be a participle of the first candidate text. The process is repeated until every word in the library has been matched once, which determines the participles of the first candidate text. After word segmentation has been completed for each first candidate text, the number of participles corresponding to each first candidate text may be counted.
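The dictionary traversal described above can be sketched as below. This is a deliberate simplification and not a production segmenter: it reports every library word found in the text, ordered by first position, and does not resolve overlapping matches. The function name `segment` and the sample lexicon are hypothetical.

```python
def segment(text, lexicon):
    """Return the lexicon words that occur in `text`, ordered by first position."""
    found = [(text.find(word), word) for word in lexicon if word and word in text]
    return [word for _, word in sorted(found)]


def participle_count(text, lexicon):
    """Number of participles obtained for `text`."""
    return len(segment(text, lexicon))
```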

Step 204, removing the first candidate text with the corresponding word segmentation number not more than 1 to obtain at least one second candidate text.

Since entity words are generally fixed collocations, phrases, and the like, an entity word is unlikely to consist of a single participle. In this step, the first candidate texts whose number of participles is not greater than 1 may therefore be removed. For example, assume that word segmentation yields 1 participle for text 1, 3 participles for text 2, and 5 participles for text 4. Text 1 may be removed, and the remaining text 2 and text 4 may be used as second candidate texts, resulting in two second candidate texts.
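This filter can be sketched independently of the segmenter that is used. The function name and the `segment` parameter are hypothetical; in the usage note, whitespace splitting merely stands in for a real word segmentation module.

```python
def second_candidates(first_candidate_texts, segment):
    """Keep the texts that `segment` splits into more than one participle."""
    return [text for text in first_candidate_texts if len(segment(text)) > 1]
```

For instance, `second_candidates(["a", "b c d", "e f g h i"], str.split)` keeps the second and third texts, which have 3 and 5 participles, and drops the first, which has only 1.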

Step 205, matching each second candidate text by using a preset format template, and taking the second candidate text not matched with the preset format template as a target text to obtain at least one target text.

For example, the preset format template may be "other noun (nz) + Xth season/part/period". Text matching the preset format template usually represents the album information of a video, such as "happy book first season", and is not treated as an entity word. A second candidate text that does not match the template may be used as a target text. For example, assuming that text 2 matches the template and text 4 does not, text 2 may be removed and text 4 may be used as the target text.
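Template matching of this kind could be approximated with a regular expression. The pattern below is a hypothetical English-language stand-in for the template: it only checks for a trailing ordinal followed by "season", "part", or "period", and it omits the part-of-speech check on the leading noun.

```python
import re

# Hypothetical template: some title, then an ordinal, then season/part/period.
ALBUM_TEMPLATE = re.compile(
    r"^.+\s(?:first|second|third|\d+(?:st|nd|rd|th))\s(?:season|part|period)$",
    re.IGNORECASE,
)


def target_texts(second_candidate_texts):
    """Keep the second candidate texts that do NOT match the album template."""
    return [t for t in second_candidate_texts if not ALBUM_TEMPLATE.match(t)]
```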

Step 206, extracting the click link of the first log with the query field as the target text as an associated click link, and obtaining at least one associated click link corresponding to the at least one target text.

Specifically, the implementation manner of this step may refer to step 103, which is not described herein again in this embodiment of the present invention.

Step 207, determining, for a plurality of second logs within a preset second time period, a query field corresponding to the second log containing the associated click link as an associated text of the target text, so as to obtain at least one associated text.

Specifically, the implementation manner of this step may refer to step 104, which is not described herein again in this embodiment of the present invention.

Step 208, determining entity words according to the at least one target text and the at least one associated text.

Optionally, in this step, for each target text in the at least one target text, the following processing may be performed:

Step 2081, performing word segmentation processing on the target text to obtain a plurality of corresponding target participles, and combining every two adjacent target participles in the plurality of target participles to obtain a plurality of target word pairs.

For example, taking target text 1 as an example, assuming that target participles 1, 2, 3, and 4 are obtained after word segmentation processing is performed on target text 1, target word pair 1 (target participle 1, 2), target word pair 2 (target participle 2, 3), and target word pair 3 (target participle 3, 4) may be obtained by combining.
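Forming the target word pairs amounts to pairing each participle with its right-hand neighbour, as in this small sketch; the function name is hypothetical.

```python
def target_word_pairs(participles):
    """Combine every two adjacent participles into a pair, in text order."""
    return list(zip(participles, participles[1:]))
```

Four participles therefore give three pairs, and a text with a single participle gives none.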

Step 2082, performing word segmentation processing on the multiple associated texts corresponding to the target text to obtain multiple associated participles.

For example, assume that the associated texts corresponding to target text 1 are associated text 1, which occurs 50 times, and associated text 2, which occurs 70 times. Word segmentation of associated text 1 yields associated participles 1, 2, and 3, and word segmentation of associated text 2 yields associated participles 4, 5, and 6. Associated participles 1, 2, and 3 therefore each occur 50 times, and associated participles 4, 5, and 6 each occur 70 times.

Step 2083, for each target word pair in the plurality of target word pairs, counting the frequency of occurrence of each target participle in the target word pair in the plurality of associated participles.

For example, assuming that target participle 1 corresponds to associated participle 2, since the associated participles include 50 occurrences of associated participle 2, the occurrence frequency of target participle 1 may be determined to be 50; assuming that target participle 2 corresponds to associated participle 3, since the associated participles include 50 occurrences of associated participle 3, the occurrence frequency of target participle 2 may be determined to be 50; assuming that target participle 3 corresponds to associated participle 5, since the associated participles include 70 occurrences of associated participle 5, the occurrence frequency of target participle 3 may be determined to be 70; and assuming that target participle 4 corresponds to associated participle 6, since the associated participles include 70 occurrences of associated participle 6, the occurrence frequency of target participle 4 may be determined to be 70.
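Steps 2082 and 2083 can be sketched as a weighted count. This is a minimal sketch that assumes the associated texts are given as a mapping from text to its number of occurrences and that a `segment` callable performs word segmentation; both are assumptions of this example, and `str.split` is only a whitespace stand-in for a real word segmentation module.

```python
from collections import Counter
from typing import Callable, List, Mapping


def participle_frequencies(associated: Mapping[str, int],
                           segment: Callable[[str], List[str]]) -> Counter:
    """Count associated participles, weighting each by how often its text occurs."""
    freq: Counter = Counter()
    for text, count in associated.items():
        for participle in segment(text):
            freq[participle] += count
    return freq


# Placeholder texts: one occurs 50 times, the other 70 times.
freq = participle_frequencies({"a1 a2 a3": 50, "a4 a5 a6": 70}, str.split)
```

The occurrence frequency of a target participle is then a lookup such as `freq["a2"]`; a `Counter` returns 0 for a participle that never occurs in the associated texts.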

Step 2084, calculating the entropy value of the target word pair according to the frequency of each target participle in the target word pair.

As can be seen from the above steps, the frequencies of the target participle 1 and the target participle 2 in the target word pair 1 are 50 and 50, the frequencies of the target participle 2 and the target participle 3 in the target word pair 2 are 50 and 70, and the frequencies of the target participle 3 and the target participle 4 in the target word pair 3 are 70 and 70, respectively.

When determining the entropy of the target word pair, the method can be realized by the following steps:

Step 2084a, dividing the frequency of the first target participle by the sum of the frequency of the first target participle and the frequency of the second target participle to obtain a first entropy parameter.

For example, the first target participle may be the preceding participle in the target word pair and the second target participle may be the following participle in the target word pair. Taking target word pair 1 as an example, target participle 1 is the first target participle and target participle 2 is the second target participle, and the frequency of target participle 1 may be divided by the sum of the frequency of target participle 1 and the frequency of target participle 2 to obtain the first entropy parameter. For example, the first entropy parameter of target word pair 1 may be: 50/(50+50) = 0.5.

Step 2084b, dividing the frequency of the second target participle by the sum of the frequency of the first target participle and the frequency of the second target participle to obtain a second entropy parameter.

For example, the second entropy parameter may be the frequency of target participle 2 divided by the sum of the frequency of target participle 1 and the frequency of target participle 2. The second entropy parameter of target word pair 1 may be: 50/(50+50) = 0.5.

Step 2084c, substituting the first entropy parameter and the second entropy parameter into a preset entropy calculation formula to obtain the entropy value of the target word pair.

In this step, the preset formula for calculating entropy may be:

$$H_{ab} = -p_a \log p_a - p_b \log p_b$$

where p_a represents the first entropy parameter, p_b represents the second entropy parameter, log(·) represents a logarithmic function, and H_ab represents the entropy value of a target word pair consisting of a participle a and a participle b. In this step, the first entropy parameter and the second entropy parameter of the target word pair may be substituted into the preset entropy calculation formula to calculate the entropy value of the target word pair. The entropy value calculated in this way can reflect the degree of association between the two constituent participles of the target word pair: the larger H_ab is, the higher the degree of association between the two constituent participles, and the more likely the target word pair is a component of an entity word.
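Steps 2084a to 2084c can be sketched as a single function. The specification does not state the base of the logarithm; base 2 is an assumption here, chosen because the entropy of a two-outcome distribution then ranges from 0 to 1, which is compatible with example values such as 0.79 and a threshold of 0.6. The function name `pair_entropy` and the handling of zero frequencies are likewise assumptions.

```python
import math


def pair_entropy(freq_first: int, freq_second: int, base: float = 2.0) -> float:
    """Entropy of a target word pair from the frequencies of its two participles."""
    total = freq_first + freq_second
    if total == 0:
        return 0.0  # assumption: a pair with no occurrences gets entropy 0
    entropy = 0.0
    for freq in (freq_first, freq_second):
        if freq > 0:  # treat 0 * log(0) as 0
            p = freq / total  # first or second entropy parameter
            entropy -= p * math.log(p, base)
    return entropy
```

Under the base-2 assumption, the frequencies 50 and 50 of target word pair 1 give an entropy of 1.0, and the frequencies 50 and 70 of target word pair 2 give approximately 0.98.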

Step 2085, determining entity words according to the entropy values of the plurality of target word pairs corresponding to the target text.

Optionally, step 2085 may include:

Step 2085a, when the number of target word pairs with entropy values larger than the preset entropy threshold among the plurality of target word pairs corresponding to the target text is equal to 1, determining the target word pair with the entropy value larger than the preset entropy threshold as an entity word.

In this step, the preset entropy threshold may be determined experimentally; preferably, the preset entropy threshold is 0.6. For example, assume that among target word pair 1, target word pair 2 and target word pair 3 corresponding to the target text, the entropy value of target word pair 1 is H_ab1 = 0.79, the entropy value of target word pair 2 is H_ab2 = 0.5, and the entropy value of target word pair 3 is H_ab3 = 0.4. Only the entropy value of target word pair 1 is greater than the preset entropy threshold of 0.6, so target word pair 1 may be determined as an entity word.

Step 2085b, when the number of target word pairs with entropy values larger than the preset entropy threshold value in the plurality of target word pairs corresponding to the target text is larger than 1, determining whether overlapped participles exist between the target word pairs with entropy values larger than the preset entropy threshold value.

For example, assume that among target word pair 1, target word pair 2 and target word pair 3 corresponding to the target text, the entropy value of target word pair 1 is H_ab1 = 0.79, the entropy value of target word pair 2 is H_ab2 = 0.5, and the entropy value of target word pair 3 is H_ab3 = 0.8, so that the entropy values of target word pair 1 and target word pair 3 are greater than the preset entropy threshold. In this case, the number of target word pairs with entropy values larger than the preset entropy threshold is 2; since 2 is larger than 1, it can be further determined whether an overlapped participle exists between target word pair 1 and target word pair 3.

Step 2085c, if there are overlapped participles between the target word pairs with entropy values larger than the preset entropy threshold, combining the target word pairs with the overlapped participles into entity words.

Assuming that target word pair 1 corresponding to the target text is composed of participle 1 and participle 2, target word pair 2 is composed of participle 3 and participle 4, and target word pair 3 is composed of participle 2 and participle 3, it can be seen that an overlapped participle, namely participle 2, exists between target word pair 1 and target word pair 3; in this case, target word pair 1 and target word pair 3 can be combined into an entity word. When the combination is performed, only one copy of the overlapped participle is retained, that is, participle 1, participle 2 and participle 3 are combined into one entity word.

Step 2085d, if there is no overlapped participle between the target word pairs with entropy values larger than the preset entropy threshold, determining each target word pair with an entropy value larger than the preset entropy threshold as an entity word.

Assuming that a target word pair 1 corresponding to a target text is composed of a participle 1 and a participle 2, and a target word pair 2 is composed of a participle 2 and a participle 3, and a target word pair 3 is composed of a participle 3 and a participle 4, it can be seen that there is no overlapped participle between the target word pair 1 and the target word pair 3, and at this time, the target word pair 1 can be determined as an entity word, and the target word pair 3 can be determined as an entity word.
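Steps 2085a to 2085d can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes the target word pairs are adjacent pairs in text order, so that `entropies[i]` belongs to the pair formed by participles `i` and `i + 1`, and that two selected pairs overlap exactly when their indices are consecutive. Entity words are returned as tuples of participles, because how participles are joined into a string depends on the language. The function name `entity_words` is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def entity_words(participles: Sequence[str],
                 entropies: Sequence[float],
                 threshold: float = 0.6) -> List[Tuple[str, ...]]:
    """Select pairs above the threshold and merge pairs that share a participle."""
    # Indices of the pairs whose entropy value exceeds the preset threshold.
    selected = [i for i, h in enumerate(entropies) if h > threshold]
    words: List[Tuple[str, ...]] = []
    start: Optional[int] = None
    for k, i in enumerate(selected):
        if start is None:
            start = i  # first pair of a new run
        is_last = k == len(selected) - 1
        if is_last or selected[k + 1] != i + 1:
            # The run ends here; keep each overlapped participle only once.
            words.append(tuple(participles[start:i + 2]))
            start = None
    return words
```

With the entropy values 0.79, 0.5 and 0.4, only the first pair is returned; with 0.79, 0.5 and 0.8, the first and third pairs share no participle and are returned as two separate entity words; with 0.79, 0.8 and 0.5, the first two pairs share a participle and are merged into one three-participle entity word.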

In the embodiment of the invention, after a new entity word is determined, the new entity word can be transmitted to the word segmentation module. Specifically, the new entity words may be stored in the word segmentation module in a text format, so that the word segmentation module updates its entity word bank: the word segmentation module can load the stored new entity words, check them for duplicates against the existing entity word bank, and merge them into it to form a new entity word bank. Therefore, when the word segmentation module is used to segment search content, it can accurately identify the new entity words.
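The update of the entity word bank can be sketched as a set union. This sketch assumes, purely for illustration, that the text format holds one entity word per line and that the existing entity word bank is available as a set of strings; the specification does not describe the actual storage layout.

```python
from typing import Set


def merge_entity_words(existing: Set[str], new_words_text: str) -> Set[str]:
    """Merge newly determined entity words into the existing bank without duplicates."""
    new_words = {line.strip() for line in new_words_text.splitlines()}
    new_words.discard("")  # ignore blank lines
    return existing | new_words  # the union drops words that are already present
```

Using a set makes the duplicate check implicit: a new entity word that is already in the bank leaves the bank unchanged.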

In summary, in the entity word processing method provided in the second embodiment of the present invention, the query fields in the plurality of first logs may be used as a plurality of candidate texts, at least one target text is determined among the plurality of candidate texts, the associated click link corresponding to the at least one target text is then determined, an associated text that is semantically similar to the target text and has the same search intention is determined according to the associated click link, and finally, an entity word is determined according to the target text and the associated text. Because the query field of a log often contains new entity words, the new entity words can be determined and then used to update the word segmentation module, so that the word segmentation module can accurately recognize the new entity words and the video search precision is improved. Meanwhile, the candidate texts are screened so that candidate texts with a low probability of containing new entity words are removed, which reduces the workload of subsequent text processing.

Example three

Fig. 3 is a block diagram of an entity word processing apparatus according to a third embodiment of the present invention. As shown in Fig. 3, the apparatus 30 may include:

A first extraction module 301, configured to, for multiple first logs within a preset first time period, extract a query field in the first logs as a candidate text to obtain multiple candidate texts.

The screening module 302 is configured to screen the multiple candidate texts to obtain at least one target text.

A second extraction module 303, configured to use the click link of a first log whose query field is the target text as an associated click link, and obtain at least one associated click link corresponding to the at least one target text; the associated click link is a link clicked when the user queries by taking the target text as query content.

A first determining module 304, configured to determine, for multiple second logs within a preset second time period, a query field corresponding to the second log including the associated click link as an associated text of the target text, so as to obtain at least one associated text; the preset second time period comprises the preset first time period, and the second logs comprise the first logs.

A second determining module 305, configured to determine an entity word according to the at least one target text and the at least one associated text.
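The behaviour of the second extraction module 303 can be sketched as a grouping of click links by query field. The representation of a first log as a `(query, click_url)` tuple and the function name `associated_click_links` are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def associated_click_links(first_logs: Iterable[Tuple[str, str]],
                           target_texts: Set[str]) -> Dict[str, Set[str]]:
    """Map each target text to the links clicked in first logs whose query field equals it."""
    links: Dict[str, Set[str]] = defaultdict(set)
    for query, click_url in first_logs:
        if query in target_texts:
            links[query].add(click_url)
    return dict(links)
```

A target text for which no click was logged simply has no entry in the returned mapping.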

In summary, in the entity word processing apparatus provided in the third embodiment of the present invention, the first extraction module may use the query fields in the plurality of first logs as a plurality of candidate texts, the screening module determines at least one target text among the plurality of candidate texts, the second extraction module determines the associated click link corresponding to the at least one target text, the first determining module determines, according to the associated click link, an associated text that is semantically similar to the target text and has the same search intention, and finally, the second determining module determines the entity word according to the target text and the associated text. Because the query field of a log often contains new entity words, the new entity words can be determined, which solves the problem in the prior art that, because new entity words cannot be determined, the word segmentation module has a low recognition rate for new entity words and the video search precision is reduced.

Example four

Fig. 4 is a block diagram of another entity word processing apparatus according to a fourth embodiment of the present invention. As shown in Fig. 4, the apparatus 40 may include:

A first extraction module 401, configured to, for multiple first logs within a preset first time period, extract a query field in the first logs as a candidate text to obtain multiple candidate texts.

A screening module 402, configured to screen the multiple candidate texts to obtain at least one target text.

A second extraction module 403, configured to use the click link of the first log whose query field is the target text as an associated click link, and obtain at least one associated click link corresponding to the at least one target text; and the associated click link is a link clicked when the user queries by taking the target text as query content.

A first determining module 404, configured to determine, for multiple second logs within a preset second time period, a query field corresponding to the second log including the associated click link as an associated text of the target text, so as to obtain at least one associated text; the preset second time period comprises the preset first time period, and the second logs comprise the first logs.

A second determining module 405, configured to determine an entity word according to the at least one target text and the at least one associated text.

Optionally, the screening module 402 may include:

A first removing sub-module 4021, configured to remove candidate texts whose number of occurrences is smaller than a preset search count threshold from the multiple candidate texts, so as to obtain at least one first candidate text.

The statistic sub-module 4022 is configured to perform word segmentation processing on each first candidate text in the at least one first candidate text, and count the number of words segmented corresponding to each first candidate text.

The second removing sub-module 4023 is configured to remove the first candidate text whose corresponding number of segmented words is not greater than 1, so as to obtain at least one second candidate text.

The matching sub-module 4024 is configured to match each second candidate text by using a preset format template, and use the second candidate text that is not matched with the preset format template as a target text to obtain at least one target text.
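The four sub-modules of the screening module can be sketched as one filtering pass. This is a minimal sketch under several assumptions: the `segment` callable stands in for the word segmentation module, the preset format templates are modelled as compiled regular expressions (the specification does not say how templates are expressed), and the function name `screen_candidates` is hypothetical.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List


def screen_candidates(candidates: Iterable[str],
                      segment: Callable[[str], List[str]],
                      min_searches: int,
                      templates: Iterable[re.Pattern]) -> List[str]:
    """Keep candidate texts that pass the count, participle-number and template checks."""
    counts = Counter(candidates)
    template_list = list(templates)
    targets = []
    for text, count in counts.items():
        if count < min_searches:
            continue  # searched fewer times than the preset threshold
        if len(segment(text)) <= 1:
            continue  # number of participles is not greater than 1
        if any(t.fullmatch(text) for t in template_list):
            continue  # matches a preset format template
        targets.append(text)
    return targets
```

For instance, with a whitespace segmenter and a template such as `re.compile(r"\w+ \d+")`, a frequently searched two-participle query that does not fit the template is kept as a target text, while a single-participle query, a rarely searched query, and a query fitting the template are all removed.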

Optionally, the second determining module 405 may include:

A combining sub-module 4051, configured to perform word segmentation processing on the target text to obtain a plurality of corresponding target participles, and combine every two adjacent target participles in the plurality of target participles to obtain a plurality of target word pairs.

The word segmentation sub-module 4052 is configured to perform word segmentation processing on the multiple associated texts corresponding to the target text to obtain multiple associated words.

The counting sub-module 4053 is configured to count, for each target word pair of the plurality of target word pairs, a frequency of occurrence of each target participle of the target word pair in the plurality of associated participles.

The calculating submodule 4054 is configured to calculate an entropy value of the target word pair according to the frequency of each target word segmentation in the target word pair.

The determining sub-module 4055 is configured to determine the entity word according to entropy values of a plurality of target word pairs corresponding to the target text.

Optionally, the calculating sub-module 4054 may be configured to:

Divide the frequency of the first target participle by the sum of the frequency of the first target participle and the frequency of the second target participle to obtain a first entropy parameter.

Divide the frequency of the second target participle by the sum of the frequency of the first target participle and the frequency of the second target participle to obtain a second entropy parameter.

Optionally, the determining sub-module 4055 may be configured to:

When the number of target word pairs with entropy values larger than a preset entropy threshold among the plurality of target word pairs corresponding to the target text is equal to 1, determine the target word pair with the entropy value larger than the preset entropy threshold as the entity word.

When the number of target word pairs with entropy values larger than a preset entropy threshold value in a plurality of target word pairs corresponding to the target text is larger than 1, determining whether overlapped participles exist between the target word pairs with entropy values larger than the preset entropy threshold value.

And if the target word pairs with the entropy values larger than the preset entropy threshold value have the overlapped participles, combining the target word pairs with the overlapped participles into the entity word.

In summary, in the entity word processing apparatus provided in the fourth embodiment of the present invention, the first extraction module may use query fields in the plurality of first logs as a plurality of candidate texts, the screening module determines at least one target text in the plurality of candidate texts, the second extraction module determines an associated click link corresponding to the at least one target text, the first determination module determines, according to the associated click link, an associated text having a semantic similar to that of the target text and having a same search intention, and finally, the second determination module determines the entity word according to the target text and the associated text. Because the query field of the log often comprises the new entity word, the new entity word can be determined, and then the word segmentation module is updated by using the new entity word, so that the word segmentation module can accurately determine the new entity word, and the video search precision is improved; meanwhile, the candidate texts are screened, the candidate texts containing new entity words with low probability are removed, and the workload of subsequent text processing is reduced.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. As a person skilled in the art will readily appreciate, any combination of the above embodiments is possible, and any such combination is therefore an embodiment of the present invention; for reasons of space, these combinations are not described in detail herein.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.

Claims

1. An entity word processing method, characterized in that the method comprises:

screening the candidate texts to obtain at least one target text;

extracting click links of a first log with query fields as the target texts as associated click links to obtain at least one associated click link corresponding to the at least one target text; the associated click link is a link clicked when a user queries by taking the target text as query content;

determining entity words according to the at least one target text and the at least one associated text;

wherein the step of determining the entity word from the at least one target text and the at least one associated text comprises:

for each of the at least one target text, performing the following:

2. The method of claim 1, wherein the step of filtering the candidate texts to obtain at least one target text comprises:

3. The method of claim 1, wherein the step of calculating the entropy of the target word pair based on the frequency of each target participle in the target word pair comprises:

4. The method according to claim 3, wherein the step of determining the entity word according to the entropy values of the target word pairs corresponding to the target text comprises:

5. An apparatus for processing an entity word, the apparatus comprising:

a second determining module, configured to determine an entity word according to the at least one target text and the at least one associated text;

wherein the second determining module comprises:

6. The apparatus of claim 5, wherein the screening module comprises:

7. The apparatus of claim 5, wherein the computation submodule is configured to:

8. The apparatus of claim 7, wherein the determination submodule is configured to: