CN109740152B

CN109740152B - Text category determination method and device, storage medium and computer equipment

Info

Publication number: CN109740152B
Application number: CN201811592736.7A
Authority: CN
Inventors: 张长旺; 张纪红
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2023-02-17
Anticipated expiration: 2038-12-25
Also published as: CN109740152A

Abstract

The application relates to a text category determination method, a text category determination device, a computer-readable storage medium, and computer equipment. The method comprises the following steps: extracting keywords of a text to be processed, and determining the weight of each keyword; obtaining semantic description information corresponding to each keyword; determining a first relevance between each keyword and each candidate category according to the semantic description information; determining a second relevance between the text to be processed and each candidate category according to the weight of each keyword and each first relevance; and determining the category to which the text to be processed belongs from the candidate categories according to the second relevances. The scheme provided by the application can save labor cost and removes the dependence of category determination quality on the quality of manual labeling.

Description

Text category determination method and device, storage medium and computer equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for determining text categories, a computer-readable storage medium, and a computer device.

Background

Text category labeling means labeling a text with one or more categories in a category system. It is widely used in business scenarios such as advertising, recommendation, and search. Determining the category to which a text belongs is an important step in text category labeling.

In the traditional approach to text category determination, the categories of a number of texts are labeled manually to obtain training samples, a machine learning model such as a neural network is trained on those samples to obtain a mapping model, and a text to be processed is then input into the mapping model, which determines its category. However, manually labeling the training samples consumes a great deal of manpower. Moreover, because the mapping model is trained on manually labeled samples, the quality with which the category of the text to be processed is determined depends heavily on the quality of the manual labeling.

Disclosure of Invention

In view of this, to address the technical problems that the conventional approach consumes a great deal of manpower and that the quality of category determination depends heavily on the quality of manual labeling, it is necessary to provide a method, an apparatus, a computer-readable storage medium, and a computer device for determining a text category.

A method of determining a text category, comprising:

extracting keywords of a text to be processed, and determining the weight of each keyword;

obtaining semantic description information corresponding to each keyword respectively;

determining first relevance of each keyword and each candidate category according to each semantic description information;

determining second relevance between the text to be processed and each candidate category according to the weight of each keyword and each first relevance;

and determining the category to which the text to be processed belongs from the candidate categories according to the second relevances.

An apparatus for determining a text category, comprising:

the keyword processing module is used for extracting keywords of the text to be processed and determining the weight of each keyword;

a semantic description information acquisition module for acquiring semantic description information corresponding to each of the keywords;

the first relevancy determining module is used for determining the first relevancy between each keyword and each candidate category according to each semantic description information;

the second relevance determining module is used for determining second relevance between the text to be processed and each candidate category according to the weight of each keyword and each first relevance;

and the text category determining module is used for determining the category to which the text to be processed belongs from the candidate categories according to the second relevances.

A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method of determining text categories as described above.

A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of text category determination as described above.

The method, the device, the computer readable storage medium and the computer equipment for determining the text category extract the keywords of the text to be processed, obtain the weight of each keyword, then obtain the semantic description information corresponding to each keyword, determine the first correlation degree between each keyword and the candidate category according to each semantic description information, determine the second correlation degree between the text to be processed and each candidate category according to the weight of each keyword and each first correlation degree, and further determine the category to which the text to be processed belongs from each candidate category according to each second correlation degree. Therefore, under the condition that no text with known belonged categories exists, the category to which any text belongs can be automatically determined in the whole process by the computer equipment, so that the link of manually marking the category is omitted, the labor cost is saved, and the dependency of determining the quality of the category to which the text to be processed belongs on the quality of manual marking is eliminated.

Drawings

FIG. 1 is a diagram of an application environment of a method for determining text categories in one embodiment;

FIG. 2 is a flowchart illustrating a method for determining text categories in one embodiment;

FIG. 3 is a diagram illustrating an embodiment of a process for determining a first relevance of a keyword to a candidate class;

FIG. 4 is a diagram illustrating a process for determining a second degree of association of text with a candidate category, in accordance with one embodiment;

FIG. 5 is a schematic diagram illustrating an interface for displaying and querying category annotation results for a text, in accordance with an embodiment;

FIG. 6 is a schematic flow chart illustrating the manner in which the first proportional threshold is determined in one embodiment;

FIG. 7 is a diagram illustrating the process of determining the number of remaining words in determining the first scaling threshold in one embodiment;

FIG. 8 is a diagram illustrating an embodiment of a process for determining a first relevance of a keyword to a candidate class;

FIG. 9 is a diagram that illustrates a process for determining a first relevance of a keyword to a candidate class, in one embodiment;

FIG. 10 is a schematic diagram of an interface for manually entering associated knowledge in one embodiment;

FIG. 11 is a schematic diagram of an interface for manually entering category priority information in one embodiment;

FIG. 12 is a flowchart illustrating a method for determining text categories in one embodiment;

FIG. 13 is a block diagram showing the construction of a text-class determining apparatus according to an embodiment;

FIG. 14 is a block diagram showing the construction of a computer device according to one embodiment;

FIG. 15 is a block diagram that illustrates the architecture of a computing device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

It should be noted that the terms "first," "second," and the like are used herein to distinguish similar objects by name, and the objects themselves are not limited by these terms. These terms are interchangeable under appropriate circumstances without departing from the scope of the present application. For example, a "first word segment" may be described as a "second word segment", and similarly, a "second word segment" may be described as a "first word segment".

The method for determining the text category provided by the embodiments of the present application can be applied to an application environment as shown in fig. 1. The application environment may involve the terminal 110 and the server 120, and the terminal 110 and the server 120 are connected through a network.

Specifically, the terminal 110 acquires a text to be processed and transmits the text to be processed to the server 120. The server 120 extracts keywords of the text to be processed, obtains weights of the keywords, obtains semantic description information corresponding to the keywords, determines first relevance between the keywords and candidate categories according to the semantic description information, determines second relevance between the text to be processed and the candidate categories according to the weights of the keywords and the first relevance, and determines the category to which the text to be processed belongs from the candidate categories according to the second relevance.

In other application environments, the server 120 may be involved only, and the terminal 110 is not involved, and accordingly, a series of steps from acquiring the text to be processed to determining the category to which the text to be processed belongs from among the candidate categories may be performed by the server 120 independently. Alternatively, only the terminal 110 may be involved, without involving the server 120, whereby a series of steps from acquiring the text to be processed to determining the category to which the text to be processed belongs from among the candidate categories is independently performed by the terminal 110.

The terminal 110 may include at least one of a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, a wearable device, and the like, but is not limited thereto. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.

In one embodiment, as shown in FIG. 2, a method for text category determination is provided. The method is described taking its application to a computer device (such as the terminal 110 or the server 120 in FIG. 1) as an example. The method may include the following steps S202 to S210.

S202, extracting keywords of the text to be processed, and obtaining the weight of each keyword.

The text to be processed is the text whose category is to be determined. It may be a short text, that is, a text of short length, such as a text of no more than 160 characters; common short texts include, but are not limited to, microblog posts, article titles, comments, mobile short messages, and document summaries. The text to be processed may also be a long text, that is, a text longer than a short text.

The keywords are representative words in the text to be processed and can be used to represent its main idea. Specifically, the keywords may be obtained by performing keyword extraction on the text to be processed. Keyword extraction may be implemented by any suitable method, such as the TextRank algorithm, the RAKE algorithm, or a topic-model algorithm, which is not limited herein.

The weight of a keyword represents how important the keyword is to the text to be processed. The weight of a keyword may be determined based on the TF-IDF value of the keyword. The TF-IDF value of the keyword is the product of the term frequency (TF) of the keyword in the text to be processed and the inverse document frequency (IDF) of the keyword.

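The term frequency of a keyword in the text to be processed is the number of occurrences of the keyword in the text to be processed. The inverse document frequency of the keyword may be computed from the number of all objects in a target corpus and the number of those objects that contain the keyword, for example in the conventional logarithmic form:

$$\mathrm{IDF}(w) = \log \frac{N}{n_w}$$

where $N$ is the number of all objects in the target corpus and $n_w$ is the number of objects in the target corpus that contain the keyword $w$.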

In one embodiment, the target corpus may be the corpus corresponding to a web search service. Accordingly, the number of objects in the target corpus that contain the keyword may be the total number of search results obtained by searching for the keyword through the web search service, and the number of all objects in the target corpus may be set to a predetermined value, for example 1 + 100000000.

It should be noted that, when the web search service is invoked to search for a keyword, the total number of search results and the semantic description information corresponding to the keyword can be obtained together and associated with each other. Therefore, the number of search results is already available when the semantic description information of the keyword is acquired, and can be used directly when calculating the inverse document frequency of the keyword, without calling the web search service again just to obtain it.

In an embodiment, determining the weight of each keyword according to the TF-IDF value of each keyword of the text to be processed may be implemented as follows: the computer device determines the original weight of each keyword according to its TF-IDF value, and normalizes the original weights to obtain the weight of each keyword. The normalization may be performed by dividing the original weight of each keyword by the sum of the original weights of all keywords of the text to be processed. The original weight of a keyword may be the TF-IDF value of the keyword itself, or the product of the TF-IDF value of the keyword and the word length of the keyword.
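The weighting described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `keyword_weights`, the `hit_counts` argument, the logarithmic IDF form log(N / n), the clamping of n, and the uniform fallback when all original weights are zero are all assumptions made for illustration. Only the corpus size of 1 + 100000000, the optional multiplication by word length, and the normalization by the sum of original weights come from the text.

```python
import math
from collections import Counter

# Assumed corpus size, following the "1 + 100000000" example in the text.
TOTAL_OBJECTS = 1 + 100_000_000


def keyword_weights(keywords, text_tokens, hit_counts, use_word_length=False):
    """Return normalized weights (summing to 1) for the given keywords.

    keywords:     list of keyword strings
    text_tokens:  tokens of the text to be processed
    hit_counts:   keyword -> number of corpus objects containing it
                  (for example, a total search-result count)
    """
    tf = Counter(text_tokens)
    raw = {}
    for kw in keywords:
        # Assumed IDF form log(N / n); n is clamped to [1, N] so the value
        # is defined and non-negative.
        n = min(max(hit_counts.get(kw, 1), 1), TOTAL_OBJECTS)
        weight = tf[kw] * math.log(TOTAL_OBJECTS / n)
        if use_word_length:
            # Optional variant from the text: TF-IDF times word length.
            weight *= len(kw)
        raw[kw] = weight

    total = sum(raw.values())
    if total == 0:
        # Assumption: fall back to uniform weights in the degenerate case.
        return {kw: 1.0 / len(raw) for kw in raw} if raw else {}
    # Normalize by the sum of the original weights.
    return {kw: w / total for kw, w in raw.items()}


weights = keyword_weights(
    ["tiger", "film"],
    ["tiger", "film", "tiger"],
    {"tiger": 1_000, "film": 1_000_000},
)
```

In this made-up example the rarer and more frequent keyword "tiger" receives the larger share of the weight.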

S204, obtaining semantic description information corresponding to each keyword.

Semantic description information is information that helps in understanding the meaning expressed by a keyword. The data form of the semantic description information may be a text file.

In one embodiment, the semantic description information corresponding to a keyword may be determined according to information compiled by relevant personnel to describe the keyword (hereinafter, expert description information); the relevant personnel may be experts in the relevant field. Specifically, experts may compile expert description information for each candidate keyword, and an expert knowledge base is then constructed from the candidate keywords, the expert description information, and the matching relationships between them. When the semantic description information of a keyword needs to be obtained, the candidate keyword corresponding to the keyword is looked up in the expert knowledge base, and the semantic description information of the keyword may include the expert description information matched with the candidate keyword found.
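As a rough illustration of the lookup just described, the expert knowledge base can be pictured as a mapping from candidate keywords to expert description information. The names `EXPERT_KB` and `semantic_description`, the placeholder entries, and the use of exact-match lookup are assumptions for this sketch; the text does not specify how a keyword is matched to a candidate keyword.

```python
from typing import Optional

# Hypothetical expert knowledge base: candidate keyword -> expert description.
# The entries are placeholders, not real expert descriptions.
EXPERT_KB = {
    "keyword A": "expert description of keyword A",
    "keyword B": "expert description of keyword B",
}


def semantic_description(keyword: str, kb: dict = EXPERT_KB) -> Optional[str]:
    """Look up the expert description matched with a keyword.

    Exact-match lookup is an assumption; returns None when no candidate
    keyword in the knowledge base corresponds to the keyword.
    """
    return kb.get(keyword)
```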

S206, determining first relevance of each keyword and each candidate category according to each semantic description information.

The first degree of correlation between a keyword and a candidate category is a metric that measures the degree of matching between the keyword and the candidate category. The value range of the first degree of correlation may be [0, 1]; a larger first degree of correlation indicates a higher degree of matching between the keyword and the candidate category, and a smaller one indicates a lower degree of matching.

In this embodiment, the number of the candidate categories is more than 1, and the computer device determines, according to the semantic description information corresponding to each keyword, a first relevance between each keyword and each candidate category.

For example, as shown in fig. 3, keyword extraction is performed on the text LT1 to be processed to obtain 3 keywords: keyword Kw1, keyword Kw2, and keyword Kw3, where keyword Kw1 corresponds to semantic description information Sd1, keyword Kw2 corresponds to semantic description information Sd2, and keyword Kw3 corresponds to semantic description information Sd3, and there are 3 candidate categories: candidate category C1, candidate category C2, and candidate category C3.

Accordingly, the computer device determines a first degree of correlation of the keyword Kw1 with the candidate category C1, a first degree of correlation of the keyword Kw1 with the candidate category C2, and a first degree of correlation of the keyword Kw1 with the candidate category C3, based on the semantic description information Sd1. The computer device also determines a first degree of correlation of the keyword Kw2 with the candidate category C1, a first degree of correlation of the keyword Kw2 with the candidate category C2, and a first degree of correlation of the keyword Kw2 with the candidate category C3, based on the semantic description information Sd2. Likewise, the computer device determines a first degree of correlation of the keyword Kw3 with the candidate category C1, a first degree of correlation of the keyword Kw3 with the candidate category C2, and a first degree of correlation of the keyword Kw3 with the candidate category C3, based on the semantic description information Sd3.

In one embodiment, the computer device may determine a first degree of correlation between each keyword and each candidate category according to the semantic description information corresponding to each keyword and the category description information of each candidate category. For example, the computer device determines a first degree of correlation between the keyword Kw1 and the candidate category C1 according to the semantic description information Sd1 and the category description information of the candidate category C1. The category description information of a candidate category is information that can be used to reflect the characteristics of that candidate category.

S208, determining second correlation degrees of the text to be processed and each candidate category according to the weight of each keyword and each first correlation degree.

The second degree of correlation between the text to be processed and a candidate category is a metric that measures the degree of matching between the text to be processed and the candidate category. The value range of the second degree of correlation may be [0, 1]; a larger second degree of correlation indicates a higher degree of matching between the text to be processed and the candidate category, and a smaller one indicates a lower degree of matching.

In this embodiment, for each candidate category, the computer device performs weighted summation according to the weight of each keyword of the text to be processed and the first correlation between each keyword and the candidate category, so as to obtain a second correlation between the text to be processed and the candidate category.

In connection with the foregoing example, as shown in fig. 4, the computer device performs weighted summation according to the weight of the keyword Kw1, the first correlation degree between the keyword Kw1 and the candidate category C1, the weight of the keyword Kw2, the first correlation degree between the keyword Kw2 and the candidate category C1, the weight of the keyword Kw3, and the first correlation degree between the keyword Kw3 and the candidate category C1 to obtain the second correlation degree between the text LT1 to be processed and the candidate category C1. And the computer equipment carries out weighted summation according to the weight of the keyword Kw1, the first correlation degree of the keyword Kw1 and the candidate category C2, the weight of the keyword Kw2, the first correlation degree of the keyword Kw2 and the candidate category C2, the weight of the keyword Kw3 and the first correlation degree of the keyword Kw3 and the candidate category C2 to obtain a second correlation degree of the text LT1 to be processed and the candidate category C2. And the computer equipment carries out weighted summation according to the weight of the keyword Kw1, the first correlation degree of the keyword Kw1 and the candidate category C3, the weight of the keyword Kw2, the first correlation degree of the keyword Kw2 and the candidate category C3, the weight of the keyword Kw3 and the first correlation degree of the keyword Kw3 and the candidate category C3 to obtain a second correlation degree of the text LT1 to be processed and the candidate category C3.
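The weighted summation in this example can be sketched in Python as follows. The keyword weights and first correlation degrees below are made-up illustrative numbers, and the function name `second_relevance` and the nested-dictionary layout are assumptions; only the weighted-sum rule itself comes from the text.

```python
def second_relevance(weights, first_relevance):
    """Weighted sum of first correlation degrees, per candidate category.

    weights:          keyword -> normalized weight
    first_relevance:  keyword -> {candidate category -> first correlation degree}
    """
    categories = sorted({c for scores in first_relevance.values() for c in scores})
    return {
        c: sum(weights[kw] * first_relevance[kw].get(c, 0.0) for kw in weights)
        for c in categories
    }


# Illustrative values only (not taken from the source).
weights = {"Kw1": 0.5, "Kw2": 0.3, "Kw3": 0.2}
first = {
    "Kw1": {"C1": 0.9, "C2": 0.1, "C3": 0.0},
    "Kw2": {"C1": 0.2, "C2": 0.8, "C3": 0.1},
    "Kw3": {"C1": 0.0, "C2": 0.5, "C3": 0.5},
}
second = second_relevance(weights, first)
# C1: 0.5*0.9 + 0.3*0.2 + 0.2*0.0 = 0.51
# C2: 0.5*0.1 + 0.3*0.8 + 0.2*0.5 = 0.39
# C3: 0.5*0.0 + 0.3*0.1 + 0.2*0.5 = 0.13
```

Because the weights are normalized to sum to 1 and each first correlation degree lies in [0, 1], each second correlation degree computed this way also lies in [0, 1].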

S210, determining the category to which the text to be processed belongs from the candidate categories according to the second correlation degrees.

The category to which the text to be processed belongs may include each candidate category whose second degree of correlation with the text to be processed satisfies a relevance screening condition. The relevance screening condition can be set according to actual requirements.

Specifically, the relevance screening condition may be that the second degree of correlation with the text to be processed is equal to or greater than a correlation threshold, where the threshold is predetermined according to actual requirements. The relevance screening condition may also be that the second degree of correlation with the text to be processed is among a predetermined number of the largest second degrees of correlation. That is, the second degrees of correlation are sorted in descending order, and the category to which the text to be processed belongs may include the candidate categories corresponding to the first predetermined number of second degrees of correlation; the predetermined number may be set to any positive integer according to actual requirements.
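The two screening conditions can be sketched as a single Python helper. The function name `select_categories`, its parameters, and the choice to apply both conditions together when both are supplied are assumptions for illustration; the scores are made-up values.

```python
def select_categories(second, threshold=None, top_k=None):
    """Pick candidate categories by second correlation degree.

    threshold: keep categories whose score is >= threshold (if given)
    top_k:     keep the top_k highest-scoring categories (if given)
    Applying both at once when both are given is an assumption of this sketch.
    """
    # Sort in descending order of second correlation degree.
    ranked = sorted(second.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(c, s) for c, s in ranked if s >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [c for c, _ in ranked]


# Illustrative scores only.
scores = {"C1": 0.51, "C2": 0.39, "C3": 0.13}
by_threshold = select_categories(scores, threshold=0.3)  # ["C1", "C2"]
by_top_k = select_categories(scores, top_k=1)            # ["C1"]
```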

It should be noted that a category system (a system formed by defining and dividing a target field) may be set in advance, and the category system includes more than 1 candidate category. Therefore, the category to which the text to be processed belongs can be determined from the candidate categories contained in the category system according to the second correlation degrees.

In one embodiment, after step S210, the method may further include: performing category labeling on the text to be processed according to the category to which it belongs. The category labeling may specifically be outputting a category labeling result for the text to be processed, and the result may include the text to be processed, the category to which it belongs, and the relevance between the text and that category. This relevance may be determined according to the second degree of correlation between the text to be processed and the category to which it belongs; for example, it may be the second degree of correlation itself.
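A category labeling result of the kind described here can be pictured as a small record. The `CategoryLabel` type, its field names, and the `label_text` helper are hypothetical; the sketch simply uses the second degree of correlation itself as the output relevance, which is one of the options the text mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryLabel:
    """One category labeling result (hypothetical field names)."""
    text: str
    category: str
    relevance: float


def label_text(text, selected_categories, second):
    """Build labeling results, using the second correlation degree as relevance."""
    return [CategoryLabel(text, c, second[c]) for c in selected_categories]


# Illustrative values only.
labels = label_text("example text", ["C1", "C2"], {"C1": 0.51, "C2": 0.39, "C3": 0.13})
```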

Therefore, in a practical application scenario, an automatic labeling system can be built on the text category determination method and used to label text categories. The automatic labeling system can also provide display and query services for the category labeling results of texts. The display and query interface may be as shown in FIG. 5: the user can click the "previous page" and "next page" controls to browse, page by page, the texts whose category labeling is complete; enter a text or a text ID in the input box 500 and click the query control to query the category ID and category name of the category to which the text belongs, together with the relevance between that category and the text; or enter a category name or a category ID in the input box 500 to query the texts under the corresponding category.

It should be noted that the conventional approach (manually labeling the categories of a number of texts to obtain training samples, training a machine learning model such as a neural network on those samples to obtain a mapping model, and then inputting the text to be processed into the mapping model to determine its category) has the following defects, and cannot meet the need to determine the categories of large volumes of text in real business scenarios such as advertising, recommendation, and search.

(1) Manually labeling training samples consumes a large amount of manpower. Specifically, the number of training samples that need to be labeled manually increases linearly with the number of candidate categories in the category system. For example, if 10,000 pieces of training data must be manually labeled to support one candidate category, then 10 million pieces must be manually labeled to support a category system of 1,000 candidate categories, which consumes enormous manpower and material resources.

(2) The quality of category determination depends heavily on the quality of the manual labeling. When the category of a text is determined through the mapping model, high-quality determination requires that the manual labels of the training samples be extremely accurate, and also that the distribution of the training samples across the candidate categories of the category system be consistent with that of the real population. However, training samples produced by manual labeling (especially large-scale manual labeling) rarely meet these requirements; in practice, the quality of manually labeled training samples is generally poor, so the conventional approach cannot determine the category of the text to be processed with high quality.

(3) New knowledge cannot be learned automatically and the mapping model cannot be updated automatically. The conventional approach learns the knowledge for mapping texts to categories from manually labeled training samples to obtain the mapping model, so without new training samples it can neither learn new knowledge nor update the mapping model automatically. Consequently, unless manual labeling is carried out again, the mapping model cannot understand a text to be processed that contains new words or trending words, and thus cannot accurately determine its category.

(4) The conventional approach does not support introducing manually accumulated business knowledge into the process of determining the category of the text to be processed. In real services it is common for the relevant personnel to have accumulated a great deal of business knowledge about text categories, such as which keywords relate to which categories and which candidate categories in the category system should be given priority. Introducing this accumulated business knowledge into the determination process can improve the quality of category determination.

The method for determining the text category provided by the embodiment of the application extracts the keywords of the text to be processed, obtains the weight of each keyword, acquires the semantic description information corresponding to each keyword, determines a first degree of correlation between each keyword and each candidate category according to the semantic description information, determines a second degree of correlation between the text to be processed and each candidate category according to the weight of each keyword and each first degree of correlation, and determines the category to which the text to be processed belongs from the candidate categories according to the second degrees of correlation. Therefore, even when no text with a known category is available, the computer device can determine the category of any text fully automatically, so that the step of manually labeling categories is omitted, labor cost is saved, and the dependence of category determination quality on the quality of manual labeling is removed. In addition, the intermediate logic, from acquiring the text to be processed to determining its category, is understandable to a human, which makes it possible to introduce manually accumulated business knowledge into the process of determining the category to which the text to be processed belongs.

In one embodiment, the step of extracting the keywords of the text to be processed may include: performing word segmentation on the text to be processed to obtain a plurality of first word segments; removing, from the first word segments, those that belong to a target filtering lexicon to obtain one or more second word segments, the second word segments being the first word segments that remain after the removal; and obtaining the keywords of the text to be processed from the second word segments.
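The filtering step can be sketched in Python as follows. Whitespace splitting stands in for a real word segmenter, and the lexicon contents are made up; the function name `second_word_segments` is likewise an assumption for illustration.

```python
def second_word_segments(first_segments, filter_lexicon):
    """Drop first word segments that appear in the target filtering lexicon."""
    return [seg for seg in first_segments if seg not in filter_lexicon]


# Stand-in segmentation: plain whitespace split (a real system would use a
# proper word segmenter).
first_segments = "the match report of the final".split()

# Hypothetical filtering lexicon for the data source of this text.
target_filter_lexicon = {"the", "of"}

remaining = second_word_segments(first_segments, target_filter_lexicon)
```

Using a set for the lexicon keeps each membership check constant-time, which matters when the lexicon is large.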

Word segmentation splits the text to be processed into a plurality of words. It may be implemented by any feasible word segmentation method, such as conditional random field (CRF) segmentation, jieba segmentation, NLPIR segmentation, LTP (Language Technology Platform) segmentation, or THULAC (THU Lexical Analyzer for Chinese) segmentation.

Conditional random field word segmentation segments the text to be processed based on conditional random field theory, jointly considering the frequency with which words appear in the text and the context of those words. It handles ambiguous words and new words well.

It should be noted that, when performing word segmentation on the text to be processed, the following optimization strategy may be adopted: the text content inside book title marks (《》) may be kept whole as a single participle instead of being segmented. For example, if the text to be processed is "the metaphorical meaning of the tiger in 《Life of Pi》", the whole title "Life of Pi" is taken as one participle rather than being split into its component words.

When the text to be processed is a short text, at least one of the following two optimization strategies may also be adopted. First, the text content inside brackets of a preset form may be kept whole as a single participle instead of being segmented, where the preset form includes at least one of round brackets, square brackets, and curly brackets. Second, the text content before the first colon near the head of the text to be processed may be kept whole as a single participle instead of being segmented. For example, if the text to be processed is "LPL battle report: WE drops the first game, then comes back to beat TOP 2:1", then "LPL battle report" is taken as one participle.

Keeping the text content in the above situations whole, rather than segmenting it, effectively preserves the actual meaning of that specific content and improves the accuracy of keyword determination.
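The optimization strategies above can be sketched as a pre-pass that collects the spans to be kept whole before a segmenter runs. This is a minimal sketch, not the patented implementation: the function name `protected_spans`, the regular expressions, and the 20-character limit used to interpret "near the head of the text" are all assumptions made for illustration.

```python
import re

# Hypothetical patterns (not taken from the source).
TITLE = re.compile(r"《[^》]+》")  # content inside book title marks
BRACKETS = re.compile(r"\([^)]+\)|\[[^\]]+\]|\{[^}]+\}")  # round, square, curly
# Assumption: "near the head" means the colon occurs within the first 20 characters.
HEAD = re.compile(r"([^:：]{1,20})[:：]")


def protected_spans(text, short_text=False):
    """Return the substrings that should each be kept as one participle."""
    # Strip the surrounding marks and keep only the inner content.
    spans = [m.group(0)[1:-1] for m in TITLE.finditer(text)]
    if short_text:
        spans += [m.group(0)[1:-1] for m in BRACKETS.finditer(text)]
        head = HEAD.match(text)  # anchored at the start of the text
        if head:
            spans.append(head.group(1).strip())
    return spans
```

A segmenter would then treat each returned span as an indivisible token. Full-width bracket variants are not handled here and would need extra patterns.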

The filtering word bank is a database that records filter words. Filtering word banks may correspond one-to-one with data sources. The data source to which a text belongs is the source of the text. For example, the topic texts of official accounts followed by users may be attributed to one data source, the title texts of articles read by users to another, the description texts of items purchased by users to another, the description texts of videos watched by users to another, and so on.

The filter words recorded in a filtering word bank are the words to be removed from the first participles of the text to be processed, and may include at least one of words without actual meaning and words that are common in the texts covered by the corresponding data source. It can be understood that a word that is common in the texts covered by a data source can hardly characterize any single text in that data source, so the texts covered by the data source are difficult to distinguish by that word. The word can therefore be taken as a filter word of the data source and recorded in the filtering word bank corresponding to the data source.

The target filtering word bank is the filtering word bank corresponding to the target data source, and the target data source is the data source to which the text to be processed belongs. In a specific implementation, the computer device may determine the data source to which the text to be processed belongs (i.e., the target data source), and then determine the filtering word bank corresponding to the target data source (i.e., the target filtering word bank).

In this embodiment, the computer device may perform word segmentation on the text to be processed to obtain the first participles of the text to be processed, then remove from the first participles those that are the same as the filter words recorded in the target filtering word bank, and obtain the keywords of the text to be processed according to the first participles remaining after the removal (i.e., the second participles).
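As a minimal sketch of this filtering step, assuming word segmentation has already produced the first participles and that the filtering word banks are held in a dictionary keyed by data source name (the function name and the data layout are illustrative, not specified by the source):

```python
def second_participles(first_participles, data_source, filter_banks):
    """Remove first participles recorded in the target filtering word bank."""
    # The target filtering word bank is the one matching the text's data source.
    target_bank = filter_banks.get(data_source, set())
    # Order is preserved so that adjacency can be used in later steps.
    return [w for w in first_participles if w not in target_bank]


banks = {"sports news": {"the", "match", "in", "a"}}
tokens = ["the", "match", "ended", "in", "a", "draw"]
remaining = second_participles(tokens, "sports news", banks)
```

Here `remaining` holds the second participles, from which keywords are then derived.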

In one embodiment, the method for constructing the target filtering word bank may include the following steps: performing word segmentation on the texts belonging to the target data source to obtain a plurality of third participles; determining a first proportion corresponding to each third participle; and constructing the target filtering word bank according to the third participles whose first proportion exceeds a first proportion threshold.

The first proportion corresponding to a third participle may be the ratio of the number of texts in the target data source that contain the third participle to the total number of texts in the target data source. Assuming that the target data source "sports news" covers 10 texts, 6 of which contain the third participle "match", the first proportion corresponding to "match" is 6/10 = 60%.

The first proportion threshold is the criterion for judging whether a third participle is a common word in the texts covered by the target data source. If the first proportion corresponding to a third participle exceeds the first proportion threshold, the third participle is a common word in the texts covered by the target data source and is to be used as a filter word of the target data source. If it does not exceed the first proportion threshold, the third participle is not a common word in those texts and should not be used as a filter word of the target data source. The first proportion threshold may be determined in any suitable manner, for example, set manually according to actual requirements. Continuing the foregoing example, if the first proportion threshold is set to 20%, the first proportion corresponding to the third participle "match" (60%) exceeds 20%, so "match" is a common word in the texts covered by the target data source "sports news" and is used as a filter word of that data source.

In this embodiment, the target data source covers a plurality of texts, and the computer device may perform word segmentation processing on each text covered by the target data source, so as to obtain a plurality of third words. For each third participle, the computer device determines a proportion of the number of texts containing the third participle in the target data source to the total number of texts in the target data source, so as to obtain a first proportion corresponding to each third participle. And then, screening out each third participle of which the first proportion exceeds a first proportion threshold value from each third participle, and constructing a target filtering word bank according to the screened out each third participle. Therefore, each third participle with the first ratio exceeding the first ratio threshold is recorded in the constructed target filtering word bank.

In one embodiment, the words without actual meaning for the target data source can also be determined by human experience, and then the target filtering word library is constructed according to the words without actual meaning for the target data source. Therefore, the words which are manually determined and have no actual meaning aiming at the target data source are recorded in the constructed target filtering word bank.

In one embodiment, the target filtering word library may also be constructed according to words which are determined by human beings and have no actual meaning aiming at the target data source, and each third participle of which the first ratio exceeds the first ratio threshold value. Namely, the constructed target filtering word bank simultaneously records the manually determined words without actual meanings aiming at the target data source and each third participle of which the first proportion exceeds the first proportion threshold.
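The construction embodiments above can be sketched as follows. This is a minimal illustration under stated assumptions: the texts are already segmented into third participles, the first proportion is computed as a document frequency, and the function name `build_filter_bank` and its parameters are hypothetical.

```python
def build_filter_bank(segmented_texts, first_proportion_threshold, manual_words=()):
    """Build a filtering word bank for one data source.

    segmented_texts: list of texts, each a list of third participles.
    manual_words: optional manually chosen words without actual meaning.
    """
    total = len(segmented_texts)
    if total == 0:
        return set(manual_words)
    text_count = {}
    for words in segmented_texts:
        for w in set(words):  # count each text at most once per word
            text_count[w] = text_count.get(w, 0) + 1
    # Keep the participles whose first proportion exceeds the threshold.
    common = {w for w, n in text_count.items()
              if n / total > first_proportion_threshold}
    return common | set(manual_words)


# 10 texts, 6 of which contain "match" (first proportion 60%).
texts = [["match", f"w{i}"] for i in range(6)] + [[f"w{i}"] for i in range(6, 10)]
bank = build_filter_bank(texts, 0.2, manual_words=["the"])
```

With a 20% threshold, `bank` contains "match" (from the statistics) and "the" (from the manual list), which corresponds to the embodiment that combines both sources.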

It should be noted that constructing the filtering word bank corresponding to a data source may be preparation work completed in advance. Specifically, a filtering word bank may be pre-constructed for each data source. After the text to be processed is obtained, the data source to which it belongs (i.e., the target data source) is determined, and when the filtering word bank corresponding to the target data source (i.e., the target filtering word bank) is needed, it is looked up directly among the pre-constructed filtering word banks instead of being constructed on the fly. In addition, the filtering word bank corresponding to each data source may be updated periodically.

In one embodiment, in addition to the manual setting described above, the first proportion threshold may be determined by the following steps, as shown in fig. 6: S602, determining fourth participles from the third participles according to the current proportion threshold; S604, determining the remaining number of words corresponding to each text belonging to the target data source; S606, determining a second proportion, which is the ratio of the number of texts whose remaining number of words is equal to or greater than the word number threshold to the total number of texts of the target data source; S608, when the second proportion does not exceed the second proportion threshold, determining the current proportion threshold as the first proportion threshold; and S610, when the second proportion exceeds the second proportion threshold, updating the current proportion threshold according to the down-regulation value, and returning to the step of determining the fourth participles from the third participles according to the current proportion threshold.

The fourth participles may include the third participles whose first proportion is equal to or greater than the current proportion threshold. Specifically, the third participles whose first proportion is equal to or greater than the current proportion threshold are screened out from the third participles obtained by performing word segmentation on the texts belonging to the target data source, and the screened-out third participles are the fourth participles.

The remaining number of words corresponding to the text may be the number of third participles remaining after the fourth participle is removed from each third participle of the text.

For example, as shown in fig. 7, assume that the texts belonging to the target data source are the texts CT1, CT2, and CT3. Word segmentation of CT1 yields the third participles Pc3-1, Pc3-2, Pc3-3, Pc3-4, and Pc3-5; word segmentation of CT2 yields the third participles Pc3-1, Pc3-2, Pc3-6, Pc3-7, and Pc3-8; and word segmentation of CT3 yields the third participles Pc3-6, Pc3-7, Pc3-8, Pc3-9, and Pc3-10. The third participles obtained from CT1, CT2, and CT3 are therefore Pc3-1 to Pc3-10, 10 third participles in total.

Suppose that, among the third participles Pc3-1 to Pc3-10, the third participles whose first proportion is equal to or greater than the current proportion threshold N1 (i.e., the fourth participles) are Pc3-1, Pc3-2, Pc3-3, Pc3-4, and Pc3-9. Then, among the third participles of the text CT1 (Pc3-1, Pc3-2, Pc3-3, Pc3-4, and Pc3-5), the only third participle left after the fourth participles are removed is Pc3-5; that is, under the current proportion threshold N1, the remaining number of words corresponding to the text CT1 is 1. Among the third participles of the text CT2 (Pc3-1, Pc3-2, Pc3-6, Pc3-7, and Pc3-8), the third participles left after the fourth participles are removed are Pc3-6, Pc3-7, and Pc3-8; that is, under the current proportion threshold N1, the remaining number of words corresponding to the text CT2 is 3. Among the third participles of the text CT3 (Pc3-6, Pc3-7, Pc3-8, Pc3-9, and Pc3-10), the third participles left after the fourth participles are removed are Pc3-6, Pc3-7, Pc3-8, and Pc3-10; that is, under the current proportion threshold N1, the remaining number of words corresponding to the text CT3 is 4.

The second proportion is the ratio of the number of texts whose remaining number of words is equal to or greater than the word number threshold to the total number of texts of the target data source, and the word number threshold can be determined according to actual requirements. Continuing the foregoing example, the texts belonging to the target data source are the texts CT1, CT2, and CT3, so the total number of texts is 3. Assuming that the word number threshold is 3, the texts whose remaining number of words is equal to or greater than 3 are CT2 and CT3, and the second proportion is 2/3 (about 66.7%).

The second proportion threshold is used to decide whether the current proportion threshold can serve as the first proportion threshold. After the second proportion is determined, it is judged whether the second proportion exceeds the second proportion threshold. If it does not, the current proportion threshold is determined as the first proportion threshold, and the process of determining the first proportion threshold ends. If it does, the current proportion threshold cannot serve as the first proportion threshold; the current proportion threshold is updated according to the down-regulation value, that is, the down-regulation value is subtracted from the current proportion threshold, and the step of determining the fourth participles from the third participles and the subsequent steps are executed again with the updated current proportion threshold. The second proportion threshold may be determined according to actual requirements and may be set to 90%, for example.

In addition, a data source may correspond to a first scale threshold, and in the process of determining the first scale threshold corresponding to the data source, when the current scale threshold is determined for the first time, the initial scale threshold is determined as the current scale threshold. The initial proportional threshold may be predetermined according to the actual demand, and may be set to 100%, for example.

It should be noted that, in contrast to setting the first proportion threshold manually, this approach determines, for the current proportion threshold, the second proportion of texts whose remaining number of words is equal to or greater than the word number threshold among the texts belonging to the target data source. When the second proportion exceeds the second proportion threshold, the current proportion threshold is lowered and the second proportion is determined again, until the second proportion no longer exceeds the second proportion threshold, at which point the current proportion threshold is determined as the first proportion threshold. In this way, the computer device determines the first proportion threshold automatically, and the accuracy of the determined first proportion threshold is improved.
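The loop of steps S602 to S610 can be sketched as below. This is a sketch under stated assumptions rather than the definitive procedure: the down-regulation value (`step=0.05`) is not given in the source, the stop condition that prevents the threshold from reaching zero is an added safeguard, rounding is used only to avoid floating-point drift, and the function and parameter names are hypothetical.

```python
def determine_first_proportion_threshold(segmented_texts, word_number_threshold,
                                         second_proportion_threshold=0.9,
                                         initial_threshold=1.0, step=0.05):
    """Lower the current proportion threshold until the second proportion
    no longer exceeds the second proportion threshold."""
    total = len(segmented_texts)
    if total == 0:
        return initial_threshold
    # First proportion of each third participle (share of texts containing it).
    counts = {}
    for words in segmented_texts:
        for w in set(words):
            counts[w] = counts.get(w, 0) + 1
    first_proportion = {w: n / total for w, n in counts.items()}

    current = initial_threshold
    while True:
        # S602: fourth participles have a first proportion >= current threshold.
        fourth = {w for w, p in first_proportion.items() if p >= current}
        # S604: remaining number of words per text after removing them.
        remaining = [sum(1 for w in words if w not in fourth)
                     for words in segmented_texts]
        # S606: share of texts that still have enough words left.
        second = sum(1 for n in remaining if n >= word_number_threshold) / total
        # S608: accept the current threshold.
        if second <= second_proportion_threshold:
            return current
        # Safeguard (assumption): never lower the threshold to zero or below.
        if current - step <= 0:
            return current
        # S610: subtract the down-regulation value and repeat.
        current = round(current - step, 10)


ct1 = ["Pc3-1", "Pc3-2", "Pc3-3", "Pc3-4", "Pc3-5"]
ct2 = ["Pc3-1", "Pc3-2", "Pc3-6", "Pc3-7", "Pc3-8"]
ct3 = ["Pc3-6", "Pc3-7", "Pc3-8", "Pc3-9", "Pc3-10"]
threshold = determine_first_proportion_threshold([ct1, ct2, ct3], 3)
```

In this toy run the first proportions are computed from the three texts themselves, so the fourth participles differ from the hypothetical set listed in the fig. 7 example. The threshold falls from 100% in 5% steps and stops at 65%, the first value at which participles appearing in two of the three texts are filtered out.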

In an embodiment, the step of obtaining the keywords of the text to be processed according to each second word segmentation may include the following steps: carrying out permutation and combination according to the second participles to obtain fifth participles; determining a sixth word segmentation from the fifth word segmentation; determining a seventh participle from the sixth participles; and obtaining the keywords of the text to be processed according to the seventh word segmentation.

A fifth participle comprises at least two consecutive adjacent second participles. Specifically, after the first participles belonging to the target filtering word bank are removed from the first participles, the remaining first participles (namely, the second participles) are combined according to a preset permutation and combination rule to obtain all combined words comprising at least two consecutive adjacent second participles, and each combined word is a fifth participle.

A sixth participle may be a fifth participle that belongs to the existing entries. The existing entries may include encyclopedia entries, which are entries that can be found through an encyclopedia search service. For example, the encyclopedia entries may include entries included in Baidu Baike and entries included in Wikipedia. Specifically, the computer device may determine whether each fifth participle belongs to the existing entries, and then take the fifth participles that belong to the existing entries as the sixth participles.

A seventh participle is a sixth participle that is not contained in any other sixth participle. If participle B contains all the content of participle A, participle A is contained in participle B; if participle B contains only part of the content of participle A, and not all of it, participle A is not contained in participle B (participle A and participle B are any two different participles, and the labels "A" and "B" serve only to distinguish them).

In an embodiment, the keywords of the text to be processed are obtained according to the seventh word segmentation, and specifically, the seventh word segmentation may be used as the keywords of the text to be processed.
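The chain from second participles to keywords can be sketched as follows. The sketch assumes that adjacency is taken over the sequence of second participles, that the set `existing_entries` stands in for an encyclopedia entry lookup, and that containment is tested as substring containment; the function name and the `joiner` parameter are illustrative.

```python
def extract_keywords(second_participles, existing_entries, joiner=""):
    """Derive keywords (seventh participles) from the second participles."""
    n = len(second_participles)
    # Fifth participles: every run of at least two adjacent second participles.
    fifth = {joiner.join(second_participles[i:j])
             for i in range(n) for j in range(i + 2, n + 1)}
    # Sixth participles: fifth participles that are existing entries.
    sixth = {w for w in fifth if w in existing_entries}
    # Seventh participles: sixth participles not contained in another one.
    seventh = {w for w in sixth
               if not any(w != other and w in other for other in sixth)}
    return seventh


entries = {"new york", "new york city", "city marathon"}
keywords = extract_keywords(["new", "york", "city", "marathon"], entries, joiner=" ")
```

Here "new york" is dropped because it is contained in "new york city", leaving "new york city" and "city marathon" as keywords. For Chinese text the default empty `joiner` would be used.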

In one embodiment, the step of acquiring semantic description information corresponding to each keyword, that is, step S204, may include the following steps: acquiring network search information corresponding to each keyword respectively; and obtaining semantic description information respectively corresponding to the keywords according to the network search information respectively corresponding to the keywords.

The network search information corresponding to a keyword can be obtained by searching for the keyword through a network search service. The network search service may be an Internet-based information search service, and may include at least one of a web page search service and an encyclopedia search service. Web page search services include, for example, the Baidu web search service and the Google web search service. Encyclopedia search services include, but are not limited to, the Baidu Baike search service and the Wikipedia search service.

In one embodiment, the network search information may include target search results from searching the keyword through a network search service. The target search result may include several search results with the highest relevance among all the search results obtained by the search (the specific number may be set according to actual requirements). For example, generally, the search results obtained by searching for the keyword through the web search service are sorted from high to low in relevancy, and the relevancy decreases from front to back, so that the top 50 search results can be used as the target search results.

In an embodiment, the semantic description information corresponding to the keyword is obtained according to the web search information corresponding to the keyword, and specifically, the semantic description information corresponding to the keyword includes the web search information corresponding to the keyword.

In another embodiment, in combination with the foregoing description, the semantic description information corresponding to the keyword may also be determined together according to the expert description information corresponding to the keyword and the network search information corresponding to the keyword. Specifically, the semantic description information corresponding to the keyword may include both the web search information corresponding to the keyword and the expert description information corresponding to the keyword.

In one embodiment, the step of obtaining the search results corresponding to the keywords respectively may include the following steps: and calling a network search service to search the keywords respectively to obtain network search results corresponding to the keywords respectively.

In this embodiment, each time a web search result corresponding to a keyword needs to be obtained, the web search service is temporarily invoked to search the keyword, so as to obtain a web search result corresponding to the keyword.

In one embodiment, the step of obtaining the network search information corresponding to each keyword may include the following steps: respectively searching candidate keywords corresponding to the keywords in a local information base; recording a matching relation between the candidate keywords and candidate search information by a local information base, wherein the candidate search information is obtained by searching corresponding candidate keywords through a network search service; when the candidate keywords corresponding to the keywords are found, obtaining network search information corresponding to the keywords according to the candidate search information matched with the found candidate keywords; and when the candidate keyword corresponding to the keyword is not found, calling a network search service to search the keyword to obtain network search information corresponding to the keyword.

In this embodiment, the network search service may be invoked in advance to search each candidate keyword, and the target search results obtained for each candidate keyword serve as the candidate search information of that candidate keyword. A database recording the candidate keywords, the candidate search information, and the matching relationship between them is then generated, and the content of the database is stored in the computer device to obtain the local information base.

When network search information corresponding to the keywords needs to be acquired subsequently, the computer device can directly search the candidate keywords corresponding to the keywords in the local information base. The candidate keywords corresponding to the keywords are found in the local information base, which indicates that the keywords are searched by calling the network search service in advance, and at this moment, the candidate search information matched with the candidate keywords corresponding to the keywords can be directly used as the network search information corresponding to the keywords without repeatedly calling the network search service to search the keywords.

On the contrary, the fact that the candidate keyword corresponding to the keyword is not found in the local information base indicates that the keyword is not searched by invoking the web search service in advance, and the candidate search information which can be used as the web search information corresponding to the keyword is not stored in the local information base. At this time, the computer device may temporarily invoke the web search service to search for the keyword, and the searched target search result corresponding to the keyword is the web search information corresponding to the keyword. In addition, the keyword and the searched target search result corresponding to the keyword can be used as new candidate keywords and new candidate search information, and are updated to a local information base and a database in which the candidate keywords, the candidate search information and the matching relationship between the candidate keywords and the candidate search information are recorded.

It should be noted that first looking up the candidate keyword corresponding to the keyword in the local information base, and invoking the network search service only when it is not found, can greatly improve the efficiency of determining the category to which the text to be processed belongs. In practical applications, the number of keywords that frequently appear in texts is relatively limited. Once candidate keywords on the order of ten million have accumulated in the local information base, the external network search service rarely needs to be invoked to obtain the network search information corresponding to a keyword, so category determination for a large volume of texts can be completed very efficiently.

In addition, the database in which the candidate keywords, the candidate search information, and the matching relationship between the candidate keywords and the candidate search information are recorded may be updated at regular time intervals. For example, every predetermined number of days, the network search service is recalled to search for each candidate keyword in the database, so as to update each candidate search information corresponding to each candidate keyword.
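The lookup-then-search behavior described above amounts to a cache in front of the network search service. A minimal sketch follows, assuming the local information base is a dictionary from candidate keyword to candidate search information and that `search_service` is a hypothetical callable returning results sorted from most to least relevant. Neither is an API defined by the source.

```python
def get_search_info(keyword, local_base, search_service, top_n=50):
    """Return network search information for a keyword, preferring the local base."""
    # Found locally: reuse the stored candidate search information.
    if keyword in local_base:
        return local_base[keyword]
    # Not found: invoke the network search service and keep the top results.
    results = search_service(keyword)[:top_n]
    # Record the new candidate keyword and its search information.
    local_base[keyword] = results
    return results


calls = []
fake_service = lambda k: (calls.append(k), [f"{k}-{i}" for i in range(100)])[1]
base = {"tiger": ["cached result"]}
hit = get_search_info("tiger", base, fake_service)
miss = get_search_info("pi", base, fake_service)
again = get_search_info("pi", base, fake_service)
```

The stand-in service is called only once, for the first lookup of "pi". Periodic refreshing of stored entries, as described above, would be a separate job that re-runs the search for each stored candidate keyword.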

In an embodiment, the step of obtaining semantic description information corresponding to each keyword according to the network search information corresponding to each keyword may include the following steps: and respectively carrying out data cleaning on the network search information respectively corresponding to the keywords to obtain semantic description information respectively corresponding to the keywords.

Data cleaning removes information irrelevant to the keyword itself from the network search information corresponding to the keyword. Accordingly, the semantic description information corresponding to the keyword may include the information left after the irrelevant information is removed. The information irrelevant to the keyword itself may include, but is not limited to, dates, website names, video playing information, music playing information, and common websites.

For example, suppose the network search information of the keyword is "2018, Douban 8.5, comparable to Life of Pi _Sohu Entertainment_Sohu.com [online play]". After data cleaning, the following information is removed: "2018", "_Sohu Entertainment", "_Sohu.com", and "[online play]".
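A rule-based cleaner of this kind might look like the sketch below. The patterns, the list of website names, and the sample string (an English rendering chosen for illustration) are all assumptions; the source does not specify how the irrelevant information is detected.

```python
import re

# Hypothetical cleaning rules (not taken from the source).
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")  # four-digit years as dates
PLAYBACK = re.compile(r"\[[^\]]*play[^\]]*\]", re.IGNORECASE)  # playback labels


def clean_search_info(text, site_names):
    """Remove dates, playback labels, and website names from search information."""
    text = YEAR.sub("", text)
    text = PLAYBACK.sub("", text)
    for name in site_names:
        # Site names often follow an underscore separator in page titles.
        text = text.replace("_" + name, "").replace(name, "")
    # Collapse whitespace and trim leftover separators at both ends.
    return re.sub(r"\s+", " ", text).strip(" ,")


raw = "2018, Douban 8.5, comparable to Life of Pi _Sohu Entertainment_Sohu.com [online play]"
cleaned = clean_search_info(raw, ["Sohu Entertainment", "Sohu.com"])
```

The rating "8.5" survives because the year pattern only matches standalone four-digit years. A production cleaner would likely need a maintained list of website names rather than a hand-written one.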

In an embodiment, the step of determining the first relevance of each keyword to each candidate category according to each semantic description information, that is, step S206, may include the following steps: determining a third degree of correlation between each semantic description information and each candidate category according to each semantic description information and the category name of each candidate category; and determining the first correlation degree of each keyword and each candidate category according to each third correlation degree.

The category name of a candidate category is the name of that candidate category. The category name may include only a single level, such as "mobile phone app". The category name may also include more than one level, in which case the levels may be separated by a predetermined connector. For example, "mobile phone app-game-moba" is a category name with 3 levels, separated by the connector "-".

The third correlation degree between the semantic description information and a candidate category is determined according to the semantic description information and the category name of the candidate category, and is a metric that measures how well the semantic description information matches the candidate category. The value range of the third correlation degree may be [0, 1]. The larger the third correlation degree, the higher the matching degree between the semantic description information and the candidate category; conversely, the smaller the third correlation degree, the lower the matching degree between the semantic description information and the candidate category.

According to the foregoing description, for each keyword, the computer device may determine, according to the semantic description information corresponding to the keyword and the category description information of each candidate category, a first relevance between the keyword and each candidate category. In this embodiment, the category description information of the candidate category may include a category name of the candidate category, and accordingly, for each keyword, the computer device determines, according to the semantic description information corresponding to the keyword and the category name of each candidate category, a third degree of correlation between the semantic description information corresponding to the keyword and each candidate category, respectively. And then, according to the third correlation degree between the semantic description information corresponding to the keyword and each candidate category, determining the first correlation degree between the keyword and each candidate category.

For example, as shown in fig. 8, the keywords of the text LT1 to be processed are the keyword Kw1, the keyword Kw2, and the keyword Kw3, the keyword Kw1 corresponds to the semantic description information Sd1, the keyword Kw2 corresponds to the semantic description information Sd2, and the keyword Kw3 corresponds to the semantic description information Sd3, and the candidate categories are the candidate category C1, the candidate category C2, and the candidate category C3, respectively.

Accordingly, the computer device determines a third degree of correlation of the semantic description information Sd1 with the candidate category C1 according to the semantic description information Sd1 and the category name of the candidate category C1, thereby determining a first degree of correlation of the keyword Kw1 with the candidate category C1 according to the third degree of correlation of the semantic description information Sd1 with the candidate category C1.

The computer device determines a third degree of correlation between the semantic description information Sd1 and the candidate category C2 according to the semantic description information Sd1 and the category name of the candidate category C2, thereby determining a first degree of correlation between the keyword Kw1 and the candidate category C2 according to the third degree of correlation between the semantic description information Sd1 and the candidate category C2.

The computer device determines a third degree of correlation between the semantic description information Sd1 and the candidate category C3 according to the semantic description information Sd1 and the category name of the candidate category C3, thereby determining a first degree of correlation between the keyword Kw1 and the candidate category C3 according to the third degree of correlation between the semantic description information Sd1 and the candidate category C3.

By analogy, the first correlation degrees of the keyword Kw2 with the candidate category C1, the candidate category C2 and the candidate category C3 respectively are determined, and the first correlation degrees of the keyword Kw3 with the candidate category C1, the candidate category C2 and the candidate category C3 respectively are determined.

In one embodiment, the third degree of correlation between the semantic description information corresponding to the keyword and the candidate category is the first degree of correlation between the keyword and the candidate category. For example, the third correlation between the semantic description information Sd1 and the candidate category C1 is the first correlation between the keyword Kw1 and the candidate category C1.

In an embodiment, the step of determining the third correlation degree between each semantic description information and each candidate category according to each semantic description information and the category name of each candidate category may include the following steps: determining the common words of each semantic description information and the category name of each candidate category; determining, from those common words, the target common words of each semantic description information and the category name of each candidate category; determining a third proportion, which is the ratio of the total word length of the target common words to the total word length of the category name of the candidate category; and determining the third correlation degree between each semantic description information and each candidate category according to each third proportion, the first word frequency of the target common words in the corresponding semantic description information, and the first inverse document frequency of the target common words.

The common word of the semantic description information and the category name of the candidate category is a participle commonly contained in the semantic description information and the category name of the candidate category. For example, the semantic description information is "royal glory" which is a moba-type mobile phone game operated on android and ios platform developed and run by Tencent game, "the name of the candidate category is" mobile phone app-game-moba, "and the common words of the two are" mobile phone, "hand," "machine," "game," "moba," "m," "o," "b," "a," "mo," "ob," "ba," and so on.

In the present embodiment, for each semantic description information, the computer device determines a common word of the semantic description information and a category name of each candidate category, respectively. For example, there are 3 semantic descriptors: semantic description information Sd1, semantic description information Sd2, and semantic description information Sd3, and there are 3 candidate categories: candidate category C1, candidate category C2, and candidate category C3, then common words of semantic description information Sd1 and candidate category C1, common words of semantic description information Sd1 and candidate category C2, and common words of semantic description information Sd1 and candidate category C3 are determined, similarly, common words of semantic description information Sd2 and candidate category C1, candidate category C2, and candidate category C3, respectively, and common words of semantic description information Sd3 and candidate category C1, candidate category C2, and candidate category C3, respectively, are determined.

A target common word of the semantic description information and the category name of the candidate category is a common word that is not contained in any other common word of the semantic description information and the category name of the candidate category. Similar to the foregoing limitation on the seventh participle, if the common word D includes all the contents of the common word C, the common word C is contained in the common word D; if the common word D includes only part of the contents of the common word C and does not include all of them, the common word C is not contained in the common word D (the common word C and the common word D are any two different common words, and the labels "C" and "D" serve only to distinguish them).

For example, suppose the common words of the semantic description information and the category name of the candidate category are "mobile phone" (手机), "hand" (手) and "machine" (机). Since "mobile phone" includes all the contents of "hand", "hand" is contained in "mobile phone", so "hand" is not a target common word. Likewise, "mobile phone" includes all the contents of "machine", so "machine" is contained in "mobile phone" and is not a target common word either. Only "mobile phone" is not contained in any common word other than itself (it is contained in neither "hand" nor "machine"), and therefore the target common word determined from these 3 common words is "mobile phone".
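The selection rule in this example can be sketched as follows, assuming that "contained in" means plain substring containment. The function name `target_common_words` is hypothetical, and the Chinese strings are assumed originals of the translated words "mobile phone", "hand" and "machine".

```python
def target_common_words(common_words: set[str]) -> set[str]:
    """Keep only the common words that are not contained in any other common word."""
    return {
        word
        for word in common_words
        if not any(word != other and word in other for other in common_words)
    }


# Assumed Chinese originals of "mobile phone", "hand" and "machine".
print(target_common_words({"手机", "手", "机"}))  # {'手机'}
```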

In this embodiment, for each piece of semantic description information, the computer device determines a target common word of the semantic description information and the category name of each candidate category from the common words of the semantic description information and the category name of each candidate category, respectively.

In connection with the foregoing example, the computer device determines a target common word of the semantic description information Sd1 and the candidate category C1 from the common words of the semantic description information Sd1 and the candidate category C1, determines a target common word of the semantic description information Sd1 and the candidate category C2 from the common words of the semantic description information Sd1 and the candidate category C2, and determines a target common word of the semantic description information Sd1 and the candidate category C3 from the common words of the semantic description information Sd1 and the candidate category C3.

The computer device determines target common words of the semantic description information Sd2 and the candidate category C1, the candidate category C2 and the candidate category C3 from the common words of the semantic description information Sd2 and the candidate category C1, the candidate category C2 and the candidate category C3 respectively.

The computer device determines target common words of the semantic description information Sd3 and the candidate category C1, the candidate category C2 and the candidate category C3 from the common words of the semantic description information Sd3 and the candidate category C1, the candidate category C2 and the candidate category C3 respectively.

The third ratio is a ratio of the total word length of the target common words of the semantic description information and the category name of the candidate category to the total word length of the category name of the candidate category. For example, if the target name of the candidate category is "mobile app-game-moba", and assuming that the target common words of the semantic description information and the category name of the candidate category are "mobile phone", "game" and "moba", the total word length of the semantic description information and the target common words of the category name of the candidate category is 8 (the "mobile phone" is 2, the "game" is 2, the "moba" is 4, and the total is 8), the total word length of the category name of the candidate category is 11 (since the total word length is calculated, 3 "-" connectors are not included, and the total length of the "mobile app game moba" is 11), the third ratio is that

。

In this embodiment, for each semantic description information, the computer device determines a third ratio of the total word length of the semantic description information and the target common word of the category name of each candidate category to the total word length of the category name of each candidate category, respectively.

In connection with the foregoing example, the computer device determines a third ratio of the total word length of the target common words of the semantic description information Sd1 and the candidate category C1 to the total word length of the category name of the candidate category C1, determines a third ratio of the total word length of the target common words of the semantic description information Sd1 and the candidate category C2 to the total word length of the category name of the candidate category C2, and determines a third ratio of the total word length of the target common words of the semantic description information Sd1 and the candidate category C3 to the total word length of the category name of the candidate category C3.

The computer device determines a third ratio of the total word length of the semantic description information Sd2 and the target common words of the candidate category C1, the candidate category C2, and the candidate category C3, respectively, to the total word length of the category names of the candidate category C1, the candidate category C2, and the candidate category C3, respectively.

The computer device determines a third ratio of the total word length of the semantic description information Sd3 and the target common words of the candidate category C1, the candidate category C2, and the candidate category C3, respectively, to the total word length of the category names of the candidate category C1, the candidate category C2, and the candidate category C3, respectively.

The first word frequency, in the semantic description information, of a target common word of the semantic description information and the category name of the candidate category is the frequency with which the target common word appears in the semantic description information. For example, if the semantic description information is "Honor of Kings is a MOBA-type mobile phone game, developed and operated by Tencent Games, that runs on the Android and iOS platforms", and the target common words of the semantic description information and the category name of the candidate category are "mobile phone", "game" and "moba", then the first word frequency of each of these 3 target common words in the semantic description information is 1.

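Similar to the foregoing definition of the inverse document frequency of the keyword, the first inverse document frequency of a target common word of the semantic description information and the category name of the candidate category may be determined from the number of objects containing the target common word among all the objects of a target corpus and the number of all objects in the target corpus.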

In one embodiment, the target corpus may be a corpus corresponding to a web search service. Accordingly, the number of objects containing the target common word among all the objects of the target corpus may be the total number of search results obtained by calling the web search service to search for the target common word. The number of all objects in the target corpus may be set to a predetermined value, such as 1 + 100000000.

In this embodiment, for each semantic description information, the computer device determines a third degree of correlation between the semantic description information and each candidate category name respectively according to a third ratio of the total word length of the semantic description information and the target common word of each candidate category name to the total word length of each candidate category name, a first word frequency of the semantic description information and the target common word of each candidate category name respectively, and a first inverse document frequency of the semantic description information and the target common word of each candidate category name respectively.

In one embodiment, for any semantic description information and any candidate category, the third degree of correlation between the semantic description information and the candidate category may be computed from the following quantities:

N, the total number of target common words of the semantic description information and the category name of the candidate category, N being an integer equal to or greater than 1;

the third ratio, that is, the proportion of the total word length of the N target common words of the semantic description information and the category name of the candidate category to the total word length of the category name of the candidate category;

the first word frequency, in the semantic description information, of the i-th target common word among the N target common words; and

the first inverse document frequency of the i-th target common word among the N target common words.

In another embodiment, for any semantic description information and any candidate category, the third degree of correlation between the semantic description information and the candidate category may also be computed with one additional quantity: the word length of the i-th target common word among the N target common words.

It should be noted that, when the calculated value of the third correlation between the semantic description information and the candidate category is greater than 1, it may be set to 1.
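The exact expressions for the third degree of correlation are not recoverable from this text, so the following is only an illustrative sketch under a stated assumption: the third ratio multiplies the sum, over the target common words, of word frequency times inverse document frequency (optionally also times word length), and the result is capped at 1 as described above. The function name `third_correlation` and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def third_correlation(
    ratio: float,
    frequencies: Sequence[float],
    idfs: Sequence[float],
    word_lengths: Optional[Sequence[int]] = None,
) -> float:
    """Assumed combination: ratio * sum(tf_i * idf_i [* length_i]), capped at 1."""
    if word_lengths is None:
        # Variant without word lengths: every term has weight 1.
        word_lengths = [1] * len(frequencies)
    total = sum(
        tf * idf * length
        for tf, idf, length in zip(frequencies, idfs, word_lengths)
    )
    # Values greater than 1 are set to 1, as stated in the text.
    return min(1.0, ratio * total)


print(third_correlation(0.5, [1, 1], [0.2, 0.4]))
print(third_correlation(8 / 11, [1, 1, 1], [5.0, 6.0, 7.0]))  # capped at 1.0
```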

In one embodiment, after the common words of the semantic description information and the category name are determined, they may be stored as a set, that is, a common word set is formed; likewise, after the target common words of the semantic description information and the category name are determined, they may be stored as a set, that is, a target common word set is formed.

In one embodiment, for semantic description information and category names that contain uppercase English characters, the uppercase characters may be converted to lowercase before the common words of the semantic description information and the category names are determined, so as to unify the data format.

In one embodiment, the method for determining text categories may further include the following step: determining a fourth degree of correlation between each semantic description information and each candidate category according to each semantic description information, the predetermined category related words of each candidate category, and the predetermined correlation coefficients between those predetermined category related words and the corresponding candidate categories. Accordingly, the step of determining the first degree of correlation between each keyword and each candidate category according to each third degree of correlation may include the following step: determining the first degree of correlation between each keyword and each candidate category according to each third degree of correlation and each fourth degree of correlation.

The predetermined category related word of the candidate category is a word determined manually and having a correlation with the candidate category. And the correlation coefficient between the preset category relevant word of the candidate category and the candidate category is used for representing the correlation condition between the preset category relevant word and the candidate category. The preset category related words of the candidate categories and the correlation coefficients between the preset category related words of the candidate categories and the candidate categories can be determined manually in advance according to experience accumulated in actual services.

The value range of the correlation coefficient is [-1, +1]. When the correlation coefficient between a predetermined category related word of the candidate category and the candidate category is positive, the predetermined category related word and the candidate category are positively correlated: the larger the correlation coefficient, the higher the degree of positive correlation, and the smaller the correlation coefficient, the lower the degree of positive correlation. When the correlation coefficient is negative, the predetermined category related word and the candidate category are negatively correlated: the larger the correlation coefficient (the closer to 0), the lower the degree of negative correlation, and the smaller the correlation coefficient (the closer to -1), the higher the degree of negative correlation.

The fourth degree of correlation between the semantic description information and the candidate category is determined according to the semantic description information, the predetermined category related words of the candidate category, and the correlation coefficients between those words and the candidate category. It is a metric for the degree of matching between the semantic description information and the candidate category. The value range of the fourth degree of correlation may be [0, +1]: the larger the fourth degree of correlation, the higher the degree of matching between the semantic description information and the candidate category as judged from the predetermined category related words and their correlation coefficients; conversely, the smaller the fourth degree of correlation, the lower that degree of matching.

According to the foregoing description, for each keyword, the computer device may determine a first degree of correlation between the keyword and each candidate category according to the semantic description information corresponding to the keyword and the category description information of each candidate category. In this embodiment, the category description information of a candidate category may include the predetermined category related words of the candidate category and the predetermined correlation coefficients between those words and the candidate category. For each keyword, the computer device determines a fourth degree of correlation between the semantic description information corresponding to the keyword and each candidate category according to that semantic description information, the predetermined category related words of each candidate category, and the predetermined correlation coefficients between those words and the corresponding candidate category. The computer device then determines the first degree of correlation between the keyword and each candidate category according to the third degree of correlation and the fourth degree of correlation between the semantic description information corresponding to the keyword and each candidate category.

For example, the keywords of the text LT1 to be processed are the keyword Kw1, the keyword Kw2, and the keyword Kw3, the keyword Kw1 corresponds to the semantic description information Sd1, the keyword Kw2 corresponds to the semantic description information Sd2, and the keyword Kw3 corresponds to the semantic description information Sd3, and the candidate categories are the candidate category C1, the candidate category C2, and the candidate category C3, respectively.

Accordingly, as shown in fig. 9, the computer device determines a third degree of correlation between the semantic description information Sd1 and the candidate category C1 according to the semantic description information Sd1 and the category name of the candidate category C1, determines a fourth degree of correlation between the semantic description information Sd1 and the candidate category C1 according to the semantic description information Sd1, the predetermined category related words of the candidate category C1, and the correlation coefficients between those words and the candidate category C1, and then determines a first degree of correlation between the keyword Kw1 and the candidate category C1 according to both the third degree of correlation and the fourth degree of correlation between the semantic description information Sd1 and the candidate category C1.

The computer device determines a third degree of correlation between the semantic description information Sd1 and the candidate category C2 according to the semantic description information Sd1 and the category name of the candidate category C2, determines a fourth degree of correlation between the semantic description information Sd1 and the candidate category C2 according to the semantic description information Sd1, the predetermined category related words of the candidate category C2, and the correlation coefficients between those words and the candidate category C2, and then determines a first degree of correlation between the keyword Kw1 and the candidate category C2 according to both the third degree of correlation and the fourth degree of correlation between the semantic description information Sd1 and the candidate category C2.

The computer device determines a third degree of correlation between the semantic description information Sd1 and the candidate category C3 according to the semantic description information Sd1 and the category name of the candidate category C3, determines a fourth degree of correlation between the semantic description information Sd1 and the candidate category C3 according to the semantic description information Sd1, the predetermined category related words of the candidate category C3, and the correlation coefficients between those words and the candidate category C3, and then determines a first degree of correlation between the keyword Kw1 and the candidate category C3 according to both the third degree of correlation and the fourth degree of correlation between the semantic description information Sd1 and the candidate category C3.

By analogy, the computer device determines the first degree of correlation between the keyword Kw2 and each of the candidate category C1, the candidate category C2 and the candidate category C3, and determines the first degree of correlation between the keyword Kw3 and each of the candidate category C1, the candidate category C2 and the candidate category C3.

Specifically, for any keyword and any candidate category, the third degree of correlation and the fourth degree of correlation between the semantic description information corresponding to the keyword and the candidate category may be added together in a plain (unweighted) summation to obtain the first degree of correlation between the keyword and the candidate category. For example, the third degree of correlation between the semantic description information Sd1 and the candidate category C1 and the fourth degree of correlation between the semantic description information Sd1 and the candidate category C1 are summed to obtain the first degree of correlation between the keyword Kw1 and the candidate category C1.

Alternatively, weights may be set for the third degree of correlation and the fourth degree of correlation respectively, and a weighted summation is performed over the third degree of correlation between the semantic description information corresponding to the keyword and the candidate category, the weight corresponding to the third degree of correlation, the fourth degree of correlation between that semantic description information and the candidate category, and the weight corresponding to the fourth degree of correlation, so as to obtain the first degree of correlation between the keyword and the candidate category. For example, the first degree of correlation between the keyword Kw1 and the candidate category C1 may be obtained by a weighted summation of the third degree of correlation between the semantic description information Sd1 and the candidate category C1, with its corresponding weight, and the fourth degree of correlation between the semantic description information Sd1 and the candidate category C1, with its corresponding weight.
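The two ways of combining the third and fourth degrees of correlation can be sketched in one function, where default weights of 1 give the plain summation. The function name `first_correlation`, its parameter names and the sample values are hypothetical.

```python
def first_correlation(
    third: float,
    fourth: float,
    third_weight: float = 1.0,
    fourth_weight: float = 1.0,
) -> float:
    """Weighted sum of the third and fourth degrees of correlation.

    With both weights left at 1.0 this is the plain (unweighted) summation.
    """
    return third_weight * third + fourth_weight * fourth


plain = first_correlation(0.4, 0.3)               # plain summation
weighted = first_correlation(0.4, 0.3, 0.5, 0.5)  # weighted summation
print(plain, weighted)
```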

In specific implementation, an associated knowledge base can be constructed in advance, and the associated knowledge base comprises a plurality of pieces of manually determined associated knowledge. In this case, the computer device may obtain the predetermined correlation coefficient between the predetermined category related word of each candidate category and the corresponding candidate category according to the association knowledge base, and then determine the fourth correlation degree between each semantic description information and each candidate category according to each semantic description information and the predetermined correlation coefficient between the predetermined category related word of each candidate category and the corresponding candidate category.

In one embodiment, the data format of the associated knowledge may be "category identification of the candidate category (such as category ID), category name of the candidate category, predetermined category associated word of the candidate category, and correlation coefficient of the predetermined category associated word and the candidate category".

Three examples of associated knowledge are as follows:

1, mobile app-game-moba, Honor of Kings, 0.8

1, mobile app-game-moba, ranking, 0.2

1, mobile app-game-moba, League of Legends, -0.9

Since "Honor of Kings" is one of the games belonging to "mobile app-game-moba", its degree of positive correlation with the candidate category "mobile app-game-moba" is very high, and the correlation coefficient between "Honor of Kings" and "mobile app-game-moba" can be manually set to 0.8. "Ranking" is only weakly related to "mobile app-game-moba", so the correlation coefficient of the two can be manually set to 0.2. "League of Legends" is related to "game-moba" but not to "mobile app", so it is a potential confusion word for the candidate category "mobile app-game-moba"; its degree of negative correlation with the candidate category is very high, and the correlation coefficient of the two can be manually set to -0.9.
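As a minimal sketch, one record in the data format described above could be parsed as follows, assuming the four fields are separated by commas and that none of the fields itself contains a comma. The class `AssociatedKnowledge` and the function `parse_associated_knowledge` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociatedKnowledge:
    category_id: str
    category_name: str
    related_word: str
    coefficient: float


def parse_associated_knowledge(line: str) -> AssociatedKnowledge:
    """Parse 'category ID, category name, related word, correlation coefficient'."""
    category_id, category_name, related_word, coefficient = (
        field.strip() for field in line.split(",")
    )
    value = float(coefficient)
    # The correlation coefficient must lie in [-1, +1], as stated in the text.
    if not -1.0 <= value <= 1.0:
        raise ValueError(f"correlation coefficient out of range: {value}")
    return AssociatedKnowledge(category_id, category_name, related_word, value)


record = parse_associated_knowledge("1, mobile app-game-moba, ranking, 0.2")
print(record)
```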

In other embodiments, the first degree of correlation between each keyword and each candidate category may be determined only according to the fourth degree of correlation between each semantic description information and each candidate category, without considering the third degree of correlation between each semantic description information and each candidate category.

It should be noted that, by taking the manually determined predetermined category related words and their predetermined correlation coefficients with the corresponding candidate categories as factors in determining the first degree of correlation between a keyword and a candidate category, manual intervention in the automatic labeling process is allowed. Relevant business personnel can thus improve the quality of the category labeling results according to experience accumulated in real business scenarios, achieving industrial-grade manual controllability and manual optimization.

In addition, in a practical application scenario, the automatic annotation system described above can also provide an associated knowledge entry service. Fig. 10 shows the associated knowledge entry interface: a user may click control 1001, enter the associated knowledge that the user has determined in the associated knowledge input box 1002, and then click control 1003 to complete the manual entry of that associated knowledge. The user may also click control 1004 to modify or delete associated knowledge that has already been entered.

In one embodiment, the step of determining the fourth degree of correlation between each semantic description information and each candidate category according to each semantic description information, the predetermined category related words of each candidate category, and the predetermined correlation coefficients between those words and the corresponding candidate categories may include the following step: determining the fourth degree of correlation between each semantic description information and each candidate category according to the second word frequency of the predetermined category related words of each candidate category in each semantic description information, the second inverse document frequency of the predetermined category related words of each candidate category, and the predetermined correlation coefficients between the predetermined category related words of each candidate category and the corresponding candidate category.

The second word frequency of a predetermined category related word of the candidate category in the semantic description information is the number of times the predetermined category related word appears in the semantic description information. For example, if the predetermined category related word is "Honor of Kings" and the semantic description information is "Honor of Kings is a MOBA-type mobile phone game, developed and operated by Tencent Games, that runs on the Android and iOS platforms", the second word frequency of the predetermined category related word in the semantic description information is 1.

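Similar to the above description of the inverse document frequency of the keyword, the second inverse document frequency of a predetermined category related word of the candidate category may be determined from the number of objects containing the predetermined category related word among all the objects of a target corpus and the number of all objects in the target corpus.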

In one embodiment, the target corpus may be a corpus corresponding to a web search service. Accordingly, the number of objects containing the predetermined category related word among all the objects of the target corpus may be the total number of search results obtained by searching for the predetermined category related word through the web search service, and the number of all objects in the target corpus may be set to a predetermined value, for example 1 + 100000000.

In this embodiment, for each piece of semantic description information, the computer device determines a fourth degree of association between the semantic description information and the category name of each candidate category according to the second word frequency of the predetermined category associated word of each candidate category in the semantic description information, the second inverse document frequency of the predetermined category associated word of each candidate category, and the correlation coefficient between the predetermined category associated word of each candidate category and the corresponding candidate category.

In one embodiment, for any semantic description information and any candidate category, the fourth degree of correlation between the semantic description information and the candidate category may be computed from the following quantities:

M, the total number of predetermined category related words of the candidate category, M being an integer equal to or greater than 1;

the second word frequency, in the semantic description information, of the j-th category related word among the M category related words;

the second inverse document frequency of the j-th category related word among the M category related words; and

the correlation coefficient between the j-th category related word among the M category related words and the candidate category.

In another embodiment, for any semantic description information and any candidate category, the fourth degree of correlation between the semantic description information and the candidate category may also be computed with one additional quantity: the word length of the j-th category related word among the M category related words.

It should be noted that, when the calculated value of the fourth degree of correlation between the semantic description information and the candidate category is greater than 1, it may be set to 1.
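The exact expressions for the fourth degree of correlation are not recoverable from this text, so the following is only an illustrative sketch under stated assumptions: the terms are combined as the sum, over the predetermined category related words, of word frequency times inverse document frequency times correlation coefficient; values above 1 are set to 1 as described above; and negative sums are clamped to 0 on the basis of the stated value range [0, +1]. The function name `fourth_correlation`, the dictionaries and the sample inverse document frequency values are hypothetical.

```python
def fourth_correlation(
    description: str,
    coefficients: dict[str, float],
    idfs: dict[str, float],
) -> float:
    """Assumed combination: sum(tf_j * idf_j * coefficient_j), clamped to [0, 1].

    coefficients maps each predetermined category related word to its
    correlation coefficient in [-1, +1]; idfs maps each word to its
    inverse document frequency.
    """
    total = sum(
        description.count(word) * idfs[word] * coefficient
        for word, coefficient in coefficients.items()
    )
    # Upper cap as stated in the text; lower clamp is an assumption.
    return max(0.0, min(1.0, total))


description = "honor of kings is a moba game with ranking matches"
positive = fourth_correlation(
    description,
    {"honor of kings": 0.8, "ranking": 0.2},
    {"honor of kings": 0.5, "ranking": 1.0},
)
print(positive)
```

A confusion word with a negative coefficient lowers the sum, which is how a word such as a PC-only game title could push a text away from a mobile game category.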

In the above scheme, the semantic description information corresponding to each keyword of the text to be processed is determined according to the network description information corresponding to that keyword; the first relevance of each keyword to each candidate category is determined according to at least one of the third relevance of each semantic description information to each candidate category (determined according to each semantic description information and the category name of each candidate category) and the fourth relevance of each semantic description information to each candidate category (determined according to each semantic description information, the predetermined category related words of each candidate category, and the predetermined correlation coefficients between those words and the corresponding candidate categories); and the second relevance of the text to be processed to each candidate category is then determined according to the first relevances. This rests on a basic assumption: if the keywords extracted from a text are searched through a web search service and the name or the category related words of a certain candidate category appear frequently in the search results obtained, the text is closely related to that candidate category. The assumption can be analyzed as follows: the purpose of a web search service is to provide the content and descriptions most relevant to the search input. For example, the keyword "husky" and the candidate category "pet-dog" are closely related, and when "husky" is searched through the web search service, the words "pet" and "dog" appear frequently in the search results. The basic assumption therefore holds in practical application.

In one embodiment, the method for determining the text category may further include the step of acquiring the priority coefficient of each candidate category. Accordingly, the step of determining the category to which the text to be processed belongs from the candidate categories according to each second relevance, that is, step S210, may include the steps of: determining the fifth relevance of the text to be processed and each candidate category according to each second relevance and the priority coefficient of each candidate category; and determining the category to which the text to be processed belongs from the candidate categories according to each fifth relevance.

The priority coefficient of the candidate category can be determined manually, and is used for representing the priority degree of the candidate category determined by related personnel in actual business.

In this embodiment, for each candidate category, the computer device determines a fifth degree of correlation of the candidate category according to the second degree of correlation of the text to be processed and the candidate category and the priority coefficient of the candidate category. Specifically, the fifth relevance of the candidate category may be obtained by multiplying the second relevance of the text to be processed and the candidate category by the priority coefficient of the candidate category.

In addition, when the category labeling result corresponding to the text to be processed is output, the relevance between the text to be processed and the category to which the text to be processed belongs in the category labeling result may be the fifth relevance between the text to be processed and the category to which the text to be processed belongs.

It should be noted that the second correlation between the text to be processed and each candidate category is modified through the priority coefficient of each candidate category to obtain the fifth correlation between the text to be processed and each candidate category, and then the category to which the text to be processed belongs is determined from each candidate category according to each fifth correlation, so that the candidate categories which need to be considered preferentially can be processed flexibly.
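The priority adjustment can be sketched in Python as follows. The multiplication of the second relevance by the priority coefficient follows the description above; the function names, the default coefficient of 1.0 for categories without an entered priority, and the choice of the highest-scoring category are assumptions made for illustration.

```python
def apply_priority(second_relevance, priority, default=1.0):
    """Fifth relevance = second relevance x priority coefficient.

    Categories without an entered coefficient use `default` (an assumption).
    """
    return {
        category: score * priority.get(category, default)
        for category, score in second_relevance.items()
    }


def pick_category(fifth_relevance):
    """Assumed selection rule: take the category with the highest fifth relevance."""
    return max(fifth_relevance, key=fifth_relevance.get)


second = {"pet-dog": 0.5, "pet-cat": 0.6}
priority = {"pet-dog": 1.5}  # manually entered priority coefficient
fifth = apply_priority(second, priority)
best = pick_category(fifth)
```

Here "pet-cat" has the higher second relevance, but the priority coefficient of 1.5 raises "pet-dog" to 0.75, so the prioritized category is selected.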

In addition, in an actual application scenario, the automatic labeling system described above may also provide a category priority information entry service. As shown in fig. 11, a user may click the control 1101, input user-determined category priority information (such as a category ID, a category name, and a priority coefficient) into the category priority information input box 1102, and then click the control 1103 to complete manual entry of the corresponding category priority information. In addition, the user can click the control 1104 to modify or delete the entered priority information of the corresponding category.

In one embodiment, as shown in FIG. 12, a method for determining text categories is provided. The method may be executed by a computer device, and may specifically include the following steps S1202 to S1224.

S1202, extracting keywords of the text to be processed, and determining the weight of each keyword.

S1204, searching a local information base for the candidate keyword corresponding to each keyword; the local information base records the matching relation between candidate keywords and candidate search information, and the candidate search information is obtained by searching the corresponding candidate keyword through a network search service.

S1206, when the candidate keyword corresponding to a keyword is found, obtaining the network search information corresponding to the keyword according to the candidate search information matched with the found candidate keyword.

S1208, when the candidate keyword corresponding to a keyword is not found, calling the network search service to search the keyword to obtain the network search information corresponding to the keyword.

S1210, performing data cleaning on the network search information corresponding to each keyword to obtain the semantic description information corresponding to each keyword.

S1212, determining the third relevance of each semantic description information and each candidate category according to each semantic description information and the category name of each candidate category.

S1214, determining the fourth relevance of each semantic description information and each candidate category according to each semantic description information, the predetermined category associated words of each candidate category, and the predetermined correlation coefficients between those associated words and the corresponding candidate category; the predetermined category associated words of each candidate category and their correlation coefficients with the corresponding candidate category are determined manually.

S1216, determining the first relevance of each keyword and each candidate category according to each third relevance and each fourth relevance.

S1218, determining the second relevance of the text to be processed and each candidate category according to the weight of each keyword and each first relevance.

S1220, acquiring the priority coefficient of each candidate category, and determining the fifth relevance of the text to be processed and each candidate category according to each second relevance and the priority coefficient of each candidate category.

S1222, determining the category to which the text to be processed belongs from the candidate categories according to each fifth relevance.

S1224, outputting a category labeling result corresponding to the text to be processed, where the category labeling result includes the text to be processed, the category to which the text to be processed belongs, and the fifth relevance of the text to be processed and the category to which it belongs.

It should be noted that specific limitations on each technical feature in this embodiment may be the same as the limitations on the corresponding technical feature in the foregoing, and are not repeated herein.
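Steps S1218 to S1224 can be sketched in Python under stated assumptions. This passage only says that the second relevance is determined "according to" the keyword weights and the first relevances; the weighted sum used below, the default priority coefficient of 1.0, the selection of the highest-scoring category, and all names are assumptions made for illustration.

```python
def second_relevance(keyword_weights, first_relevance, categories):
    """S1218 (assumed form): weighted sum of first relevances over keywords."""
    return {
        category: sum(
            weight * first_relevance.get((keyword, category), 0.0)
            for keyword, weight in keyword_weights.items()
        )
        for category in categories
    }


def label_text(text, keyword_weights, first_relevance, categories, priority):
    """S1218 to S1224: aggregate, apply priority, select, and build the result."""
    second = second_relevance(keyword_weights, first_relevance, categories)
    # S1220: fifth relevance = second relevance x priority coefficient.
    fifth = {c: s * priority.get(c, 1.0) for c, s in second.items()}
    # S1222 (assumed rule): choose the category with the highest fifth relevance.
    best = max(fifth, key=fifth.get)
    # S1224: the labeling result carries the text, its category, and the score.
    return {"text": text, "category": best, "relevance": fifth[best]}


weights = {"husky": 0.7, "food": 0.3}
first = {
    ("husky", "pet-dog"): 0.9,
    ("husky", "food-snack"): 0.1,
    ("food", "pet-dog"): 0.2,
    ("food", "food-snack"): 0.8,
}
result = label_text(
    "husky food guide", weights, first, ["pet-dog", "food-snack"], priority={}
)
```

With these made-up values the text is labeled "pet-dog" with a relevance of 0.69 (0.7 x 0.9 + 0.3 x 0.2).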

It should be understood that, although the steps in the flowcharts referred to in the foregoing embodiments are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in each flowchart may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times. These sub-steps or stages are also not necessarily performed sequentially; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

In addition, the performance of an automatic labeling system using the text category determination method provided by the embodiments of the present application is illustrated as follows: the automatic labeling system is applied to label news titles against a category system comprising 800 candidate categories, each candidate category comprising 4 levels. The first-level category labeling accuracy is 91.5%, the second-level category labeling accuracy is 83.1%, the third-level category labeling accuracy is 78.4%, and the fourth-level category labeling accuracy is 74.1%. Here, labeling accuracy = number of texts correctly labeled with the corresponding candidate category / total number of texts.
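The accuracy metric can be sketched in Python as follows. Representing a 4-level category as a tuple and counting a prediction as correct at level k when its first k levels match the reference label are assumptions about how per-level accuracy is computed; the sample labels are made up and are unrelated to the figures reported above.

```python
def level_accuracy(predicted, gold, level):
    """Share of texts whose predicted category matches the reference label
    on the first `level` levels (assumed definition of per-level accuracy)."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    correct = sum(p[:level] == g[:level] for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up 4-level categories for two news titles.
predicted = [
    ("news", "sports", "football", "league"),
    ("news", "tech", "ai", "nlp"),
]
gold = [
    ("news", "sports", "football", "cup"),
    ("news", "finance", "stocks", "a-share"),
]
accuracies = [level_accuracy(predicted, gold, k) for k in (1, 2, 3, 4)]
```

Under this definition accuracy can only stay the same or fall as the level deepens, which matches the downward trend of the reported figures.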

In addition, the automatic labeling system does not need any manual labeling of training samples in the labeling process. With the traditional approach, 8 million training samples (800 candidate categories, with 10,000 samples labeled for each candidate category) would need to be labeled manually, which requires a large amount of manpower and material resources. Moreover, the quality of manually labeling such a large number of training samples is difficult to guarantee, so the traditional approach is not practically usable for labeling against a complex category system comprising many candidate categories.

In one embodiment, as shown in fig. 13, a device 1300 for determining text categories is provided. The apparatus 1300 may include the following modules 1302 to 1310.

A keyword processing module 1302, configured to extract keywords of a text to be processed, and determine weights of the keywords;

a semantic description information obtaining module 1304, configured to obtain semantic description information corresponding to each keyword;

a first relevancy determining module 1306, configured to determine, according to each semantic description information, first relevancy between each keyword and each candidate category;

a second relevance determining module 1308, configured to determine, according to the weight of each keyword and each first relevance, second relevance between the text to be processed and each candidate category;

a text category determining module 1310, configured to determine, according to each second correlation, a category to which the text to be processed belongs from each candidate category.

The text category determining device extracts keywords of the text to be processed, obtains the weight of each keyword, acquires the semantic description information corresponding to each keyword, determines the first relevance of each keyword and each candidate category according to the semantic description information, determines the second relevance of the text to be processed and each candidate category according to the weight of each keyword and each first relevance, and determines the category to which the text to be processed belongs from the candidate categories according to each second relevance. Therefore, even when no text with a known category is available, the computer device can automatically determine the category to which any text belongs throughout the whole process, so that the step of manually labeling categories is omitted, labor cost is saved, and the quality of the determined category no longer depends on the quality of manual labeling. In addition, the intermediate logic from acquiring the text to be processed to determining its category is understandable to a human, which makes it possible to intervene in the category determination process using manually accumulated business knowledge.

In one embodiment, the keyword processing module 1302 may include the following elements: the first word segmentation acquisition unit is used for carrying out word segmentation on the text to be processed to obtain a plurality of first words of the text to be processed; the second participle obtaining unit is used for removing the first participles belonging to the target filtering word bank from each first participle to obtain one or more second participles; the second segmentation comprises the first segmentation left after the removal; the keyword acquisition unit is used for acquiring keywords of the text to be processed according to the second word segmentation; the target filtering word bank comprises a filtering word bank corresponding to a target data source, and the target data source comprises a data source to which the text to be processed belongs.

In one embodiment, the apparatus 1300 for determining text category may further include the following modules: the third word segmentation acquisition module is used for carrying out word segmentation processing on each text belonging to the target data source to obtain a plurality of third words; the first proportion determining module is used for respectively determining a first proportion corresponding to each third participle; the first proportion corresponding to the third participle comprises: the target data source comprises the proportion of the number of texts containing the third word segmentation to the total number of texts of the target data source; and the target filtering word bank building module is used for building a target filtering word bank according to the third participle of which the first proportion exceeds the first proportion threshold.
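The construction of the target filtering word bank can be sketched in Python. The sketch assumes the texts of the target data source have already been segmented into words; the function name and the sample data are hypothetical.

```python
from collections import Counter


def build_filter_word_bank(tokenized_texts, first_proportion_threshold):
    """Collect words whose first proportion exceeds the threshold.

    The first proportion of a word is the share of texts in the target
    data source that contain the word at least once.
    """
    total = len(tokenized_texts)
    text_counts = Counter()
    for tokens in tokenized_texts:
        text_counts.update(set(tokens))  # count each word once per text
    return {
        word
        for word, count in text_counts.items()
        if count / total > first_proportion_threshold
    }


# Made-up, pre-segmented news titles from one data source.
texts = [
    ["breaking", "news", "dog"],
    ["breaking", "news", "cat"],
    ["breaking", "stock"],
    ["news", "rain"],
]
word_bank = build_filter_word_bank(texts, first_proportion_threshold=0.5)
```

In this example "breaking" and "news" each appear in 3 of 4 texts, so they are treated as source-specific filler and filtered out before keyword extraction.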

In an embodiment, the apparatus 1300 for determining a text category may further include a first proportional threshold determining module, configured to determine a fourth participle from the third participles according to the current proportional threshold; the fourth participles comprise third participles of which the first proportion is equal to or greater than the current proportion threshold; determining the residual word number corresponding to each text belonging to the target data source; the residual word number corresponding to the text is the number of the third participles left after the fourth participle is removed from all the third participles of the text; determining a second proportion of the number of texts of which the remaining number of words is equal to or greater than the threshold number of words to the total number of texts belonging to the target data source; when the second proportion does not exceed the second proportion threshold, determining the current proportion threshold as a first proportion threshold; and when the second proportion exceeds a second proportion threshold value, updating the current proportion threshold value according to the down-regulation numerical value, and returning to the step of determining a fourth word segmentation from all the third word segmentations according to the current proportion threshold value.
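The iterative determination of the first proportion threshold can be sketched in Python. The loop follows the steps described above; the function names, the starting threshold, the guard that stops once the threshold reaches zero, and counting the remaining words with repetitions are assumptions made for illustration.

```python
from collections import Counter


def first_proportions(tokenized_texts):
    """Share of texts containing each word (the first proportion)."""
    total = len(tokenized_texts)
    counts = Counter()
    for tokens in tokenized_texts:
        counts.update(set(tokens))
    return {word: count / total for word, count in counts.items()}


def find_first_proportion_threshold(
    tokenized_texts, start, step, word_count_threshold, second_proportion_threshold
):
    """Lower the current threshold until few enough texts keep many words."""
    if step <= 0:
        raise ValueError("the down-regulation value must be positive")
    proportions = first_proportions(tokenized_texts)
    total = len(tokenized_texts)
    current = start
    while True:
        # Fourth participles: words at or above the current threshold.
        fourth = {w for w, p in proportions.items() if p >= current}
        # Texts whose remaining word count still reaches the word threshold.
        kept = sum(
            1
            for tokens in tokenized_texts
            if sum(t not in fourth for t in tokens) >= word_count_threshold
        )
        second_proportion = kept / total
        # Stop when the second proportion no longer exceeds its threshold;
        # the `current <= 0` guard is an added safety assumption.
        if second_proportion <= second_proportion_threshold or current <= 0:
            return current
        current -= step


texts = [
    ["breaking", "news", "dog"],
    ["breaking", "news", "cat"],
    ["breaking", "stock"],
    ["news", "rain"],
]
threshold = find_first_proportion_threshold(
    texts, start=1.0, step=0.25, word_count_threshold=2,
    second_proportion_threshold=0.5,
)
```

With this data the search starts at 1.0, where every text still keeps at least two words, and stops at 0.75, where filtering "breaking" and "news" leaves every text with a single word.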

In one embodiment, the keyword obtaining unit may include the following sub-units: a fifth participle obtaining subunit, configured to combine the second participles to obtain fifth participles, each fifth participle comprising at least two consecutively adjacent second participles; a sixth participle obtaining subunit, configured to determine sixth participles from the fifth participles, the sixth participles comprising the fifth participles that belong to existing entries; a seventh participle obtaining subunit, configured to determine seventh participles from the sixth participles, each seventh participle being a sixth participle that is not contained in any other sixth participle; and a keyword acquisition subunit, configured to obtain the keywords of the text to be processed according to the seventh participles.
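The fifth-to-seventh participle selection can be sketched in Python. The set `known_entries` stands in for the collection of existing entries, whose source is not specified in this passage; joining words with a space and testing containment by position in the word sequence are assumptions made for illustration.

```python
def extract_longest_entries(second_participles, known_entries, joiner=" "):
    """Return the seventh participles: known multi-word entries that are not
    contained in any other known multi-word entry of the same text."""
    n = len(second_participles)
    sixth = []  # (start, end, phrase) for fifth participles that are entries
    for start in range(n):
        for end in range(start + 2, n + 1):  # at least two adjacent words
            phrase = joiner.join(second_participles[start:end])
            if phrase in known_entries:
                sixth.append((start, end, phrase))
    seventh = []
    for start, end, phrase in sixth:
        contained = any(
            s <= start and end <= e and (s, e) != (start, end)
            for s, e, _ in sixth
        )
        if not contained:
            seventh.append(phrase)
    return seventh


tokens = ["new", "york", "stock", "exchange", "opens"]
entries = {"new york", "stock exchange", "new york stock exchange"}
keywords = extract_longest_entries(tokens, entries)
```

Here "new york" and "stock exchange" are both existing entries, but each lies inside "new york stock exchange", so only the longest entry is kept.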

In one embodiment, the semantic description information obtaining module 1304 may include the following units: a network search information acquisition unit for acquiring network search information corresponding to each keyword; the network search information corresponding to the keyword is obtained by searching the keyword through network search service; and the semantic description information acquisition unit is used for respectively obtaining each semantic description information corresponding to each keyword according to the network search information corresponding to each keyword.

In one embodiment, the network search information acquisition unit may include the following sub-units: the candidate keyword searching subunit is used for searching candidate keywords corresponding to the keywords in the local information base respectively; recording a matching relation between the candidate keywords and candidate search information by a local information base, wherein the candidate search information is obtained by searching corresponding candidate keywords through a network search service; the network search information reading subunit is used for obtaining the network search information corresponding to the keyword according to the candidate search information matched with the searched candidate keyword when the candidate keyword corresponding to the keyword is searched; and the network search information searching subunit is used for calling network search service to search the keyword when the candidate keyword corresponding to the keyword is not found, so as to obtain the network search information corresponding to the keyword.
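The lookup logic of these sub-units can be sketched in Python. The dictionary used as the local information base, the `search_service` callable, and the write-back of new results into the local base are assumptions made for illustration; this passage only describes reading from the local base and falling back to the network search service.

```python
def get_search_info(keyword, local_base, search_service):
    """Return network search information for a keyword.

    Reads from the local information base when the keyword is recorded
    there; otherwise calls the network search service.
    """
    if keyword in local_base:
        return local_base[keyword]
    result = search_service(keyword)
    local_base[keyword] = result  # assumed write-back, not stated in this passage
    return result


calls = []


def fake_search(keyword):
    """Stand-in for a real network search service."""
    calls.append(keyword)
    return [f"snippet about {keyword}"]


local_base = {"husky": ["husky is a pet dog breed"]}
hit = get_search_info("husky", local_base, fake_search)
miss = get_search_info("corgi", local_base, fake_search)
again = get_search_info("corgi", local_base, fake_search)
```

The network search service is called only once, for the first "corgi" lookup, which is the saving the local information base is meant to provide.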

In one embodiment, the apparatus 1300 for determining text category may further include a priority coefficient acquisition module, configured to acquire the priority coefficient of each candidate category. Accordingly, the text category determination module 1310 may include the following units: a fifth relevance determining unit, configured to determine the fifth relevance of the text to be processed and each candidate category according to each second relevance and the priority coefficient of each candidate category; and a text category determining unit, configured to determine the category to which the text to be processed belongs from the candidate categories according to each fifth relevance.

In one embodiment, the first correlation determination module 1306 may include the following: the third correlation determining unit is used for determining the third correlation between each semantic description information and each candidate category according to each semantic description information and the category name of each candidate category; and the first correlation determining unit is used for determining the first correlation between each keyword and each candidate category according to each third correlation.

In one embodiment, the third correlation determination unit may include the following sub-units: a common word determining subunit, configured to determine the common words of each semantic description information and the category name of each candidate category; a target common word determining subunit, configured to determine, from those common words, the target common words of each semantic description information and the category name of each candidate category; a third proportion determining subunit, configured to determine the third proportion, that is, the proportion of the total word length of the target common words of each semantic description information and the category name of each candidate category to the total word length of that category name; and a third relevancy determining subunit, configured to determine the third relevance of each semantic description information and each candidate category according to each third proportion, the first word frequency of each target common word in the corresponding semantic description information, and the first inverse document frequency of each target common word. A target common word of a piece of semantic description information and a category name is a common word that is not contained in any other common word of that semantic description information and that category name.

In one embodiment, the apparatus 1300 for determining text category may further include a fourth correlation degree determining module, configured to determine the fourth relevance of each semantic description information and each candidate category according to each semantic description information, the predetermined category associated words of each candidate category, and the predetermined correlation coefficients between those associated words and the corresponding candidate category. Accordingly, the first correlation determining unit is configured to determine the first relevance of each keyword and each candidate category according to each third relevance and each fourth relevance. The predetermined category associated words of each candidate category and their correlation coefficients with the corresponding candidate category are determined manually.

In one embodiment, the fourth relevance determining module is configured to determine the fourth relevance of each semantic description information to each candidate category according to the second word frequency of the predetermined category relevant word of each candidate category in each semantic description information, the second inverse document frequency of the predetermined category relevant word of each candidate category, and the predetermined relevance coefficient of the predetermined category relevant word of each candidate category to the corresponding candidate category.
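A rough Python sketch of how these three inputs might be combined is given below. The exact formula is defined in the method embodiment and is not reproduced here; the sum of word frequency x inverse document frequency x correlation coefficient, the definition of word frequency as count divided by description length, and all names are assumptions. The method embodiment also refers to the word length of each category associated word, which this sketch omits. Capping the result at 1 follows the note in the method embodiment.

```python
def fourth_relevance(description_tokens, associated_words, idf):
    """Assumed combination: sum of tf x idf x coefficient, capped at 1.

    `associated_words` maps each manually chosen category associated word
    to its manually chosen correlation coefficient; `idf` maps words to
    precomputed inverse document frequencies.
    """
    if not description_tokens:
        return 0.0
    total = len(description_tokens)
    score = 0.0
    for word, coefficient in associated_words.items():
        tf = description_tokens.count(word) / total  # assumed tf definition
        score += tf * idf.get(word, 0.0) * coefficient
    return min(score, 1.0)  # values above 1 are set to 1


description = ["husky", "pet", "dog", "pet"]
associated = {"pet": 0.8, "dog": 1.0}
low = fourth_relevance(description, associated, {"pet": 1.0, "dog": 1.0})
capped = fourth_relevance(description, associated, {"pet": 2.0, "dog": 2.0})
```

In the second call the raw score is 1.3, so the cap returns 1.0.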

It should be noted that, for specific limitations of the device 1300 for determining text categories, reference may be made to the above limitations of the method for determining text categories, and details are not repeated here. The modules in the text category determination apparatus 1300 may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In an embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, which when executed by the processor, causes the processor to perform the steps of the method for determining text categories as provided in any of the embodiments of the present application.

In particular, the computer device may be the server 120 in fig. 1. As shown in fig. 14, the computer apparatus includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor is configured to provide computational and control capabilities. The memory includes a nonvolatile storage medium storing an operating system and a computer program, and an internal memory providing an environment for the operating system and the computer program in the nonvolatile storage medium to run. The network interface is used for communicating with an external terminal through network connection. The computer program is executed by a processor to implement the method for determining text categories provided in any of the embodiments of the present application.

Alternatively, the computer device may be the terminal 110 in fig. 1. As shown in fig. 15, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen, which are connected through a system bus. The memory comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system, and may further store a computer program, which when executed by the processor, may cause the processor to implement the method for determining text categories provided in any embodiment of the present application. The internal memory may also store a computer program, which when executed by the processor, causes the processor to perform the method for determining text categories as provided in any of the embodiments of the present application. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the configurations shown in fig. 14 and 15 are block diagrams of only some of the configurations relevant to the present application, and do not constitute a limitation on the computing devices to which the present application may be applied, and a particular computing device may include more or less components than those shown, or some of the components may be combined, or have a different arrangement of components.

In one embodiment, the apparatus 1300 for determining a text category provided in the embodiments of the present application may be implemented in the form of a computer program, and the computer program may be executed on a computer device as shown in fig. 14 or fig. 15. The memory of the computer device may store the program modules constituting the text category determination apparatus 1300, such as the keyword processing module 1302, the semantic description information obtaining module 1304, and the first relevance determining module 1306 shown in fig. 13. The computer program constituted by these program modules causes the processor to execute the steps of the method for determining text categories of the embodiments of the present application described in this specification. For example, the computer device shown in fig. 14 or fig. 15 may execute step S202 through the keyword processing module 1302 in the text category determination apparatus 1300 shown in fig. 13, execute step S204 through the semantic description information obtaining module 1304, and so on.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Accordingly, in an embodiment, a computer readable storage medium is provided, storing a computer program, which when executed by a processor, causes the processor to perform the steps of the method for determining text categories provided in any of the embodiments of the present application.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of determining a text category, comprising:

extracting keywords of a text to be processed, and determining the weight of each keyword, wherein the keywords are words in the text to be processed and used for representing the theme of the text to be processed, and the weight represents the importance degree of the corresponding keyword to the text to be processed;

obtaining semantic description information corresponding to each keyword, wherein the semantic description information is used for helping to understand meanings expressed by the keywords;

determining a first degree of correlation between each keyword and each candidate category according to the semantic description information and category description information of each candidate category, wherein the category description information is used for reflecting the characteristic information of the candidate category, and the first degree of correlation is a metric value used for measuring the matching degree between the keyword and the candidate category;

determining second relevance of the text to be processed and each candidate category according to the weight of each keyword and each first relevance, wherein the second relevance is a metric for measuring the matching degree between the text to be processed and the candidate category;

2. The method according to claim 1, wherein the extracting keywords of the text to be processed comprises:

performing word segmentation processing on the text to be processed to obtain a plurality of first word segments of the text to be processed;

removing first participles belonging to a target filtering word bank from each first participle to obtain one or more second participles; the second segmentation comprises the first segmentation left after the removal;

obtaining keywords of the text to be processed according to the second word segmentation;

the target filtering word bank comprises a filtering word bank corresponding to a target data source, and the target data source comprises a data source to which the text to be processed belongs.

3. The method of claim 2, wherein the manner of constructing the target filtering word bank comprises:

performing word segmentation processing on each text belonging to the target data source to obtain a plurality of third words;

respectively determining a first proportion corresponding to each third participle; the first proportion corresponding to the third participle comprises: the target data source comprises the proportion of the number of texts containing the third word segmentation to the total number of texts of the target data source;

and constructing the target filtering word bank according to the third participle of which the first proportion exceeds a first proportion threshold value.

4. The method of claim 3, wherein the manner of determining the first proportion threshold comprises:

determining a fourth word segmentation from each third word segmentation according to the current proportion threshold; the fourth participle comprises a third participle of which the first ratio is equal to or greater than the current ratio threshold;

determining the residual word number respectively corresponding to each text belonging to the target data source; the residual word number corresponding to the text is the number of the third participles left after the fourth participle is removed from each third participle of the text;

determining a second proportion of the number of texts of which the remaining number of words is equal to or greater than the threshold number of words to the total number of texts belonging to the target data source;

when the second proportion does not exceed a second proportion threshold, determining the current proportion threshold as the first proportion threshold;

and when the second proportion exceeds the second proportion threshold, updating the current proportion threshold according to a down-adjustment value, and returning to the step of determining fourth participles from the third participles according to the current proportion threshold.
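
The iterative search in claim 4 might be sketched as follows. Several details are assumptions made for illustration: the down-adjustment is modeled as subtracting a fixed `step`, remaining words are counted as tokens (duplicates included), and a guard stops the loop before the threshold reaches zero. The claim itself does not specify these points, and the function name is hypothetical.

```python
from collections import Counter

def choose_first_ratio_threshold(segmented_texts, start_threshold, step,
                                 word_count_threshold, second_ratio_threshold):
    """Lower the proportion threshold until few enough texts keep many words."""
    total = len(segmented_texts)
    if total == 0:
        raise ValueError("no texts in the target data source")
    doc_count = Counter()
    for words in segmented_texts:
        doc_count.update(set(words))
    first_ratio = {w: c / total for w, c in doc_count.items()}

    current = start_threshold
    while True:
        # Fourth participles: words at or above the current threshold.
        fourth = {w for w, r in first_ratio.items() if r >= current}
        # Texts that still hold enough words after removing fourth participles.
        enough = sum(
            1 for words in segmented_texts
            if sum(1 for w in words if w not in fourth) >= word_count_threshold
        )
        second_ratio = enough / total
        # Assumed guard: stop rather than let the threshold drop to zero or below.
        if second_ratio <= second_ratio_threshold or current - step <= 0:
            return current
        current -= step  # assumed form of the down-adjustment

# Hypothetical segmented texts; values chosen to be exact in binary floating point.
threshold_texts = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"], ["a", "b", "f"]]
chosen = choose_first_ratio_threshold(threshold_texts, 1.0, 0.25, 2, 0.5)
```

In this example the threshold 1.0 removes only "a", leaving three of four texts with at least two words, so the threshold is lowered to 0.75, at which point "b" is also removed and the loop stops.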

5. The method according to claim 2, wherein obtaining the keywords of the text to be processed according to the second participles comprises:

performing permutation and combination on the second participles to obtain fifth participles; each fifth participle comprises at least two consecutive adjacent second participles;

determining a sixth participle from each of the fifth participles; the sixth participle comprises a fifth participle belonging to an existing entry;

determining a seventh participle from each of the sixth participles; a seventh participle is a sixth participle that is not contained in any other sixth participle;

and obtaining the keywords of the text to be processed according to the seventh participles.
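
One way to picture claim 5 is the sketch below. It assumes the existing entries are available as a plain set of strings and decides containment by word position (one span lying inside another); both choices, along with the function name and sample data, are illustrative assumptions rather than details stated in the patent.

```python
def extract_compound_keywords(second_words, known_entries, joiner=" "):
    """Return maximal runs of adjacent words that match an existing entry."""
    n = len(second_words)
    # Fifth participles: every run of at least two adjacent words, as (start, end).
    spans = [(i, j) for i in range(n) for j in range(i + 2, n + 1)]
    # Sixth participles: runs whose joined text is an existing entry.
    sixth = [(i, j) for i, j in spans
             if joiner.join(second_words[i:j]) in known_entries]
    # Seventh participles: runs not covered by any other matching run.
    seventh = [
        s for s in sixth
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in sixth)
    ]
    return [joiner.join(second_words[i:j]) for i, j in seventh]

# Hypothetical input: filtered words and a set of known entries.
words = ["new", "york", "city", "marathon"]
entries = {"new york", "new york city", "york city", "city marathon"}
compounds = extract_compound_keywords(words, entries)
```

Here "new york" and "york city" are dropped because they lie inside "new york city", while "city marathon" is kept because no other matching run covers it. For unsegmented scripts such as Chinese, `joiner=""` would be the natural choice.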

6. The method according to claim 1, wherein the obtaining semantic description information corresponding to the keywords comprises:

acquiring network search information corresponding to each keyword; the network search information corresponding to the keyword is obtained by searching the keyword through network search service;

and obtaining semantic description information respectively corresponding to the keywords according to the network search information respectively corresponding to the keywords.

7. The method of claim 6, wherein the obtaining of the web search information corresponding to each of the keywords comprises:

respectively searching candidate keywords corresponding to the keywords in a local information base; the local information base records the matching relation between candidate keywords and candidate search information, and the candidate search information is obtained by searching corresponding candidate keywords through the network search service;

when the candidate keywords corresponding to the keywords are found, obtaining network search information corresponding to the keywords according to the candidate search information matched with the found candidate keywords;

and when the candidate keyword corresponding to the keyword is not found, calling the network search service to search the keyword to obtain network search information corresponding to the keyword.
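
The lookup order in claim 7 amounts to checking a local store before calling the search service. The sketch below assumes the local information base behaves like a dictionary and the network search service is a callable; `get_search_info` and `fake_search_service` are hypothetical stand-ins, not real APIs.

```python
def get_search_info(keyword, local_info_base, search_service):
    """Return cached search information if present, else query the service."""
    cached = local_info_base.get(keyword)
    if cached is not None:
        return cached               # candidate keyword found locally
    return search_service(keyword)  # fall back to the network search service

# Hypothetical stand-in for a network search service; records each call.
service_calls = []

def fake_search_service(keyword):
    service_calls.append(keyword)
    return "search results for " + keyword

info_base = {"apple": "cached results for apple"}
hit = get_search_info("apple", info_base, fake_search_service)
miss = get_search_info("pear", info_base, fake_search_service)
```

Only the second lookup reaches the service, which is the point of the local information base: repeated keywords do not trigger repeated network searches.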

8. The method of claim 1, further comprising:

acquiring a priority coefficient of each candidate category;

wherein determining the category to which the text to be processed belongs from the candidate categories according to each second relevance comprises:

determining fifth relevance of the text to be processed and each candidate category according to each second relevance and the priority coefficient of each candidate category;

and determining the category to which the text to be processed belongs from the candidate categories according to the fifth relevance.
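
Claim 8 does not state how the second relevance and the priority coefficient are combined. The sketch below assumes a simple product and selects the top-scoring categories; the function name, the default coefficient of 1.0, and the sample scores are hypothetical.

```python
def pick_category(second_relevance, priority, top_k=1):
    """Scale each category score by its priority coefficient and rank."""
    # Assumed combination: fifth relevance = second relevance * priority coefficient.
    fifth = {cat: score * priority.get(cat, 1.0)
             for cat, score in second_relevance.items()}
    ranked = sorted(fifth, key=fifth.get, reverse=True)
    return ranked[:top_k], fifth

# Hypothetical scores and coefficients.
second_scores = {"sports": 0.6, "finance": 0.5}
priority_coefficients = {"sports": 1.0, "finance": 1.5}
chosen_categories, fifth_scores = pick_category(second_scores, priority_coefficients)
```

With these numbers "finance" overtakes "sports" once its priority coefficient is applied, which shows how a coefficient can favor categories that matter more to a given business scenario.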

9. The method according to any one of claims 1 to 8, wherein the determining a first degree of relevance between each keyword and each candidate category according to each semantic description information and the category description information of each candidate category comprises:

determining a third degree of correlation between each semantic description information and each candidate category according to each semantic description information and the category name of each candidate category;

and determining first relevance of each keyword and each candidate category according to each third relevance.

10. The method of claim 9, wherein determining a third degree of correlation between each semantic description information and each candidate category according to each semantic description information and a category name of each candidate category comprises:

determining common words of the semantic description information and the category names of the candidate categories according to the semantic description information and the category names of the candidate categories;

determining target common words of the semantic description information and the category names of the candidate categories from the common words of the semantic description information and the category names of the candidate categories;

determining a third proportion of the total word length of the target common words of each semantic description information and the category name of each candidate category to the total word length of the category name of that candidate category;

determining a third degree of correlation between each semantic description information and the category name of each candidate category according to each third proportion, a first word frequency, in the corresponding semantic description information, of the target common words of each semantic description information and the category name of each candidate category, and a first inverse document frequency of those target common words;

wherein a target common word of the semantic description information and the category name of a candidate category is a common word that is not contained in any other common word of the semantic description information and the category name of that candidate category.
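
The claim names the inputs to the third degree of correlation but not the formula. The sketch below is one possible reading under stated assumptions: common words are modeled as character substrings shared by the category name and the semantic description, word frequency is a raw occurrence count, inverse document frequencies are supplied in a dictionary, and the factors are multiplied. The function name and all of these choices are hypothetical.

```python
def third_relevance(description, category_name, idf, min_len=2):
    """Score a description against a category name via maximal shared substrings."""
    name = category_name
    # Common words: substrings of the category name that occur in the description.
    common = {
        name[i:j]
        for i in range(len(name))
        for j in range(i + min_len, len(name) + 1)
        if name[i:j] in description
    }
    # Target common words: those not contained in any other common word.
    target = [w for w in common if not any(w != o and w in o for o in common)]
    if not target:
        return 0.0
    # Third proportion: length of target common words over length of the name.
    third_ratio = sum(len(w) for w in target) / len(name)
    # Assumed weighting: raw count in the description times the supplied IDF.
    weighted = sum(description.count(w) * idf.get(w, 1.0) for w in target)
    return third_ratio * weighted

# Hypothetical description text and IDF table.
description_text = "football match football news"
score_football = third_relevance(description_text, "football", {"football": 1.5})
score_unrelated = third_relevance("stock market", "xyz", {})
```

Substring matching suits unsegmented text such as Chinese category names; for space-delimited languages a word-level comparison would likely be a better fit, since short character overlaps can produce spurious matches.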

11. The method of claim 9, further comprising:

determining a fourth degree of correlation between each semantic description information and each candidate category according to each semantic description information, a predetermined category associated word of each candidate category and a predetermined correlation coefficient between the predetermined category associated word of each candidate category and the corresponding candidate category;

determining the first relevance between each keyword and each candidate category according to each third relevance, including:

and determining first relevance of each keyword and each candidate category according to each third relevance and each fourth relevance.

12. The method of claim 11, wherein the determining a fourth correlation degree between each semantic description information and each candidate category according to each semantic description information, a predetermined category associated word of each candidate category, and a predetermined correlation coefficient between a predetermined category associated word of each candidate category and the corresponding candidate category comprises:

and determining fourth degree of correlation between each semantic description information and each candidate category according to a second word frequency of a predetermined category associated word of each candidate category in each semantic description information, a second inverse document frequency of the predetermined category associated word of each candidate category and a predetermined correlation coefficient between the predetermined category associated word of each candidate category and the corresponding candidate category.
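
Claim 12 lists three factors for the fourth degree of correlation without giving the combining formula. The sketch below assumes a sum over associated words of word frequency times inverse document frequency times correlation coefficient, with word frequency taken relative to the description length. The function name and data are hypothetical.

```python
from collections import Counter

def fourth_relevance(description_words, associated_words, idf):
    """Sum tf * idf * coefficient over a category's predetermined associated words."""
    total = len(description_words)
    if total == 0:
        return 0.0
    counts = Counter(description_words)
    return sum(
        (counts[word] / total) * idf.get(word, 1.0) * coefficient
        for word, coefficient in associated_words.items()
    )

# Hypothetical segmented description, associated words with coefficients, and IDF.
description_tokens = ["goal", "match", "goal", "team"]
sports_associated = {"goal": 0.5, "referee": 1.0}
idf_table = {"goal": 2.0}
score_sports = fourth_relevance(description_tokens, sports_associated, idf_table)
```

Associated words let a category match descriptions that never mention the category name itself, for example a "sports" category matching text that only talks about goals and referees.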

13. An apparatus for determining a text category, comprising:

the keyword processing module is used for extracting keywords of a text to be processed and determining the weight of each keyword, wherein the keywords are words in the text to be processed that represent the main idea of the text to be processed, and the weight represents the importance degree of the corresponding keyword to the text to be processed;

a semantic description information acquisition module, configured to acquire semantic description information corresponding to each of the keywords, where the semantic description information is information used to help understand meanings expressed by the keywords;

a first relevancy determination module, configured to determine, according to each semantic description information and category description information of each candidate category, first relevancy between each keyword and each candidate category, where the category description information is information reflecting characteristics of the candidate category, and the first relevancy is a metric used to measure a matching degree between the keyword and the candidate category;

a second relevance determining module, configured to determine second relevance between the text to be processed and each candidate category according to the weight of each keyword and each first relevance, where the second relevance is a metric for measuring a matching degree between the text to be processed and the candidate category;

and the text category determining module is used for determining the category to which the text to be processed belongs from the candidate categories according to each second relevance.
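
The last two modules of claim 13 can be illustrated with a short sketch. The claim says the second relevance is determined from the keyword weights and the first relevance values; the weighted sum used here is an assumption, as are the function name and the sample numbers.

```python
def second_relevance(keyword_weights, first_relevance):
    """Aggregate per-keyword category scores into per-text category scores."""
    totals = {}
    for keyword, weight in keyword_weights.items():
        for category, score in first_relevance.get(keyword, {}).items():
            # Assumed aggregation: weight each keyword's score and sum per category.
            totals[category] = totals.get(category, 0.0) + weight * score
    return totals

# Hypothetical keyword weights and first relevance values.
weights = {"nba": 0.5, "stock": 0.25}
first_scores = {
    "nba": {"sports": 1.0, "finance": 0.0},
    "stock": {"sports": 0.0, "finance": 1.0},
}
text_scores = second_relevance(weights, first_scores)
best_category = max(text_scores, key=text_scores.get)
```

Because the heavier keyword "nba" points to "sports", that category wins even though "stock" matches "finance" equally well at the keyword level.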

14. The apparatus of claim 13, wherein the keyword processing module comprises a first participle obtaining unit, a second participle obtaining unit and a keyword obtaining unit;

the first participle obtaining unit is used for carrying out word segmentation processing on the text to be processed to obtain a plurality of first participles of the text to be processed;

the second participle obtaining unit is used for removing the first participles belonging to the target filtering word bank from the first participles to obtain one or more second participles; the second participles comprise the first participles left after the removal;

the keyword obtaining unit is used for obtaining keywords of the text to be processed according to the second participles; the target filtering word bank comprises a filtering word bank corresponding to a target data source, and the target data source comprises a data source to which the text to be processed belongs.

15. The apparatus according to claim 14, wherein the apparatus for determining the text category further comprises a third participle obtaining module, a first proportion determining module and a target filtering word bank constructing module;

the third participle obtaining module is used for carrying out word segmentation processing on each text belonging to the target data source to obtain a plurality of third participles;

the first proportion determining module is configured to determine a first proportion corresponding to each third participle respectively; the first proportion corresponding to a third participle is the proportion of the number of texts in the target data source that contain the third participle to the total number of texts of the target data source;

and the target filtering word bank constructing module is used for constructing the target filtering word bank according to the third participles of which the first proportion exceeds a first proportion threshold value.

16. The apparatus of claim 15, wherein the apparatus for determining the text category further comprises a first proportion threshold determining module;

the first proportion threshold determining module is configured to: determine fourth participles from the third participles according to a current proportion threshold, the fourth participles comprising the third participles whose first proportion is equal to or greater than the current proportion threshold; determine the remaining number of words corresponding to each text belonging to the target data source, the remaining number of words corresponding to a text being the number of third participles of the text left after the fourth participles are removed; determine a second proportion of the number of texts whose remaining number of words is equal to or greater than the threshold number of words to the total number of texts belonging to the target data source; determine the current proportion threshold as the first proportion threshold when the second proportion does not exceed a second proportion threshold; and, when the second proportion exceeds the second proportion threshold, update the current proportion threshold according to a down-adjustment value and return to the step of determining fourth participles from the third participles according to the current proportion threshold.

17. The apparatus according to claim 14, wherein the keyword obtaining unit comprises a fifth participle obtaining subunit, a sixth participle obtaining subunit, a seventh participle obtaining subunit, and a keyword obtaining subunit;

the fifth participle obtaining subunit is used for performing permutation and combination on the second participles to obtain fifth participles; each fifth participle comprises at least two consecutive adjacent second participles;

the sixth participle obtaining subunit is configured to determine a sixth participle from each of the fifth participles; the sixth participle comprises a fifth participle belonging to an existing entry;

the seventh participle obtaining subunit is configured to determine a seventh participle from each of the sixth participles; a seventh participle is a sixth participle that is not contained in any other sixth participle;

and the keyword obtaining subunit is configured to obtain the keywords of the text to be processed according to the seventh participles.

18. The apparatus according to claim 13, wherein the semantic description information acquisition module comprises a web search information acquisition unit and a semantic description information acquisition unit;

the network search information acquisition unit is used for acquiring network search information corresponding to each keyword; the network search information corresponding to the keyword is obtained by searching the keyword through network search service;

the semantic description information acquisition unit is used for respectively obtaining each semantic description information corresponding to each keyword according to the network search information corresponding to each keyword.

19. The apparatus of claim 18, wherein the network search information obtaining unit comprises a candidate keyword searching subunit, a network search information reading subunit, and a network search information searching subunit;

the candidate keyword searching subunit is configured to search, in a local information base, candidate keywords corresponding to the keywords respectively; the local information base records the matching relation between candidate keywords and candidate search information, and the candidate search information is obtained by searching corresponding candidate keywords through the network search service;

the network search information reading subunit is used for obtaining the network search information corresponding to the keyword according to the candidate search information matched with the searched candidate keyword when the candidate keyword corresponding to the keyword is searched;

and the network search information searching subunit is used for calling the network search service to search the keyword when the candidate keyword corresponding to the keyword is not found, so as to obtain the network search information corresponding to the keyword.

20. The apparatus of claim 13, wherein the apparatus for determining the text category further comprises a priority coefficient obtaining module;

the priority coefficient acquisition module is used for acquiring the priority coefficient of each candidate category;

the text category determining module comprises a fifth relevancy determining unit and a text category determining unit;

the fifth relevance determining unit is configured to determine fifth relevance of the text to be processed and each of the candidate categories according to each of the second relevance and the priority coefficient of each of the candidate categories;

and the text category determining unit is configured to determine, according to each fifth correlation degree, a category to which the text to be processed belongs from each candidate category.

21. The apparatus according to any one of claims 13 to 20, wherein the first relevancy determination module comprises a third relevance determining unit and a first relevance determining unit;

the third relevance determining unit is configured to determine, according to each semantic description information and a category name of each candidate category, a third relevance between each semantic description information and each candidate category;

and the first relevance determining unit is used for determining the first relevance of each keyword and each candidate category according to each third relevance.

22. The apparatus of claim 21, wherein the third relevance determining unit comprises a common word determining subunit, a target common word determining subunit, a third proportion determining subunit, and a third relevance determining subunit;

the common word determining subunit is configured to determine, according to each semantic description information and the category name of each candidate category, common words of each semantic description information and the category name of each candidate category;

the target common word determining subunit is configured to determine, from the common words of each semantic description information and the category name of each candidate category, target common words of each semantic description information and the category name of each candidate category;

the third proportion determining subunit is configured to determine a third proportion of the total word length of the target common words of each semantic description information and the category name of each candidate category to the total word length of the category name of that candidate category;

the third relevance determining subunit is configured to determine a third relevance between each semantic description information and the category name of each candidate category according to each third proportion, a first word frequency, in the corresponding semantic description information, of the target common words of each semantic description information and the category name of each candidate category, and a first inverse document frequency of those target common words; a target common word of the semantic description information and the category name of a candidate category is a common word that is not contained in any other common word of the semantic description information and the category name of that candidate category.

23. The apparatus of claim 21, wherein the apparatus for determining the text category further comprises a fourth relevance determining module;

the fourth relevance determining module is configured to determine a fourth relevance between each semantic description information and each candidate category according to each semantic description information, a predetermined category associated word of each candidate category, and a predetermined correlation coefficient between the predetermined category associated word of each candidate category and the corresponding candidate category;

and the first relevance determining unit is configured to determine, according to each third relevance and each fourth relevance, a first relevance between each keyword and each candidate category.

24. The apparatus according to claim 23, wherein the fourth relevance determining module is configured to determine the fourth relevance between each semantic description information and each candidate category according to a second word frequency of the predetermined category associated word of each candidate category in each semantic description information, a second inverse document frequency of the predetermined category associated word of each candidate category, and the predetermined correlation coefficient between the predetermined category associated word of each candidate category and the corresponding candidate category.

25. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 12.

26. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 12.