CN111353020A - Method, device, computer equipment and storage medium for mining text data - Google Patents


Info

Publication number
CN111353020A
CN111353020A (application CN202010124827.9A)
Authority
CN
China
Prior art keywords
character string
target
preset
candidate
target character
Prior art date
Legal status
Granted
Application number
CN202010124827.9A
Other languages
Chinese (zh)
Other versions
CN111353020B (en)
Inventor
王文超
阳任科
郏昕
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010124827.9A
Publication of CN111353020A
Application granted
Publication of CN111353020B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G06F 16/35 - Clustering; Classification
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a method, a device, computer equipment and a storage medium for mining text data. The method comprises the following steps: acquiring text data comprising candidate character strings; calculating a score for each candidate character string according to a word-forming function to obtain a component word score; screening out character strings whose component word score is greater than or equal to a first preset threshold from the candidate character strings as first candidate character strings; searching the first candidate character strings for character strings matching a preset dictionary as first target character strings, and taking the unmatched character strings as second character strings; finding character strings whose component word score is greater than a second preset threshold from the second character strings as second target character strings; and taking the category found for a target character string in a preset knowledge base as the target category of that target character string. The works are divided automatically, the divided character strings are selected as target character strings according to their component word scores, and classification is carried out in real time, so that the character strings are divided quickly and classified well.

Description

Method, device, computer equipment and storage medium for mining text data
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for mining text data, a computer device, and a storage medium.
Background
Text data corresponds to works such as literary works and current-affairs reports. Existing methods for analyzing props and characters in such works are basically based on dictionary matching. A dictionary matching method requires the dictionary to contain all relevant words and phrases, yet conventional prop dictionaries and actor dictionaries cannot cover 100% of the prop elements and character elements of a work. As a result, prop elements and character elements cannot be accurately identified when a work is mined, which degrades the mining result. Manually labeling the text data allows prop elements and character elements to be identified accurately, but the efficiency is low.
Disclosure of Invention
In order to solve the technical problem, the application provides a method, a device, a computer device and a storage medium for mining text data.
In a first aspect, the present application provides a method for mining text data, including:
acquiring text data, wherein the text data comprises a plurality of candidate character strings;
calculating a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string;
screening out character strings whose component word score is greater than or equal to a first preset threshold from the candidate character strings as first candidate character strings;
searching the first candidate character strings for a character string matching a character string in a preset dictionary as a first target character string, and taking an unmatched character string as a second character string;
finding out a character string whose component word score is greater than a second preset threshold from the second character strings as a second target character string, wherein the first target character string and the second target character string constitute the target character strings;
searching whether the target character string is located in a preset knowledge base or not, wherein the preset knowledge base comprises the character string and a corresponding category;
and when the target character string is located in the preset knowledge base, taking the category of the target character string in the preset knowledge base as the target category of the target character string.
In a second aspect, the present application provides an apparatus for mining text data, comprising:
the data acquisition module is used for acquiring text data, and the text data comprises a plurality of candidate character strings;
the component word score calculation module is used for calculating a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string;
the first screening module is used for screening out character strings whose component word score is greater than or equal to a first preset threshold from the candidate character strings as first candidate character strings;
the dictionary query module is used for searching the first candidate character strings for a character string matching a character string in a preset dictionary as a first target character string, and taking an unmatched character string as a second character string;
the second screening module is used for finding out a character string whose component word score is greater than a second preset threshold from the second character strings as a second target character string, the first target character string and the second target character string constituting the target character strings;
the knowledge base query module is used for searching whether the target character string is located in a preset knowledge base or not, and the preset knowledge base comprises the character string and a corresponding category;
and the classification module is used for taking the category of the target character string in the preset knowledge base as the target category of the target character string when the target character string is positioned in the preset knowledge base.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring text data, wherein the text data comprises a plurality of candidate character strings;
calculating a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string;
screening out character strings whose component word score is greater than or equal to a first preset threshold from the candidate character strings as first candidate character strings;
searching the first candidate character strings for a character string matching a character string in a preset dictionary as a first target character string, and taking an unmatched character string as a second character string;
finding out a character string whose component word score is greater than a second preset threshold from the second character strings as a second target character string, wherein the first target character string and the second target character string constitute the target character strings;
searching whether the target character string is located in a preset knowledge base or not, wherein the preset knowledge base comprises the character string and a corresponding category;
and when the target character string is located in the preset knowledge base, taking the category of the target character string in the preset knowledge base as the target category of the target character string.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring text data, wherein the text data comprises a plurality of candidate character strings;
calculating a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string;
screening out character strings whose component word score is greater than or equal to a first preset threshold from the candidate character strings as first candidate character strings;
searching the first candidate character strings for a character string matching a character string in a preset dictionary as a first target character string, and taking an unmatched character string as a second character string;
finding out a character string whose component word score is greater than a second preset threshold from the second character strings as a second target character string, wherein the first target character string and the second target character string constitute the target character strings;
searching whether the target character string is located in a preset knowledge base or not, wherein the preset knowledge base comprises the character string and a corresponding category;
and when the target character string is located in the preset knowledge base, taking the category of the target character string in the preset knowledge base as the target category of the target character string.
According to the method, the device, the computer equipment and the storage medium for mining text data, text data comprising a plurality of candidate character strings is acquired; a score is calculated for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string; character strings whose component word score is greater than or equal to a first preset threshold are screened out from the candidate character strings as first candidate character strings; character strings matching a character string in a preset dictionary are searched out from the first candidate character strings as first target character strings, and unmatched character strings are taken as second character strings; character strings whose component word score is greater than a second preset threshold are found out from the second character strings as second target character strings, the first target character strings and the second target character strings constituting the target character strings; whether a target character string is located in a preset knowledge base, which comprises character strings and corresponding categories, is searched; and when the target character string is located in the preset knowledge base, the category of the target character string in the preset knowledge base is taken as the target category of the target character string. The works are divided automatically, the divided character strings are selected as target character strings according to their component word scores, and classification is carried out in real time, so that the character strings are divided quickly and classified well.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a diagram of an application environment for a method of mining text data in one embodiment;
FIG. 2 is a flow diagram that illustrates a method for mining text data, according to one embodiment;
FIG. 3 is a flow diagram illustrating a method for mining text data in an exemplary embodiment;
FIG. 4 is a block diagram of an apparatus for mining text data in one embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
FIG. 1 is a diagram of an application environment for a method of mining text data in one embodiment. Referring to fig. 1, the method of mining text data is applied to a system for mining text data. The system for mining text data includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 or the server 120 acquires text data, the text data including a plurality of candidate character strings; calculates a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string; screens out character strings whose component word score is greater than or equal to a first preset threshold from the candidate character strings as first candidate character strings; searches the first candidate character strings for a character string matching a character string in a preset dictionary as a first target character string, and takes an unmatched character string as a second character string; finds out a character string whose component word score is greater than a second preset threshold from the second character strings as a second target character string, the first target character string and the second target character string constituting the target character strings; searches whether the target character string is located in a preset knowledge base, the preset knowledge base comprising character strings and corresponding categories; and, when the target character string is located in the preset knowledge base, takes the category of the target character string in the preset knowledge base as the target category of the target character string.
The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
As shown in FIG. 2, in one embodiment, a method of mining text data is provided. The embodiment is mainly illustrated by applying the method to the terminal 110 (or the server 120) in fig. 1. Referring to fig. 2, the method for mining text data specifically includes the following steps:
in step S201, text data is acquired.
In the present embodiment, the text data includes a plurality of candidate character strings.
Specifically, the text data includes literary works, news reports, and the like, wherein the literary works include, but are not limited to, scripts and novels. The candidate character strings are character strings obtained by dividing the text data according to a preset rule, and include character strings consisting of a single character, character strings consisting of a single punctuation mark, character strings consisting of a plurality of characters, character strings consisting of characters and punctuation marks, and the like. When dividing the text data into character strings, the text data is divided with a preset sliding window, and the window length of the sliding window can be customized, for example, to 4 or 5 characters.
In a specific embodiment, the text data is "like to eat apples" and 5 is used as the window length of the sliding window; the obtained candidate character strings then include every substring of at most five characters, such as "like", "eat", "apples", "eat apples" and "like to eat apples".
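As a rough illustration of this sliding-window division, the following Python sketch enumerates every substring up to the window length as a candidate character string (the function name, the max_len parameter and the set-based return value are assumptions for illustration, not part of the application):

```python
def extract_candidates(text, max_len=5):
    """Enumerate every substring of length 1..max_len as a candidate character string."""
    candidates = set()
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            if start + length <= len(text):
                candidates.add(text[start:start + length])
    return candidates

# extract_candidates("like to eat apples", max_len=5) returns the single characters
# plus every substring of at most five characters.
```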
Step S202, calculating a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string.
Specifically, the preset word-forming function includes custom functions and common word-forming functions. Common word-forming functions include mutual information, left and right entropy, extensions of mutual information, functions combining mutual information with left and right entropy, and the like. Mutual information measures the internal cohesion between two adjacent character strings: the larger the mutual information, the higher the internal cohesion of the two character strings and the larger the probability that they form a word. Left entropy and right entropy characterize how freely a character string composed of two character units is used: the larger the entropy, the richer the words surrounding it. The component word score is calculated from the computed mutual information and left and right entropy, for example by directly using the mutual information as the component word score, using the minimum of the left and right entropy as the component word score, or using the product of the mutual information and the minimum of the left and right entropy as the component word score.
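To make the intuition concrete, the toy Python calculation below (all probabilities are invented for illustration) shows how mutual information and left/right entropy behave: a higher joint probability relative to the marginals raises the mutual information, and a more uniform neighbor distribution raises the entropy:

```python
import math

# Invented corpus statistics for two adjacent character units x and y.
p_x, p_y, p_xy = 0.010, 0.008, 0.006

pmi = math.log(p_xy / (p_x * p_y))          # larger PMI => stronger internal cohesion

# Invented frequency distributions of the strings seen to the left/right of "xy".
left_neighbors = [0.5, 0.3, 0.2]
right_neighbors = [0.25, 0.25, 0.25, 0.25]
h_left = -sum(p * math.log(p) for p in left_neighbors)
h_right = -sum(p * math.log(p) for p in right_neighbors)   # more uniform => larger entropy

print(pmi, h_left, h_right)
```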
Step S203, a character string with a component word score greater than or equal to a first preset threshold is screened out from the candidate character string as a first candidate character string.
Specifically, the first preset threshold is a preset critical value for screening character strings; candidate character strings whose component word score is greater than or equal to the first preset threshold are selected as first candidate character strings. The first preset threshold may be obtained by analyzing the preset word-forming function, or may be an empirical value set by a technician.
In step S204, a character string matching a character string in a preset dictionary is searched from the first candidate character string as a first target character string.
Specifically, the first candidate character strings are matched against a preset dictionary: a first candidate character string found in the preset dictionary is used as a first target character string, and a first candidate character string that cannot be found in the preset dictionary is used as a second character string. The preset dictionary may be a common dictionary used for dictionary matching, for example a Chinese dictionary containing about 3.9818 million entries.
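A minimal sketch of this dictionary-matching split, assuming the preset dictionary has already been loaded into a hash-based set (the function and variable names are illustrative):

```python
def split_by_dictionary(first_candidates, dictionary_words):
    """Split the first candidate strings into dictionary matches and the rest."""
    dictionary = set(dictionary_words)          # hash table: O(1) membership test
    first_targets = [w for w in first_candidates if w in dictionary]
    second_strings = [w for w in first_candidates if w not in dictionary]
    return first_targets, second_strings
```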
In step S205, a character string whose component word score is greater than a second preset threshold is found from the second character strings as a second target character string.
In the present specific embodiment, the first target character string and the second target character string constitute a target character string.
Step S206, find whether the target character string is located in the preset knowledge base.
In this embodiment, the predetermined knowledge base includes character strings and corresponding categories.
Step S207, when the target character string is located in the preset knowledge base, taking the category of the target character string in the preset knowledge base as the target category of the target character string.
Specifically, the second preset threshold is greater than the first preset threshold. The second character strings are screened using the second preset threshold and their component word scores, and the character strings among the second character strings whose component word score is greater than the second preset threshold are used as second target character strings. The first target character strings and the second target character strings constitute the target character strings. The preset knowledge base refers to a common knowledge graph, which contains relations among different words, such as the category to which a word belongs; for example, the category "vehicle" includes cars, trucks, jeeps and the like, and a car further corresponds to different models and brands. Common knowledge bases include the HowNet knowledge base. The category of each target character string is searched in the knowledge base, and the category found in the knowledge base is taken as the target category of the target character string, wherein the target category comprises two categories, namely props and actor characters.
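The second screening and the knowledge-base lookup could be sketched as follows, with the knowledge base modeled as a plain mapping from character string to category standing in for HowNet (the names and the dict-based model are assumptions):

```python
def select_second_targets(second_strings, scores, second_threshold):
    """Keep the second character strings whose component word score exceeds the second threshold."""
    return [w for w in second_strings if scores[w] > second_threshold]

def classify_with_knowledge_base(target_strings, knowledge_base):
    """Map each target string to its category when it exists in the knowledge base."""
    return {w: knowledge_base[w] for w in target_strings if w in knowledge_base}
```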
According to the method for mining text data, text data comprising a plurality of candidate character strings is acquired; a score is calculated for each candidate character string according to a preset word-forming function to obtain a component word score; character strings whose component word score is greater than or equal to a first preset threshold are screened out as first candidate character strings; character strings matching a preset dictionary are searched out from the first candidate character strings as first target character strings, and unmatched character strings are taken as second character strings; character strings whose component word score is greater than a second preset threshold are found out from the second character strings as second target character strings, the first and second target character strings constituting the target character strings; whether a target character string is located in a preset knowledge base, which comprises character strings and corresponding categories, is searched; and when the target character string is located in the preset knowledge base, its category in the preset knowledge base is taken as its target category. The text data is divided directly, the character strings are screened by their component word scores, and the screened character strings are classified, with a dictionary and a knowledge base used during classification to improve accuracy and real-time performance. Literary works and current-affairs reports contain specific prop names and actor names, as well as special props created by the screenwriter, such as a "Tornado Sword", a "Tai'e Sword", an "Eight Trigrams" axe, and the like. Props and actor characters in a work are divided automatically, target character strings are determined from the component word scores of the divided character strings, and the target character strings are classified in real time, so that the character strings are classified well while the division speed is guaranteed, that is, classification accuracy is guaranteed. The identified props and actor characters of a script can be applied to production planning, and the actor characters can be applied to character-relationship analysis in script evaluation.
In an embodiment, the method for mining text data further includes:
and S208, when the target character string is not located in the preset knowledge base and contains a plurality of characters, performing word segmentation on the target character string to obtain a plurality of character units of each target character string.
In step S209, when the part of speech of the character unit located at the preset position in the character units of the target character string is a noun and the character unit located at the preset position is located in the preset knowledge base, the category of the character unit located at the preset position in the preset knowledge base is used as the target category of the target character string.
Specifically, a character refers to a single character, and a target character string containing a plurality of characters means that the target character string is composed of a plurality of single characters, for example at least two single characters. A character unit refers to a string containing one or more letters, digits or punctuation marks. Target character strings that are not located in the preset knowledge base may contain vocabulary created by the author. Word segmentation is performed on target character strings that do not exist in the preset knowledge base to obtain a plurality of character units. The character unit at a preset position of the target character string is obtained; when the part of speech of the character unit at the preset position is a noun, a character unit matching it is searched in the preset knowledge base, and the category of the matched character unit is taken as the target category of the target character string. The preset position includes, but is not limited to, the last position, the second-to-last position, and the like. If the preset position is the last position, the last character unit of the target character string is obtained; when the part of speech of the last character unit is a noun, a character unit matching the last character unit is searched in the preset knowledge base, and the category of the matched character unit is taken as the target category of the target character string.
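A hypothetical sketch of this fallback: segment the unknown string with a part-of-speech-aware segmenter and reuse the category of its last unit when that unit is a noun found in the knowledge base. jieba is used here only as an example segmenter; the application does not name a specific segmentation tool.

```python
import jieba.posseg as pseg   # example third-party segmenter, not specified by the application

def category_from_last_unit(target_string, knowledge_base):
    """Fall back to the category of the last noun unit of an out-of-knowledge-base string."""
    units = list(pseg.cut(target_string))        # pairs carrying .word and .flag (POS tag)
    if not units:
        return None
    last = units[-1]
    if last.flag.startswith("n") and last.word in knowledge_base:
        return knowledge_base[last.word]         # e.g. the category of "ninja" for "fruit ninja"
    return None
```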
In an embodiment, the method for mining text data further includes:
and step S210, when the target character string is not located in the preset knowledge base and the last character of the target character string is a preset prop type character, setting the type of the target character string as a prop.
Step S211, when the target character string is not located in the preset knowledge base, the target character string includes a plurality of characters, and the last character of the target character string is not a preset property type character, performing word segmentation on the target character string to obtain a plurality of character units of each target character string.
Specifically, if the target character string is not located in the preset knowledge base, the last character of the target character string is judged, and whether the last character is a preset property character or not is judged. The preset property characters are characters obtained by statistics in advance. And searching whether the last character of the target character string is one of the characters of the preset property types, and if so, setting the type of the target character string as the property. Otherwise, the process proceeds to step S208.
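The prop-suffix test reduces to a simple ends-with check against a statistically collected suffix set; the set below extends the "machine"/"device" examples given in the specific embodiment later and is otherwise purely illustrative:

```python
PROP_SUFFIX_CHARACTERS = ("machine", "device", "sword", "axe")   # illustrative suffix set

def is_prop_by_suffix(target_string, suffixes=PROP_SUFFIX_CHARACTERS):
    """Return True when the string ends with a character typical of prop names."""
    return target_string.endswith(suffixes)
```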
In one embodiment, the target category includes props and actor characters, and step S207 includes:
step S2071, when the category corresponding to the target character string is the first preset category, setting the category of the target character string as a prop.
Step S2072, when the category corresponding to the target character string is an actor character, setting the category of the target character string as the actor character.
Specifically, the first preset category includes common item categories in the knowledge base, such as "food", "utensils", "furniture", "vehicle" and "machine". Not all categories in the knowledge base can serve as props, so a subset of the categories in the knowledge base is selected to obtain the first preset category. Categories in the knowledge base that relate to a person or a character are directly defined as actor characters.
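A sketch of this category decision, following the intersection test used in steps S306 of the specific embodiment below; the first preset category set and the "people" marker are taken from the examples in the text, and everything else is an assumption:

```python
FIRST_PRESET_CATEGORIES = {"food", "utensils", "furniture", "vehicle", "machine"}

def decide_target_category(knowledge_base_categories):
    """Map the knowledge-base categories of a target string to 'prop' or 'actor character'."""
    tw = set(knowledge_base_categories)
    if tw & FIRST_PRESET_CATEGORIES:        # non-empty intersection with the prop categories
        return "prop"
    if tw == {"people"}:                    # only a person-related category
        return "actor character"
    return None                             # undecided; later steps handle this case
```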
In one embodiment, the candidate character strings include character strings composed of a single character and character strings composed of a plurality of character units, and step S202 includes: when the candidate character string is a character string consisting of a single character, acquiring a preset value as the component word score of the candidate character string; when the candidate character string is a character string consisting of a plurality of character units, calculating the component word score of the candidate character string by using a preset word-forming function to obtain a second score, wherein the preset word-forming function is shown in formula (1):
Score = min(HL(W), HR(W)) * PMI(x, y)    (1)
where HL(W) is the left entropy of the candidate character string W,
HL(W) = -Σ_{k=1}^{KL} pl_k * log(pl_k),
KL is the number of character strings appearing to the left of the candidate character string W, pl_k is the frequency of occurrence of the k-th character string to the left of the candidate character string W, HR(W) is the right entropy of the candidate character string W,
HR(W) = -Σ_{k=1}^{KR} pr_k * log(pr_k),
KR is the number of character strings appearing to the right of the candidate character string W, pr_k is the frequency of occurrence of the k-th character string to the right of the candidate character string W, min(HL(W), HR(W)) is the minimum of the left entropy and the right entropy, and PMI(x, y) is the mutual information of the candidate character string W,
PMI(x, y) = log(p(x, y) / (p(x) * p(y))),
where x is the character unit on the left side of the candidate character string W, y is the character unit on the right side of the candidate character string W, p(x) is the probability of x occurring alone, p(y) is the probability of y occurring alone, and p(x, y) is the probability of x and y occurring simultaneously, where x precedes y.
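A self-contained sketch of formula (1), computing the left/right entropy from observed neighbor strings and combining it with the mutual information of the two halves x and y of the candidate; the function signature and the use of raw neighbor lists are assumptions, and both neighbor lists are assumed non-empty:

```python
import math
from collections import Counter

def component_word_score(left_neighbors, right_neighbors, p_x, p_y, p_xy):
    """Score = min(HL(W), HR(W)) * PMI(x, y), as in formula (1)."""
    def entropy(neighbors):
        counts = Counter(neighbors)
        total = sum(counts.values())
        # -sum p_k * log(p_k) over the distinct neighbor strings
        return -sum((c / total) * math.log(c / total) for c in counts.values())

    hl = entropy(left_neighbors)      # HL(W)
    hr = entropy(right_neighbors)     # HR(W)
    pmi = math.log(p_xy / (p_x * p_y))
    return min(hl, hr) * pmi
```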
In a specific embodiment, the method for mining text data includes:
step S301, obtains word combinations word (candidate character strings) corresponding to the text information and a set WS of component word scores corresponding to each word combination. Each word combination word in WS has a term score, score. The calculation formula (1) of the score of the Chinese word is shown.
Step S302, selecting a character combination with score greater than or equal to 0 (a first preset threshold) to obtain a first candidate character string. And sorting according to score descending order, adding a flag attribute to each word, setting an initial flag as False, wherein the flag represents whether the word is selected as a candidate word or not, and setting the flag as True or False.
Step S303, traversing the sequenced WS according to the existing Chinese dictionary (preset dictionary), setting flag to True if word is in the existing dictionary, obtaining a first target character string, and taking the rest character strings in the first candidate character string as second character strings. And reading the dictionary to obtain a hash table, and judging whether the word in the WS exists in a Chinese dictionary or not by adopting the hash table.
Step S304, setting a word score threshold value threshold (a second preset threshold value), selecting a word with score higher than the threshold value from the second character string, setting a flag of the word to True to obtain a second target character string, and forming a target character string by the first target character string and the second target character string to obtain a word set with the flag of True, namely a set WP of the target character string.
In step S305, it is determined whether word in WP exists in HowNet. In WP, if word exists in HowNet (preset knowledge base), find the category TW of word in HowNet. If the word is not in the HowNet, step S307 is executed in step S306.
In step S306, the target category of the target character string is set. A category dictionary PD of the item is preset, such as 'food', 'appliance', 'furniture', 'vehicle', 'machine', and the like, and if TW and PD have an intersection and the intersection is not an empty set, the word is judged as the item. If there are only 'people' in TW, then word is judged as a group show.
And step S307, judging whether the word suffix is the prop suffix. And summarizing dictionaries with larger suffix probability of some prop words, such as 'machine', 'device' and the like, matching through rules, and if matching is successful, judging the words as props. If the prop suffix is found, step S308 is executed, otherwise step S309 is executed.
And step S308, determining the prop from the word.
And step S309, performing word segmentation on the word, and determining the target type of the target character string according to the word segmentation result. If the last word of the word segmentation result is the part of speech of a noun and the category of the last word is judged in HowNet according to the above, the category of the last word is selected as the category of the word because Chinese generally expresses the emphasis on the last, for example, the category of 'fruit ninja' and 'ninja' represents the category of the whole word.
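Putting steps S301 to S309 together, a hypothetical end-to-end sketch might look like the following; WS, the Chinese dictionary, HowNet and the prop category dictionary PD are all modeled as plain Python containers, and every name here is an assumption made for illustration:

```python
def mine_props_and_roles(ws, chinese_dictionary, hownet, prop_categories,
                         prop_suffixes, second_threshold):
    """ws: {word: component word score}; hownet: {word: set of categories};
    prop_categories: set of prop category names (PD); prop_suffixes: iterable of suffixes."""
    # S302: keep words with score >= 0 (first preset threshold), sorted by score descending.
    candidates = sorted((w for w, s in ws.items() if s >= 0),
                        key=lambda w: ws[w], reverse=True)
    # S303: words found in the Chinese dictionary become first target strings (flag=True).
    targets = {w for w in candidates if w in chinese_dictionary}
    # S304: remaining words whose score exceeds the second threshold become second target strings.
    targets |= {w for w in candidates
                if w not in chinese_dictionary and ws[w] > second_threshold}

    result = {}
    for word in targets:
        if word in hownet:                                  # S305-S306
            tw = hownet[word]
            if tw & prop_categories:
                result[word] = "prop"
            elif tw == {"people"}:
                result[word] = "actor character"
        elif word.endswith(tuple(prop_suffixes)):           # S307-S308
            result[word] = "prop"
        else:                                               # S309: would fall back to segmentation
            result[word] = None
    return result
```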
The system comprises four modules: calculating component word scores based on information entropy and mutual information, extracting words based on the credibility of the component word scores, mining knowledge-base categories, and judging prop-affix categories. The component word score is calculated based on information entropy and mutual information; the higher the component word score of a character combination, the higher the probability that the combination is a word in the script. A part of the candidate words is then recognized as known words through a known Chinese word library. For the other character combinations, those with a component word score above a certain threshold are selected as candidate words; person-identity words among the candidate words are mined as actor-character words and prop-category words are mined as prop words according to the HowNet knowledge base, and for prop words not in HowNet, whether they are prop words is judged by word segmentation and suffix matching. A large number of script experiments show that this script prop and actor-character mining method based on information entropy and mutual information performs well: the prop recall and accuracy exceed 90%, and the actor-character recall and accuracy exceed 80%.
The method calculates the word-forming probability of character combinations by a statistical method and performs HowNet word-category analysis combined with word segmentation to identify the props and actor characters of a script. It requires no training set and, compared with dictionary matching, has the advantage of identifying props and actor characters created by the script itself, so it is robust when applied to original works such as scripts. The algorithm can compute the props and actor characters of a complete script in a short time, so its timeliness is high compared with other algorithms.
Fig. 2 or fig. 3 is a flowchart illustrating a method of mining text data in one embodiment. It should be understood that, although the steps in the flowcharts of fig. 2 or 3 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least a portion of the steps in fig. 2 or 3 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided an apparatus 200 for mining text data, comprising:
the data obtaining module 201 is configured to obtain text data, where the text data includes a plurality of candidate character strings.
The component word score calculation module 202 is configured to calculate a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string.
The first screening module 203 is configured to screen out, from the candidate character strings, character strings whose component word score is greater than or equal to a first preset threshold as first candidate character strings.
The dictionary query module 204 is configured to search the first candidate character strings for a character string matching a character string in a preset dictionary as a first target character string, and to take an unmatched character string as a second character string.
The second screening module 205 is configured to find out, from the second character strings, a character string whose component word score is greater than a second preset threshold as a second target character string, the first target character string and the second target character string constituting the target character strings.
The knowledge base query module 206 is configured to search whether the target character string is located in a preset knowledge base, where the preset knowledge base includes the character string and a corresponding category.
The classification module 207 is configured to, when the target character string is located in the preset knowledge base, use a category of the target character string in the preset knowledge base as a target category of the target character string.
In an embodiment, the apparatus 200 for mining text data further includes:
and the word segmentation module is used for segmenting the target character string to obtain a plurality of character units of each target character string when the target character string is not located in the preset knowledge base and contains a plurality of characters.
The classification module 207 is further configured to, when the part of speech of the character unit located at the preset position in the character units of the target character string is a noun and the character unit located at the preset position is located in the preset knowledge base, use the category of the character unit located at the preset position in the preset knowledge base as the target category of the target character string.
In one embodiment, the classification module 207 is further configured to set the category of the target character string as a property when the target character string is not located in the preset knowledge base and a last character of the target character string is a preset property type character; and when the target character string is not located in the preset knowledge base, the target character string comprises a plurality of characters, and the last character of the target character string is not a preset prop character, performing word segmentation on the target character string to obtain a plurality of character units of each target character string.
In one embodiment, the classification module 207 is further configured to set, when the target character string is located in the preset knowledge base, a category of the target character string as a prop when the category corresponding to the target character string is a first preset category; and when the category corresponding to the target character string is the actor role, setting the category of the target character string as the actor role, wherein the target category comprises the props and the actor role.
In one embodiment, the component word score calculation module 202 is specifically configured to, when the candidate character string is a character string composed of a single character, acquire a preset value as the component word score of the candidate character string; and, when the candidate character string is a character string composed of a plurality of character units, calculate the component word score of the candidate character string by using a preset word-forming function to obtain a second score, wherein the preset word-forming function is:
Score = min(HL(W), HR(W)) * PMI(x, y)
where HL(W) is the left entropy of the candidate character string W,
HL(W) = -Σ_{k=1}^{KL} pl_k * log(pl_k),
KL is the number of character strings appearing to the left of the candidate character string W, pl_k is the frequency of occurrence of the k-th character string to the left of the candidate character string W, HR(W) is the right entropy of the candidate character string W,
HR(W) = -Σ_{k=1}^{KR} pr_k * log(pr_k),
KR is the number of character strings appearing to the right of the candidate character string W, pr_k is the frequency of occurrence of the k-th character string to the right of the candidate character string W, min(HL(W), HR(W)) is the minimum of the left entropy and the right entropy, and PMI(x, y) is the mutual information of the candidate character string W,
PMI(x, y) = log(p(x, y) / (p(x) * p(y))),
where x is the character unit on the left side of the candidate character string W, y is the character unit on the right side of the candidate character string W, p(x) is the probability of x occurring alone, p(y) is the probability of y occurring alone, and p(x, y) is the probability of x and y occurring simultaneously.
FIG. 5 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 (or the server 120) in fig. 1. As shown in fig. 5, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected via a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement a method of mining text data. The internal memory may also have stored therein a computer program that, when executed by the processor, causes the processor to perform a method of mining text data. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the apparatus for mining text data provided herein may be implemented in the form of a computer program that is executable on a computer device such as that shown in fig. 5. The memory of the computer device may store various program modules constituting the apparatus for mining text data, such as a data acquisition module 201, a component score calculation module 202, a first filtering module 203, a dictionary query module 204, a second filtering module 205, a knowledge base query module 206, and a classification module 207 shown in fig. 4. The program modules constitute computer programs to make the processors execute the steps of the text data mining method of the embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 5 may acquire, through the data acquisition module 201 of the apparatus for mining text data shown in fig. 4, text data including a plurality of candidate character strings. The computer device may calculate, through the component word score calculation module 202, a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string. The computer device may screen out, through the first screening module 203, character strings whose component word score is greater than or equal to a first preset threshold from the candidate character strings as first candidate character strings. The computer device may search, through the dictionary query module 204, the first candidate character strings for a character string matching a character string in a preset dictionary as a first target character string, and take an unmatched character string as a second character string. The computer device may find out, through the second screening module 205, a character string whose component word score is greater than a second preset threshold from the second character strings as a second target character string, the first target character string and the second target character string constituting the target character strings. The computer device may search, through the knowledge base query module 206, whether a target character string is located in a preset knowledge base, the preset knowledge base comprising character strings and corresponding categories. The computer device may, through the classification module 207, take the category of the target character string in the preset knowledge base as its target category when the target character string is located in the preset knowledge base.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring text data, wherein the text data comprises a plurality of candidate character strings; calculating a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string; screening out character strings whose component word score is greater than or equal to a first preset threshold from the candidate character strings as first candidate character strings; searching the first candidate character strings for a character string matching a character string in a preset dictionary as a first target character string, and taking an unmatched character string as a second character string; finding out a character string whose component word score is greater than a second preset threshold from the second character strings as a second target character string, the first target character string and the second target character string constituting the target character strings; searching whether the target character string is located in a preset knowledge base, the preset knowledge base comprising character strings and corresponding categories; and, when the target character string is located in the preset knowledge base, taking the category of the target character string in the preset knowledge base as the target category of the target character string.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the target character string is not located in the preset knowledge base and comprises a plurality of characters, performing word segmentation on the target character string to obtain a plurality of character units of each target character string; and when the part of speech of the character unit at the preset position in the character units of the target character string is a noun and the character unit at the preset position is located in the preset knowledge base, taking the category of the character unit at the preset position in the preset knowledge base as the target category of the target character string.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the target character string is not located in the preset knowledge base and the last character of the target character string is a preset prop type character, setting the type of the target character string as a prop; and when the target character string is not located in the preset knowledge base, the target character string comprises a plurality of character units, and the last character of the target character string is not a preset prop type character, performing word segmentation on the target character string to obtain a plurality of character units of each target character string.
In one embodiment, the target category includes props and actor characters, and when the target character string is located in the preset knowledge base, the category of the target character string in the preset knowledge base is taken as the target category, including: when the category corresponding to the target character string is a first preset category, setting the category of the target character string as a prop; and when the category corresponding to the target character string is the actor character, setting the category of the target character string as the actor character.
In one embodiment, the candidate character strings include character strings composed of a single character and character strings composed of a plurality of character units, and calculating a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string includes: when the candidate character string is a character string consisting of a single character, acquiring a preset value as the component word score of the candidate character string; when the candidate character string is a character string consisting of a plurality of character units, calculating the component word score of the candidate character string by using the preset word-forming function to obtain a second score, wherein the preset word-forming function is:
Score = min(HL(W), HR(W)) * PMI(x, y)
where HL(W) is the left entropy of the candidate character string W,
HL(W) = -Σ_{k=1}^{KL} pl_k * log(pl_k),
KL is the number of character strings appearing to the left of the candidate character string W, pl_k is the frequency of occurrence of the k-th character string to the left of the candidate character string W, HR(W) is the right entropy of the candidate character string W,
HR(W) = -Σ_{k=1}^{KR} pr_k * log(pr_k),
KR is the number of character strings appearing to the right of the candidate character string W, pr_k is the frequency of occurrence of the k-th character string to the right of the candidate character string W, min(HL(W), HR(W)) is the minimum of the left entropy and the right entropy, and PMI(x, y) is the mutual information of the candidate character string W,
PMI(x, y) = log(p(x, y) / (p(x) * p(y))),
where x is the character unit on the left side of the candidate character string W, y is the character unit on the right side of the candidate character string W, p(x) is the probability of x occurring alone, p(y) is the probability of y occurring alone, and p(x, y) is the probability of x and y occurring simultaneously.
In one embodiment, acquiring a preset value as the component word score of the candidate character string includes using a fixed value 0 as the component word score of the candidate character string.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, performs the steps of: acquiring text data, wherein the text data comprises a plurality of candidate character strings; calculating a score for each candidate character string according to a preset word-forming function to obtain the component word score of each candidate character string; screening out character strings whose component word score is greater than or equal to a first preset threshold from the candidate character strings as first candidate character strings; searching the first candidate character strings for a character string matching a character string in a preset dictionary as a first target character string, and taking an unmatched character string as a second character string; finding out a character string whose component word score is greater than a second preset threshold from the second character strings as a second target character string, the first target character string and the second target character string constituting the target character strings; searching whether the target character string is located in a preset knowledge base, the preset knowledge base comprising character strings and corresponding categories; and, when the target character string is located in the preset knowledge base, taking the category of the target character string in the preset knowledge base as the target category of the target character string.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the target character string is not located in the preset knowledge base and comprises a plurality of characters, performing word segmentation on the target character string to obtain a plurality of character units of each target character string; and when the part of speech of the character unit at the preset position in the character units of the target character string is a noun and the character unit at the preset position is located in the preset knowledge base, taking the category of the character unit at the preset position in the preset knowledge base as the target category of the target character string.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the target character string is not located in the preset knowledge base and the last character of the target character string is a preset prop type character, setting the type of the target character string as a prop; and when the target character string is not located in the preset knowledge base, the target character string comprises a plurality of character units, and the last character of the target character string is not a preset prop type character, performing word segmentation on the target character string to obtain a plurality of character units of each target character string.
In one embodiment, the target category includes props and actor characters, and when the target character string is located in the preset knowledge base, the category of the target character string in the preset knowledge base is taken as the target category, including: when the category corresponding to the target character string is a first preset category, setting the category of the target character string as a prop; and when the category corresponding to the target character string is the actor character, setting the category of the target character string as the actor character.
In one embodiment, the candidate character strings include a character string composed of a single character and a character string composed of a plurality of character units, and calculating the score of each candidate character string according to a preset word function to obtain the component word score of each candidate character string includes: when the candidate character string is a character string consisting of a single character, acquiring a preset value as the component word score of the candidate character string; when the candidate character string is a character string consisting of a plurality of character units, calculating the component word score of the candidate character string by adopting the preset word function to obtain a second score, wherein the preset word function is as follows:
Score=min(HL(W),HR(W))*PMI(x,y)
wherein HL(W) is the left entropy of the candidate character string W,
HL(W) = -Σ_{k=1..KL} pl_k · log(pl_k)
KL is the number of character strings to the left of the candidate character string W, pl_k is the frequency of occurrence of the k-th character string to the left of the candidate character string W, HR(W) is the right entropy of the candidate character string W,
HR(W) = -Σ_{k=1..KR} pr_k · log(pr_k)
KR is the number of character strings to the right of the candidate character string W, pr_k is the frequency of occurrence of the k-th character string to the right of the candidate character string W, min(HL(W), HR(W)) means that the minimum value is selected from the left entropy and the right entropy, PMI(x, y) is the mutual information of the candidate character string W,
PMI(x, y) = log(p(x, y) / (p(x) · p(y)))
x is the character unit on the left side of the candidate character string W, y is the character unit on the right side of the candidate character string W, p(x) is the probability of x occurring alone, p(y) is the probability of y occurring alone, and p(x, y) is the probability of x and y occurring simultaneously.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of mining text data, the method comprising:
acquiring text data, wherein the text data comprises a plurality of candidate character strings;
calculating the score of each candidate character string according to a preset word function to obtain the component word score of each candidate character string;
screening out character strings with the component word score larger than or equal to a first preset threshold value from the candidate character strings to serve as first candidate character strings;
searching a character string matched with a character string in a preset dictionary from the first candidate character string to serve as a first target character string, and using an unmatched character string as a second character string;
finding out a character string with the component word score larger than a second preset threshold value from the second character string as a second target character string, wherein the first target character string and the second target character string form a target character string;
searching whether the target character string is located in a preset knowledge base or not, wherein the preset knowledge base comprises the character string and a corresponding category;
and when the target character string is located in a preset knowledge base, taking the category of the target character string in the preset knowledge base as the target category of the target character string.
2. The method of claim 1, further comprising:
when the target character string is not located in the preset knowledge base and comprises a plurality of characters, performing word segmentation on the target character string to obtain a plurality of character units of each target character string;
and when the part of speech of the character unit at the preset position in the character units of the target character string is a noun and the character unit at the preset position is located in the preset knowledge base, taking the category of the character unit at the preset position in the preset knowledge base as the target category of the target character string.
3. The method of claim 2, further comprising:
when the target character string is not located in the preset knowledge base and the last character of the target character string is a preset prop type character, setting the type of the target character string as a prop;
and when the target character string is not located in the preset knowledge base, the target character string comprises a plurality of characters, and the last character of the target character string is not a preset prop type character, performing the word segmentation on the target character string to obtain a plurality of character units of each target character string.
4. The method of claim 1, wherein the target categories comprise props and actor characters, and wherein, when the target character string is located in the preset knowledge base, taking the category of the target character string in the preset knowledge base as the target category of the target character string comprises:
when the category corresponding to the target character string is a first preset category, setting the category of the target character string as a prop;
and when the category corresponding to the target character string is the actor role, setting the category of the target character string as the actor role.
5. The method according to any one of claims 1 to 4, wherein the candidate character strings include a character string composed of a single character and a character string composed of a plurality of character units, and the calculating of the score of each candidate character string according to a preset word function to obtain the word composition score of each candidate character string includes:
when the candidate character string is a character string consisting of single characters, acquiring a preset value as a component word score of the candidate character string;
when the candidate character string is a character string composed of a plurality of character units, calculating the component word score of the candidate character string by using the preset word function to obtain a second score, wherein the preset word function is Score = min(HL(W), HR(W)) * PMI(x, y), and HL(W) is the left entropy of the candidate character string W,
HL(W) = -Σ_{k=1..KL} pl_k · log(pl_k)
KL is the number of character strings to the left of the candidate character string W, pl_k is the frequency of occurrence of the k-th character string to the left of the candidate character string W, HR(W) is the right entropy of the candidate character string W,
HR(W) = -Σ_{k=1..KR} pr_k · log(pr_k)
KR is the number of character strings to the right of the candidate character string W, pr_k is the frequency of occurrence of the k-th character string to the right of the candidate character string W, min(HL(W), HR(W)) means that the minimum value is selected from the left entropy and the right entropy, PMI(x, y) is the mutual information of the candidate character string W,
PMI(x, y) = log(p(x, y) / (p(x) · p(y)))
x is the character unit on the left side of the candidate character string W, y is the character unit on the right side of the candidate character string W, p(x) is the probability of x occurring independently, p(y) is the probability of y occurring independently, and p(x, y) is the probability of x and y occurring simultaneously.
6. The method of claim 5, wherein the preset value is 0.
7. An apparatus for mining text data, the apparatus comprising:
the data acquisition module is used for acquiring text data, and the text data comprises a plurality of candidate character strings;
the component word score calculation module is used for calculating the score of each candidate character string according to a preset word function to obtain the component word score of each candidate character string;
the first screening module is used for screening out character strings of which the component word scores are greater than or equal to a first preset threshold value from the candidate character strings as first candidate character strings;
the dictionary query module is used for searching a character string matched with a character string in a preset dictionary from the first candidate character string to serve as a first target character string, and an unmatched character string serves as a second character string;
the second screening module is used for finding out a character string with the component word score larger than a second preset threshold value from the second character string as a second target character string, and the first target character string and the second target character string form a target character string;
the knowledge base query module is used for searching whether the target character string is located in a preset knowledge base or not, wherein the preset knowledge base comprises the character string and a corresponding category;
and the classification module is used for taking the category of the target character string in a preset knowledge base as the target category of the target character string when the target character string is positioned in the preset knowledge base.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the word segmentation module is used for segmenting the target character string to obtain a plurality of character units of each target character string when the target character string is not located in the preset knowledge base and the target character string comprises a plurality of characters;
the classification module is further configured to, when the part of speech of the character unit located at the preset position in the character units of the target character string is a noun and the character unit located at the preset position is located in the preset knowledge base, use the category of the character unit located at the preset position in the preset knowledge base as the target category of the target character string.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202010124827.9A 2020-02-27 2020-02-27 Method, device, computer equipment and storage medium for mining text data Active CN111353020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010124827.9A CN111353020B (en) 2020-02-27 2020-02-27 Method, device, computer equipment and storage medium for mining text data

Publications (2)

Publication Number Publication Date
CN111353020A true CN111353020A (en) 2020-06-30
CN111353020B CN111353020B (en) 2023-06-30

Family

ID=71192364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010124827.9A Active CN111353020B (en) 2020-02-27 2020-02-27 Method, device, computer equipment and storage medium for mining text data

Country Status (1)

Country Link
CN (1) CN111353020B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012014687A (en) * 2010-06-04 2012-01-19 Fujitsu Ltd Phrase classification processing device, phrase retrieval device, character input support device, phrase classification processing method, phrase extraction method, character input support method, syntax analysis error correction device, syntax analysis error correction method, and computer readable storage medium
CN108268440A (en) * 2017-01-04 2018-07-10 普天信息技术有限公司 A kind of unknown word identification method
CN107622051A (en) * 2017-09-14 2018-01-23 马上消费金融股份有限公司 A kind of neologisms screening technique and device
CN108491373A (en) * 2018-02-01 2018-09-04 北京百度网讯科技有限公司 A kind of entity recognition method and system
CN108829658A (en) * 2018-05-02 2018-11-16 石家庄天亮教育科技有限公司 The method and device of new word discovery
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUANG Wei; GAO Bing; LIU Yi; YANG Kewei: "Word segmentation method for military texts based on term combination" *

Also Published As

Publication number Publication date
CN111353020B (en) 2023-06-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant