WO2022062523A1 - Artificial intelligence-based text mining method, related apparatus and device - Google Patents

Artificial intelligence-based text mining method, related apparatus and device

Info

Publication number
WO2022062523A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
domain
domain candidate
candidate word
text
Prior art date
Application number
PCT/CN2021/102745
Other languages
English (en)
French (fr)
Inventor
蒋杰
杜广雷
石志林
张长旺
张纪红
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2022062523A1
Priority to US18/073,519 (published as US20230111582A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • G06F40/129Handling non-Latin characters, e.g. kana-to-kanji conversion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26Visual data mining; Browsing structured data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/53Processing of non-Latin text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • This application relates to the field of natural language processing and big data processing, and in particular, to text mining.
  • NLP natural language processing
  • new words can be found based on statistical methods. This method first obtains candidate words, and then calculates the word-formation probability according to the statistical feature values of the candidate words; candidate words whose degree of solidification and degree of freedom exceed certain feature thresholds are regarded as new words.
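The threshold-based filtering described above can be sketched in a few lines; the feature values and thresholds below are invented placeholders for illustration, not values from this application:

```python
# Conventional statistics-based new word discovery: keep a candidate only if
# its solidification and freedom both exceed manually set feature thresholds.
# Threshold values and candidate statistics are made up for illustration.

def filter_new_words(candidates, solidification_threshold=5.0, freedom_threshold=1.5):
    """candidates maps word -> (solidification, freedom)."""
    return [
        word
        for word, (solidification, freedom) in candidates.items()
        if solidification > solidification_threshold and freedom > freedom_threshold
    ]

candidates = {
    "chocolate": (8.2, 2.1),  # cohesive and freely used -> kept as a new word
    "ocola": (7.9, 0.1),      # cohesive fragment, never used alone -> rejected
    "the cat": (1.3, 2.5),    # freely used but not cohesive -> rejected
}
print(filter_new_words(candidates))  # ['chocolate']
```

The drawback the application points out is visible here: both thresholds must be tuned by hand for each corpus.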
  • the embodiments of the present application provide an artificial intelligence-based text mining method, related devices, and equipment, which can use machine learning algorithms to filter out new words from domain candidate words, avoid the process of manually setting a large number of feature thresholds, and reduce labor costs, and thus can be well adapted to the rapid emergence of specialized new words in the Internet age.
  • an artificial intelligence-based text mining method including:
  • the domain seed word is determined to be the domain new word.
  • a text mining device comprising:
  • the acquisition module is used to acquire the domain candidate word feature corresponding to the domain candidate word
  • the obtaining module is also used to obtain the word quality score corresponding to the domain candidate word according to the feature of the domain candidate word;
  • a determination module used to determine new words according to the word quality scores corresponding to the domain candidate words
  • the acquisition module is also used to acquire the associated text according to the new word
  • the determining module is further configured to determine that the domain seed word is a domain new word if it is determined, according to the associated text, that the domain seed word satisfies the domain new word mining condition.
  • a computer device including: a memory, a transceiver, a processor, and a bus system;
  • the memory is used to store the program
  • the processor is used to execute the program in the memory, including the method of executing the above aspects;
  • the bus system is used to connect the memory and the processor so that the memory and the processor can communicate.
  • Another aspect of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is used to execute the method of the above aspect.
  • Another aspect of the present application provides a computer program product or computer program, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided by the above aspects.
  • a computer program product comprising instructions, when run on a computer, causes the computer to perform the method provided by the above aspects.
  • the embodiments of the present application have the following advantages:
  • an artificial intelligence-based text mining method is provided, first obtaining the domain candidate word features corresponding to the domain candidate word, and then obtaining the word quality score corresponding to the domain candidate word according to the domain candidate word feature, Then, the new words are determined according to the word quality scores corresponding to the domain candidate words, and the associated text is obtained according to the new words. If it is determined according to the associated text that the domain seed word satisfies the domain new word mining conditions, the domain seed word is determined to be the domain new word.
  • new words can be screened out from domain candidate words based on machine learning algorithms, avoiding the process of manually setting a large number of feature thresholds, thereby reducing labor costs; thus, the method can be well adapted to the rapid emergence of specialized new words in the Internet age.
  • FIG. 1 is a schematic structural diagram of a text mining system in an embodiment of the application
  • FIG. 2 is a schematic diagram of generating a domain corpus based on big data in an embodiment of the application
  • FIG. 3 is a schematic diagram of an embodiment of an artificial intelligence-based text mining method in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of generating sub-scores based on a decision tree in an embodiment of the application
  • FIG. 5 is a schematic diagram of generating a word quality score based on a random forest model in an embodiment of the application
  • FIG. 6 is a schematic diagram of an interface for displaying search feedback results through a search engine in an embodiment of the present application
  • FIG. 7 is a schematic diagram of an interface for manually inputting domain seed words in the embodiment of the application.
  • FIG. 9 is a schematic diagram of a training framework of the random forest model in the embodiment of the present application.
  • FIG. 10 is a schematic flowchart of training a text score estimation model in the embodiment of the application.
  • FIG. 11 is a schematic diagram of an overall flow of a text mining method in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an embodiment of a text mining apparatus in an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a server in an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a terminal device in an embodiment of the present application.
  • the embodiments of the present application provide an artificial intelligence-based text mining method, related devices, and equipment, which can use machine learning algorithms to filter out new words from domain candidate words, avoid the process of manually setting a large number of feature thresholds, and reduce labor costs, and thus can be well adapted to the rapid emergence of specialized new words in the Internet age.
  • the present application provides a text mining method based on artificial intelligence, which is used to discover new words and can further discover domain new words.
  • the text mining method provided in this application is applied to the field of artificial intelligence (Artificial Intelligence, AI), and specifically relates to natural language processing technology and machine learning (Machine Learning, ML).
  • AI Artificial Intelligence
  • ML Machine Learning
  • this application proposes a text mining method based on artificial intelligence.
  • the method is applied to the text mining system shown in FIG. 1 .
  • the text mining system includes a server and a terminal device; the client is deployed on the terminal device, and the text mining platform is deployed on the server as the text mining device.
  • the server involved in this application may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN) services, and big data and artificial intelligence platforms.
  • the terminal device may be a smart phone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, a smart TV, a smart watch, etc., but is not limited thereto.
  • the terminal device and the server can be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • the number of servers and terminal devices is also not limited.
  • FIG. 2 is a schematic diagram of generating a domain corpus based on big data in an embodiment of the application.
  • the text mining platform accesses data in the data platform, thereby obtaining files, wherein a file can be a network file that a user has accessed through a browser, or a network file that is continuously captured from various websites through web crawling technology.
  • the collected files are sorted according to the collection time, the content of each file is parsed, and the qualified text is extracted and added to the domain corpus. The text mining platform can also perform word segmentation processing, domain candidate word extraction processing, and sentiment analysis processing on the text in the domain corpus, and then realize new word discovery, content matching, and thesaurus matching operations.
  • the thesauruses used include, but are not limited to, an industry thesaurus, a sentiment thesaurus, and a spam thesaurus. Based on the results of new word discovery, topic statistics, hot word statistics, sentiment analysis, and content classification can also be performed, and finally the application of the data can be realized.
  • the data platform can provide big data, which belongs to a branch of cloud technology.
  • Cloud technology refers to the unification of a series of resources such as hardware, software, and network in a wide area network or a local area network.
  • Domain new words: proprietary words or common words mainly used in a certain field; for example, "Glory of the King" and "Eating Chicken" are new words in the game field.
  • For example, suppose a company releases a new game called "Save the Gopher". Since this game did not exist before, "Save the Gopher" is a domain new word.
  • Domain seed words: words that often appear in domain texts and can represent the meaning of the domain to a certain extent.
  • the domain seed words may be "mobile phone", "game", "mobile game", "game application", and so on.
  • Word segmentation: the process of recombining consecutive character sequences into word sequences according to certain specifications. Existing word segmentation algorithms can be divided into three categories: word segmentation methods based on string matching, word segmentation methods based on understanding, and word segmentation methods based on statistics. According to whether word segmentation is combined with part-of-speech tagging, they can also be divided into pure word segmentation methods and integrated word segmentation and tagging methods.
  • Distant supervision (remote supervised learning): in this application, it refers to using a general-domain thesaurus, or the thesaurus of a certain domain, to guide new word mining and discovery in other domains.
  • N-Gram: an algorithm based on the statistical language model. Its basic idea is to slide a window of size N over the text content, one word (or character) at a time, to form sequences of length N. In this application, domain candidate words can be generated using the N-Gram algorithm.
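A minimal sketch of the sliding-window operation (the token sequence is invented for illustration):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token sequence, one step at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Beijing", "Fuwa", "is", "Olympic", "mascot"]
print(ngrams(tokens, 2))
# [('Beijing', 'Fuwa'), ('Fuwa', 'is'), ('is', 'Olympic'), ('Olympic', 'mascot')]
```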
  • Random forest: an ensemble learning algorithm composed of decision trees. Each decision tree predicts independently, and the final result is determined by weighting all the prediction results.
  • Positive sample pool: consists of positive samples of domain candidate words. During model training, positive samples of domain candidate words are drawn from the positive sample pool.
  • Negative sample pool: consists of a large number of negative samples of domain candidate words and a small number of possible positive samples. During model training, negative samples of domain candidate words are drawn from the negative sample pool.
  • Coagulation degree (solidification): indicates the degree of closeness between the internal characters of a domain candidate word, generally measured by the posterior probability of fixed character collocations. For example, consider the degree to which "afraid" and "getting angry" are used together: if "afraid" is almost only used in combination with "getting angry", the solidification degree of the two is high, which suggests they form one word. To calculate the degree of solidification, it is necessary to first calculate P("afraid of getting angry"), P("afraid"), and P("getting angry"), that is, the probabilities with which these occur among the domain candidate words.
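One common way to combine these probabilities into a single cohesion score is a pointwise-mutual-information-style ratio; the sketch below illustrates the idea with invented probabilities and is not necessarily the exact formula used in this application:

```python
import math

def solidification(p_whole, p_left, p_right):
    """How much more often the whole candidate occurs than independent
    co-occurrence of its parts would predict (log ratio; higher = more cohesive)."""
    return math.log(p_whole / (p_left * p_right))

# Hypothetical probabilities for the "afraid of getting angry" example above:
p_whole = 0.001   # P("afraid of getting angry")
p_left = 0.002    # P("afraid")
p_right = 0.0015  # P("getting angry")
print(solidification(p_whole, p_left, p_right))  # positive -> strongly cohesive
```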
  • Degree of freedom: indicates the degree to which a domain candidate word can be used independently and freely. Generally, the left and right information entropy of the word is used to measure the degree of freedom. For example, within the Chinese word "巧克力" (chocolate), the fragment "克力" has a very high solidification degree, but its degree of free use is almost zero, so "克力" cannot be a word by itself.
  • Term frequency: the frequency with which a given domain candidate word appears in the text, that is, the ratio of the number of occurrences of the domain candidate word in the text to the total number of words contained in the text.
  • IDF Inverse document frequency
  • Term frequency-inverse document frequency (TFIDF) value: a commonly used weighting technique in information retrieval and data mining; the value is the product of the term frequency (TF) and the inverse document frequency (IDF).
  • TF term frequency
  • IDF inverse document frequency
  • Left information entropy: used to measure the richness of collocations to the left of a domain candidate word, calculated with the information entropy formula, where x ranges over all possible left collocations (a random variable). The possible left collocations are all words that have appeared immediately to the left of the domain candidate word in the analyzed content. For example, given "Hello, Little White Rabbit", "Haha, hello, Little White Rabbit", and "What are you doing, Little White Rabbit", the possible left collocations of "Little White Rabbit" are "hello" and "what are you doing".
  • the calculation formula of information entropy is as follows: H(x) = -Σ p(x_i) · log p(x_i), summed over i = 1, …, n;
  • H(x) represents the information entropy of the random variable x;
  • p(x_i) represents the probability of the i-th random event;
  • n represents the total number of random events.
  • Right information entropy: used to measure the richness of collocations to the right of a domain candidate word, calculated with the same information entropy formula, where x ranges over all possible right collocations (a random variable). The possible right collocations are all words that have appeared immediately to the right of the domain candidate word in the analyzed content.
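The left (or right) information entropy can be computed directly from the observed neighbor words; this sketch reuses the "Little White Rabbit" example above:

```python
import math
from collections import Counter

def info_entropy(neighbors):
    """H(x) = -sum over i of p(x_i) * log p(x_i), over observed neighbor words."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

# Left collocations of "Little White Rabbit" in the three example sentences:
left_neighbors = ["hello", "hello", "what are you doing"]
h_left = info_entropy(left_neighbors)
print(round(h_left, 4))  # 0.6365; richer, more varied neighbors give higher entropy
```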
  • the text mining method based on artificial intelligence in the present application will be introduced below. Please refer to FIG. 3 .
  • This embodiment can be performed by a text mining device.
  • An embodiment of the artificial intelligence-based text mining method in the embodiment of the present application includes:
  • the text mining device first obtains a large number of sentences (for example, 100,000 sentences) from the domain corpus, and then performs word segmentation on each sentence to obtain domain candidate words, where "domain candidate words" refers to one or more domain candidate words.
  • For example, there are P domain candidate words (P is an integer greater than or equal to 1). The domain candidate words are not repeated, and a corresponding domain candidate word feature can be extracted for each domain candidate word.
  • the text mining device may be a server or a terminal device, which is not limited in this application.
  • the domain candidate word involved in this application may be one word, or may be a set including at least two words.
  • the text mining device uses the domain candidate word feature as the input of the text score prediction model, and the text score prediction model outputs the word quality score corresponding to the domain candidate word feature; that is, the word quality score also corresponds to the domain candidate word.
  • the higher the word quality score, the greater the possibility that the domain candidate word is a high-quality word, where a high-quality word is a word with reasonable semantics. For example, "巧克力" (chocolate) is a high-quality word, while "克力" has no complete and reasonable semantics and is therefore not a high-quality word.
  • the text mining device can filter out new words from the domain candidate words according to the word quality scores corresponding to the domain candidate words.
  • the new words here refer to one or more new words, for example, including Q new words (Q is an integer greater than or equal to 1). For example, if the word quality score of the domain candidate word "King Zhe" reaches the quality score threshold, the domain candidate word "King Zhe” can be used as a new word.
  • Conversely, if the frequency of occurrence of a domain candidate word is low, it means that the domain candidate word may not be a common word, that is, it is determined that the domain candidate word is not a new word.
  • the text mining device crawls relevant associated texts from the search engine according to the new words.
  • the associated text can be understood as one set of texts, or as a set including at least two groups of texts; one associated text can be crawled for each new word.
  • the associated text can be embodied in the form of a document, and multiple statements are recorded in each associated text.
  • If it is determined according to the associated text that the domain seed word satisfies the domain new word mining condition, the domain seed word is determined to be a domain new word.
  • the text mining device also needs to obtain the domain seed word from the domain seed word database, and then calculate the occurrence probability of the domain seed word in the associated text. If the occurrence probability reaches the threshold, the domain new word mining condition is satisfied, and the domain seed word can be marked as a domain new word. Conversely, if the occurrence probability does not reach the threshold, the condition is not satisfied, and the domain seed word is considered not to be a domain new word.
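The mining condition described above reduces to a frequency check over the associated text; this sketch uses an invented threshold value and toy sentences:

```python
def is_domain_new_word(seed_word, associated_sentences, threshold=0.3):
    """Promote a domain seed word to a domain new word when its occurrence
    probability over the associated text reaches the threshold (the threshold
    value here is an illustrative placeholder, not from the application)."""
    hits = sum(1 for sentence in associated_sentences if seed_word in sentence)
    return hits / len(associated_sentences) >= threshold

associated = [
    "this game is a popular mobile game",
    "the new game launched today",
    "weather is nice today",
]
print(is_domain_new_word("game", associated))      # True: occurs in 2 of 3 sentences
print(is_domain_new_word("medicine", associated))  # False: never occurs
```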
  • Assuming there are 5000 domain seed words, the occurrence probability of each of these domain seed words in the associated text can be calculated separately, and it can then be judged whether each domain seed word satisfies the domain new word mining condition; any domain seed word that satisfies the condition can be determined as a domain new word.
  • the text mining method provided in this application can be applied to the discovery of new words on short texts of social network group names.
  • in the test, the accuracy rate of the top 100 ranked new words reaches 92.7%, and the accuracy rate of domain new words reaches 82.4%.
  • the accuracy rate of overall new words in the test reached 84.5%, and the accuracy rate of domain new words reached 67.2%. It can be seen that the artificial intelligence-based text mining method provided in this application can effectively mine domain new words.
  • an artificial intelligence-based text mining method is provided, first obtaining the domain candidate word features corresponding to the domain candidate word, and then obtaining the word quality score corresponding to the domain candidate word according to the domain candidate word feature, Then, the new words are determined according to the word quality scores corresponding to the domain candidate words, and the associated text is obtained according to the new words. If it is determined according to the associated text that the domain seed word satisfies the domain new word mining conditions, the domain seed word is determined to be the domain new word.
  • new words can be screened out from domain candidate words based on machine learning algorithms, avoiding the process of manually setting a large number of feature thresholds, thereby reducing labor costs; thus, the method can be well adapted to the rapid emergence of specialized new words in the Internet age.
  • a method for determining domain candidate words is introduced.
  • the text mining device obtains sentences from a domain corpus, where "sentences" refers to one or more sentences, for example, M sentences (M is an integer greater than or equal to 1).
  • a corpus stores language materials that have actually appeared in real language use.
  • a corpus is a basic resource carrying language knowledge, with the electronic computer as its carrier.
  • real corpora need to be analyzed and processed before they can become useful resources.
  • the domain corpus is a corpus for a certain domain, for example, a corpus in the game domain, or a corpus in the medical domain, etc. This application does not limit the type of domain corpus.
  • the text mining device performs word segmentation on the sentences derived from the domain corpus to obtain the corresponding text sequence.
  • Word segmentation can be implemented using a dictionary-based word segmentation algorithm or a machine learning algorithm.
  • Dictionary-based word segmentation algorithms include forward maximum matching method, reverse maximum matching method, and two-way matching word segmentation method.
  • Machine learning-based algorithms include conditional random field (CRF), Hidden Markov Model (HMM), and Support Vector Machine (SVM).
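As a sketch of the dictionary-based forward maximum matching method mentioned above (the tiny dictionary is illustrative), applied to the "Beijing/Fuwa/…" example sentence:

```python
def forward_max_match(text, dictionary, max_word_len=5):
    """Dictionary-based forward maximum matching: at each position take the
    longest dictionary word that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

dictionary = {"北京", "福娃", "奥运会", "吉祥物"}
print(forward_max_match("北京福娃是奥运会吉祥物", dictionary))
# ['北京', '福娃', '是', '奥运会', '吉祥物'] (Beijing/Fuwa/is/Olympics/mascot)
```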
  • the text sequence obtained is "Beijing/Fuwa/Yes/Olympics/Mascot", where "/" represents the separator between words.
  • at least one domain candidate word can be extracted from the text sequence.
  • at least one domain candidate word can be extracted, namely "Beijing", "Fuwa", "Yes", "Olympics", and "Mascot".
  • the N-Gram algorithm can be used to extract the domain candidate words from the text sequence; alternatively, a supervised algorithm, a semi-supervised algorithm, or an unsupervised algorithm can be used to extract the domain candidate words from the text sequence, which is not limited here.
  • indicators such as the word frequency, TFIDF value, degree of solidification, degree of freedom, left information entropy, right information entropy, word length, mutual information, position information, and word span of the domain candidate words in the sentence can be counted, and one or more of these indicators can be used as the domain candidate word features corresponding to the domain candidate words.
  • a method for extracting the features of domain candidate words is provided.
  • sentences are obtained from the domain corpus and then subjected to word segmentation processing, and the segmented text sequences are used as the source of the domain candidate words.
  • the relevant domain candidate words are extracted from the text sequences, and the domain candidate word features corresponding to each domain candidate word are then further extracted, thereby improving the feasibility and operability of the solution.
  • acquiring domain candidate words according to a text sequence specifically includes the following steps:
  • the domain candidate words corresponding to the text sequence are obtained, wherein the word number sampling threshold represents the upper limit of the number of words in a domain candidate word, and the character number sampling threshold represents the upper limit of the number of characters in a domain candidate word.
  • a method for obtaining domain candidate words based on the N-Gram algorithm is introduced, and the N-Gram algorithm is used to sample a text sequence to obtain domain candidate words.
  • the N-Gram algorithm involves two hyperparameters, namely the word number sampling threshold (N) and the character number sampling threshold (maxLen).
  • the word number sampling threshold is used to control the maximum number of words that can be selected for combination, that is, the upper limit of the number of words in a domain candidate word.
  • the number of characters sampling threshold is used to control the maximum length of the domain candidate words, that is, the upper limit of the number of characters in the domain candidate words.
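The two hyperparameters can be sketched together as follows; the token sequence and the N and maxLen values are illustrative:

```python
def domain_candidates(tokens, N=3, max_len=6):
    """Generate domain candidate words from a segmented text sequence: combine
    up to N consecutive words (word number sampling threshold) and discard any
    combination longer than max_len characters (character number sampling threshold)."""
    candidates = []
    for n in range(1, N + 1):
        for i in range(len(tokens) - n + 1):
            candidate = "".join(tokens[i:i + n])
            if len(candidate) <= max_len:
                candidates.append(candidate)
    return candidates

tokens = ["福", "娃", "奥运"]
print(domain_candidates(tokens, N=2, max_len=3))
# ['福', '娃', '奥运', '福娃', '娃奥运']
```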
  • a method for obtaining domain candidate words based on the N-Gram algorithm is provided.
  • the N-Gram algorithm can be used to evaluate whether a sentence is reasonable, and can also be used to evaluate the degree of difference between two character strings.
  • the N-Gram model contains all the information that the preceding words can provide; these words have a strong binding force on the appearance of the current word, which is conducive to extracting more accurate and richer domain candidate words.
  • acquiring domain candidate word features according to domain candidate words specifically includes the following steps:
  • the domain candidate word features corresponding to the domain candidate words are obtained according to the text sequence, wherein the domain candidate word features include at least one of the word frequency, the term frequency-inverse document frequency (TFIDF) value, the degree of freedom, the degree of solidification, the left information entropy, and the right information entropy.
  • a method for extracting the features of domain candidate words is introduced.
  • the word frequency, TFIDF value, degree of freedom, degree of solidification, left information entropy, and right information entropy corresponding to the domain candidate word can be extracted. The following takes the domain candidate word "Fuwa" as an example to introduce how the domain candidate word features are acquired.
  • the word frequency of the domain candidate word "Fuwa" represents the probability that the domain candidate word appears in the sentence (or text sequence). Usually, the more frequently a word appears in the text, the more likely it is to be a core word. Assuming that the domain candidate word "Fuwa" appears m times in the sentence (or text sequence), and the total number of words in the sentence (or text sequence) is n, the word frequency of the domain candidate word "Fuwa" is calculated as: TF_w = m / n;
  • w represents the domain candidate word "Fuwa”
  • TF w represents the word frequency of the domain candidate word "Fuwa”
  • m represents the number of times the domain candidate word "Fuwa” appears in the sentence (or text sequence)
  • n represents the total number of words in the sentence (or text sequence).
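With m and n as defined above, the word frequency computation can be sketched as follows (token sequence invented for illustration):

```python
def term_frequency(word, tokens):
    """TF_w = m / n: occurrences of the word divided by total words in the sequence."""
    return tokens.count(word) / len(tokens)

tokens = ["Fuwa", "is", "Olympic", "mascot", "Fuwa"]
print(term_frequency("Fuwa", tokens))  # 2 / 5 = 0.4
```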
  • the TFIDF value of the domain candidate word "Fuwa” is calculated from two parts, namely the word frequency and the inverse document frequency.
  • the inverse document frequency of the domain candidate word "Fuwa" reflects how common the domain candidate word is across the domain corpus, and is calculated as: IDF_w = log(Y / X);
  • w represents the domain candidate word "Fuwa”
  • IDF w represents the inverse document frequency of the domain candidate word "Fuwa”
  • X represents the number of sentences in the domain corpus that include the domain candidate word "Fuwa";
  • Y represents the total number of sentences in the domain corpus.
  • TFIDF_w = TF_w × IDF_w;
  • where w represents the domain candidate word "Fuwa";
  • TF_w represents the word frequency of the domain candidate word "Fuwa";
  • IDF_w represents the inverse document frequency of the domain candidate word "Fuwa".
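The TF, IDF and TFIDF calculations above can be sketched as follows (Python; the counts for "Fuwa" are hypothetical, and the logarithm base is an assumption — any fixed base gives a consistent ranking):

```python
import math

def term_frequency(m, n):
    # TF_w = m / n: occurrences of the word over total words in the sequence
    return m / n

def inverse_document_frequency(X, Y):
    # IDF_w = log(Y / X): Y is the total number of sentences in the domain
    # corpus, X is the number of sentences that contain the candidate word
    return math.log(Y / X)

def tfidf(m, n, X, Y):
    # TFIDF_w = TF_w × IDF_w
    return term_frequency(m, n) * inverse_document_frequency(X, Y)

# Hypothetical counts for the candidate word "Fuwa"
tf = term_frequency(3, 100)                  # appears 3 times among 100 words
idf = inverse_document_frequency(10, 1000)   # in 10 of 1000 corpus sentences
score = tfidf(3, 100, 10, 1000)
```

A rarer word (smaller X) gets a larger IDF, so a word that is frequent in one sentence but rare across the corpus receives a high TFIDF value.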
  • the degree of freedom of the domain candidate word "Fuwa" can be measured with entropy.
  • the probability of each Chinese character appearing on the left and right sides of the domain candidate word "Fuwa" can be counted, the left information entropy and right information entropy can be calculated according to the entropy formula, and the smaller of the left neighbor entropy and the right neighbor entropy of a word is used as its final degree of freedom:
  • H(w) = -Σ_{i=1}^{C} p(w_i) log p(w_i);
  • where H(w) represents the information entropy of the domain candidate word "Fuwa";
  • p(w_i) represents the probability of the i-th neighboring character of the domain candidate word "Fuwa";
  • C represents the total number of random events (distinct neighboring characters).
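The degree-of-freedom computation can be sketched as follows (Python; the neighbor characters are hypothetical, and the natural logarithm is assumed):

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    # H = -sum p_i * log p_i over the observed neighboring characters
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def degree_of_freedom(left_neighbors, right_neighbors):
    # The smaller of the left and right neighbor entropies is the final
    # degree of freedom of the candidate word
    return min(neighbor_entropy(left_neighbors),
               neighbor_entropy(right_neighbors))

# Hypothetical neighbor characters observed around a candidate word
left = list("abcabd")   # varied left neighbors -> higher left entropy
right = list("xxxxxy")  # almost always the same right neighbor -> low entropy
dof = degree_of_freedom(left, right)
```

A word that is always followed by the same character has low right entropy, so its degree of freedom is low and it is less likely to be an independent word.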
  • a method for extracting the features of domain candidate words is provided.
  • the feature quantification of domain candidate words can be carried out: the relevant features of the domain candidate words are extracted from the dimensions of word weight, word position in the document, and word-related information, and together constitute the domain candidate word features.
  • the domain candidate word features can well express the characteristics of the domain candidate words and help to obtain more accurate domain candidate word evaluation results.
  • obtaining the word quality score corresponding to the domain candidate word according to the feature of the domain candidate word includes the following step:
  • the word quality score corresponding to the domain candidate word is obtained through the text score prediction model.
  • the text score prediction model may be a decision tree model, a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) model, an extreme gradient boosting (XGBoost) model, or a random forest (Random Forest, RF) model, etc.
  • the following takes the case where the text score estimation model is a random forest model as an example for illustration.
  • the random forest model consists of T decision trees, and the decision trees are independent of each other. After the random forest model is obtained, when the domain candidate word feature corresponding to the domain candidate word is input, each decision tree in the random forest model makes a judgment on whether the domain candidate word is a high-quality word. If the domain candidate word is a high-quality word, the decision tree scores the domain candidate word, recorded as "score"; if the domain candidate word is not a high-quality word, the decision tree does not score the domain candidate word, recorded as "no score". For ease of understanding, please refer to FIG. 4.
  • FIG. 4 is a schematic structural diagram of generating sub-scores based on a decision tree in an embodiment of the application.
  • the domain candidate word feature corresponding to the domain candidate word "Fuwa" is input into one of the decision trees. The decision tree first selects the next branch based on the word frequency included in the domain candidate word feature. Assuming that the word frequency is 0.2, it continues to judge whether the TFIDF value included in the feature is greater than 0.5. Assuming that the TFIDF value is 0.8, it continues to judge whether the right information entropy included in the feature is greater than 0.8. Assuming that the right information entropy is 0.9, the domain candidate word "Fuwa" is scored 1 point, that is, the sub-score output by the decision tree is 1.
  • FIG. 5 is a schematic diagram of generating a word quality score based on a random forest model in an embodiment of the present application. As shown in the figure, taking T equal to 100 as an example, 100 sub-scores can be obtained, and accumulating them yields the word quality score, with a full score of 100.
  • weight values can also be assigned to different decision trees.
  • the weight value of decision tree 1 to decision tree 10 is 1, and the weight value of decision tree 11 to decision tree 100 is 0.5.
  • different weight values are multiplied by the corresponding sub-scores and accumulated to obtain the final word quality score.
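The weighted accumulation of sub-scores can be sketched as follows (Python; the particular split of sub-scores across the 100 trees is hypothetical, the weights are those of the example above):

```python
def word_quality_score(sub_scores, weights=None):
    # Each decision tree outputs a sub-score of 1 ("score") or 0 ("no score").
    # With no weights, the word quality score is the plain sum of sub-scores;
    # otherwise each sub-score is multiplied by its tree's weight and summed.
    if weights is None:
        return sum(sub_scores)
    return sum(w * s for w, s in zip(weights, sub_scores))

# 100 trees: suppose the first 10 all score the word, the rest split evenly
sub_scores = [1] * 10 + [1, 0] * 45
weights = [1.0] * 10 + [0.5] * 90   # trees 1-10 weigh 1, trees 11-100 weigh 0.5
plain = word_quality_score(sub_scores)
weighted = word_quality_score(sub_scores, weights)
```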
  • a method for outputting word quality scores by using a random forest model is provided.
  • the word quality scores predicted by the random forest model have a high accuracy rate, and using multiple decision trees can effectively evaluate the importance of the domain candidate word features in the classification problem.
  • determining the new word from the domain candidate word according to the word quality score corresponding to the domain candidate word includes the following steps:
  • if the word quality score corresponding to the domain candidate word is greater than or equal to the quality score threshold, it is determined that the domain candidate word belongs to a new word;
  • if the word quality score corresponding to the domain candidate word is less than the quality score threshold, it is determined that the domain candidate word does not belong to a new word.
  • a method for judging new words based on word quality scores is introduced.
  • a domain candidate word is used as an example for introduction; whether other domain candidate words belong to new words is determined in a similar manner, which will not be repeated here.
  • take the quality score threshold equal to 60 as an example.
  • in the first case, it is assumed that the word quality score of the domain candidate word is 80, that is, the word quality score 80 corresponding to the domain candidate word is greater than the quality score threshold 60. Therefore, the domain candidate word can be determined as a new word.
  • in the second case, it is assumed that the word quality score of the domain candidate word is 50, that is, the word quality score 50 corresponding to the domain candidate word is less than the quality score threshold 60, so it can be determined that the domain candidate word is not a new word.
  • a method for judging new words based on the word quality score is provided.
  • a domain candidate word with a higher word quality score is used as a new word, which can ensure to a certain extent that the new word has high quality and can serve as a candidate for a domain new word, thereby improving the reliability and accuracy of the selected new words.
  • determining the new word from the domain candidate word according to the word quality score corresponding to the domain candidate word includes the following steps:
  • if the word quality score corresponding to the domain candidate word is greater than or equal to the quality score threshold, and the word frequency corresponding to the domain candidate word is greater than or equal to the first word frequency threshold, it is determined that the domain candidate word is a new word;
  • if the word quality score corresponding to the domain candidate word is less than the quality score threshold, or the word frequency corresponding to the domain candidate word is less than the first word frequency threshold, it is determined that the domain candidate word does not belong to a new word.
  • a method for jointly judging new words based on word quality scores and word frequencies is introduced.
  • a domain candidate word is used as an example for introduction; whether other domain candidate words belong to new words is determined in a similar manner, which will not be repeated here.
  • take the quality score threshold equal to 60 and the first word frequency threshold equal to 0.2 as an example.
  • in the first case, it is assumed that the word quality score of the domain candidate word is 80 and the word frequency corresponding to the domain candidate word is 0.5, that is, the word quality score 80 is greater than the quality score threshold 60, and the word frequency 0.5 is greater than or equal to the first word frequency threshold 0.2. Thus, the domain candidate word can be determined as a new word.
  • in the second case, it is assumed that the word quality score of the domain candidate word is 50 and the word frequency corresponding to the domain candidate word is 0.5. Although the word frequency 0.5 is greater than the first word frequency threshold 0.2, the word quality score 50 is less than the quality score threshold 60, so it can be determined that the domain candidate word does not belong to a new word.
  • in the third case, it is assumed that the word quality score of the domain candidate word is 80 and the word frequency corresponding to the domain candidate word is 0.1, that is, the word quality score 80 is greater than the quality score threshold 60, but the word frequency 0.1 is less than the first word frequency threshold 0.2, so it can be determined that the domain candidate word does not belong to a new word.
  • in the fourth case, it is assumed that the word quality score of the domain candidate word is 50 and the word frequency corresponding to the domain candidate word is 0.1, that is, the word quality score 50 is less than the quality score threshold 60, and the word frequency 0.1 is less than the first word frequency threshold 0.2, so it can be determined that the domain candidate word does not belong to a new word.
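The joint decision over the four cases can be sketched as follows (Python; the threshold values are those of the example above):

```python
def is_new_word(quality_score, word_freq,
                quality_threshold=60, freq_threshold=0.2):
    # A domain candidate word is a new word only when BOTH its word quality
    # score and its word frequency reach their respective thresholds.
    return quality_score >= quality_threshold and word_freq >= freq_threshold

# The four cases from the text
case1 = is_new_word(80, 0.5)  # both thresholds met -> new word
case2 = is_new_word(50, 0.5)  # quality score too low
case3 = is_new_word(80, 0.1)  # word frequency too low
case4 = is_new_word(50, 0.1)  # both too low
```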
  • a method for jointly judging new words based on word quality score and word frequency is provided.
  • a domain candidate word with a higher word quality score is used as a new word, which can ensure to a certain extent that the new word has high quality and can serve as a candidate for a domain new word, thereby improving the reliability and accuracy of the selected new words.
  • in addition, selecting words with higher word frequency as new words can ensure to a certain extent that the new words have a higher transmission rate and are more in line with the definition of new words.
  • obtaining the associated text according to the new word specifically includes the following steps:
  • search feedback results corresponding to new words through a search engine, wherein the search feedback results include at least one search result
  • the first R search results with the highest relevance are determined from at least one search result as the associated text corresponding to the new word, where R is an integer greater than or equal to 1.
  • a method for acquiring associated text is introduced. After the new words are acquired, the new words need to be searched.
  • a new word is used as an example for introduction; the associated texts of other new words are obtained in a similar manner, which will not be repeated here.
  • a search feedback result can be obtained after inputting the new word into a search engine, and the search feedback result includes at least one search result.
  • FIG. 6 is a schematic diagram of an interface for displaying search feedback results through a search engine in this embodiment of the application. As shown in the figure, a search feedback result is obtained after inputting the new word "Shang Wang Zhe (reach King rank)", and the search feedback result includes 10 search results; the results shown in Table 1 are obtained after sorting the 10 search results in descending order of relevance.
  • the top R search results with the highest degree of relevance can be used as the associated text of the new word "Shang Wang Zhe". Assuming that R is equal to 5, the associated text includes 5 search results, namely "What kind of mobile phone should I use to beat the king?", "What should I do if I get stuck in the game?", "Do I pay traffic to the king?", "What song does the king listen to with rhythm?" and "The software that the king can make money."
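Selecting the top R most relevant search results as associated text can be sketched as follows (Python; the result titles and relevance scores are hypothetical):

```python
def associated_text(search_results, R=5):
    # search_results: list of (title, relevance) pairs from the search engine.
    # Keep the R results with the highest relevance as the associated text.
    ranked = sorted(search_results, key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in ranked[:R]]

# Hypothetical search feedback results with relevance scores
results = [("result A", 0.91), ("result B", 0.85), ("result C", 0.97),
           ("result D", 0.40), ("result E", 0.73), ("result F", 0.66)]
texts = associated_text(results, R=3)
```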
  • a method for obtaining associated text is provided.
  • the search feedback results of the search engine are used as a criterion for evaluating the frequency of use of new words, which is closer to the actual usage of new words and helps to find associated texts related to domain new words.
  • a method for judging whether a domain seed word satisfies the domain new word mining condition based on the average word frequency is introduced below.
  • the domain seed word is usually a manually entered word.
  • FIG. 7 is a schematic diagram of an interface for manually inputting domain seed words in an embodiment of the application. As shown in the figure, users can input new domain seed words or delete existing domain seed words through the interface. Each domain seed word corresponds to a word identifier, and each domain seed word needs to be marked with its corresponding domain.
  • for example, the "mobile game" domain can include the domain seed words "playing the king", "eat chicken" and "up points"; to add a new domain seed word, click "+" and enter the relevant information.
  • the to-be-processed word frequency of the domain seed word "eat chicken" is calculated based on the associated texts.
  • Table 2 shows an indication of the to-be-processed word frequencies of the domain seed word in the associated texts.
  • the associated text here refers to one or more associated texts, for example, Q associated texts (Q is an integer greater than or equal to 1); that is, associated texts are in one-to-one correspondence with new words, and each associated text identifier is used to indicate the associated text corresponding to a new word.
  • assuming that the second word frequency threshold is 0.1, the average word frequency of the domain seed word "eat chicken" is greater than the second word frequency threshold 0.1. Therefore, the domain seed word "eat chicken" can be determined as a domain new word that satisfies the domain new word mining condition.
  • a method for determining whether a domain seed word satisfies the domain new word mining condition based on the average word frequency is provided.
  • if the average word frequency reaches the word frequency threshold, it is considered that the frequency of use of the domain seed word is relatively high, so the domain seed word can be determined as a domain new word, thereby improving the feasibility of the scheme.
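The average-word-frequency check can be sketched as follows (Python; the per-text frequencies of "eat chicken" are hypothetical):

```python
def average_word_frequency(per_text_freqs):
    # per_text_freqs: the to-be-processed word frequency of the seed word in
    # each of the Q associated texts; the average decides the mining condition.
    return sum(per_text_freqs) / len(per_text_freqs)

def satisfies_mining_condition(per_text_freqs, freq_threshold=0.1):
    # The seed word is a domain new word when its average word frequency
    # reaches the second word frequency threshold
    return average_word_frequency(per_text_freqs) >= freq_threshold

# Hypothetical frequencies of "eat chicken" across 5 associated texts
freqs = [0.05, 0.2, 0.15, 0.1, 0.25]
avg = average_word_frequency(freqs)
accepted = satisfies_mining_condition(freqs, freq_threshold=0.1)
```

The maximum-word-frequency variant of the following embodiment would simply replace the average with `max(per_text_freqs)`.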
  • a method for determining whether a domain seed word satisfies the domain new word mining condition based on the maximum word frequency is introduced below. First, the domain seed word needs to be obtained, and then whether the domain seed word belongs to a domain new word is determined based on the associated texts.
  • the domain seed words are usually manually entered words. For the specific entry method, please refer to the foregoing embodiments, which will not be repeated here.
  • the to-be-processed word frequency of the domain seed word "eat chicken" is calculated based on the associated texts.
  • Table 3 shows an indication of the to-be-processed word frequencies of the domain seed word in the associated texts.
  • the associated text here refers to one or more associated texts, for example, Q associated texts (Q is an integer greater than or equal to 1); that is, associated texts are in one-to-one correspondence with new words, and each associated text identifier is used to indicate the associated text corresponding to a new word.
  • assuming that the maximum word frequency of the domain seed word "eat chicken" is 0.8 and the second word frequency threshold is 0.7, the maximum word frequency 0.8 of the domain seed word "eat chicken" is greater than the second word frequency threshold 0.7. Therefore, the domain seed word "eat chicken" is determined as a domain new word that satisfies the domain new word mining condition.
  • a method for determining whether a domain seed word satisfies the domain new word mining condition based on the maximum word frequency is provided.
  • if the maximum word frequency reaches the word frequency threshold, it is considered that the frequency of use of the domain seed word is relatively high, so the domain seed word can be determined as a domain new word, thereby improving the feasibility of the scheme.
  • FIG. 8 is a schematic flowchart of mining domain new words in the embodiment of the application, as shown in the figure, specifically:
  • step A1 a statement is obtained from the domain corpus, wherein the statement may include M statements;
  • step A2 word segmentation is performed on the acquired statement to obtain a corresponding text sequence, wherein the text sequence may include M text sequences;
  • step A3 N-Gram is used to extract domain candidate words from the text sequence
  • step A4 calculate the domain candidate word feature of the domain candidate word
  • step A5 input this domain candidate word feature into the trained random forest model for prediction, and output the word quality score by the random forest model;
  • step A6 determine whether the word quality score of the domain candidate word is greater than or equal to the quality score threshold; if the word quality score is greater than or equal to the quality score threshold, then execute step A7; if the word quality score is less than the quality score threshold, then execute step A8;
  • step A7 it is judged whether the word frequency of the domain candidate word is greater than or equal to the first word frequency threshold, if the word frequency of the domain candidate word is greater than or equal to the first word frequency threshold, then step A9 is executed, if the word frequency of the domain candidate word is less than the first word frequency threshold, then execute step A8;
  • step A8 it is determined that the field candidate words are meaningless words
  • step A9 determine that this field candidate word is a new word
  • step A10 obtain the field seed word from the field seed vocabulary
  • step A11 new words are used to search for associated text
  • step A12 based on the searched associated text, the average word frequency (or maximum word frequency) of the seed words in this field can be calculated;
  • step A13 determine whether the average word frequency (or maximum word frequency) of the domain seed word is greater than or equal to the second word frequency threshold; if the average word frequency (or maximum word frequency) of the domain seed word is greater than or equal to the second word frequency threshold, then execute step A15; if the average word frequency (or maximum word frequency) of the domain seed word is less than the second word frequency threshold, then execute step A14;
  • step A14 it is determined that the new word is not a domain new word
  • step A15 it is determined that the new word is a domain new word.
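Step A3 above extracts domain candidate words with N-Gram; a minimal character-level sketch (Python; the sample sentence and the maximum N are hypothetical):

```python
def extract_ngram_candidates(tokens, max_n=3):
    # Slide windows of length 1..max_n over the segmented text sequence and
    # collect each contiguous N-gram as a domain candidate word
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append("".join(tokens[i:i + n]))
    return candidates

# A segmented sentence, character-level here for simplicity
tokens = ["福", "娃", "真", "可", "爱"]
cands = extract_ngram_candidates(tokens, max_n=2)
```

Each candidate produced this way is then scored by the trained model (steps A4 to A6) rather than accepted directly.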
  • each group of domain candidate word samples includes domain candidate word positive samples and domain candidate word negative samples.
  • the domain candidate word positive samples come from the positive sample pool, and the domain candidate word negative samples come from the negative sample pool.
  • K is an integer greater than or equal to 1;
  • the K groups of domain candidate word sample features are obtained according to the K groups of domain candidate word samples, wherein the domain candidate word sample features are in one-to-one correspondence with the domain candidate word samples, and each domain candidate word sample feature includes the sample feature corresponding to the domain candidate word positive sample and the sample feature corresponding to the domain candidate word negative sample;
  • K groups of prediction results are obtained through the to-be-trained text score prediction model, wherein the prediction results are in one-to-one correspondence with the domain candidate word sample features, and each group of prediction results includes the prediction results corresponding to the domain candidate word samples;
  • the to-be-trained text score prediction model is trained according to the K groups of prediction results until the model training condition is met, and the text score prediction model is output;
  • obtaining the word quality score corresponding to the domain candidate word specifically includes the following step:
  • the word quality score corresponding to the domain candidate word is obtained through the text score prediction model.
  • a method for training a text score prediction model is introduced. Assuming that the text score prediction model to be trained is a decision tree model, then K is equal to 1; assuming that the text score prediction model to be trained is a random forest model, then K is equal to T, and K is an integer greater than 1.
  • taking the case where the text score estimation model to be trained is a to-be-trained random forest model as an example, each group of domain candidate word samples in the K groups is used to train one decision tree.
  • each group of domain candidate word samples includes domain candidate word positive samples and domain candidate word negative samples, and the number of domain candidate word positive samples may be equal to the number of domain candidate word negative samples.
  • the corresponding domain candidate word sample features are extracted to obtain K groups of domain candidate word sample features.
  • Each domain candidate word sample feature includes the domain candidate word sample feature corresponding to the domain candidate word positive sample and the domain candidate word sample feature corresponding to the domain candidate word negative sample.
  • FIG. 9 is a schematic diagram of a training framework of the random forest model in the embodiment of the present application.
  • taking the case where the text score estimation model to be trained is a random forest model as an example, that is, K equals T, the T groups of domain candidate word samples are divided into domain candidate word samples 1 to domain candidate word samples T, and then the domain candidate word sample features corresponding to each group of samples are obtained respectively, that is, domain candidate word sample features 1 to domain candidate word sample features T.
  • the sample features of each group of domain candidate words are input into the decision tree in the random forest model to be trained, and each decision tree is independently trained, and each decision tree outputs the corresponding prediction result.
  • T decision trees are output, and the random forest model is obtained.
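The bootstrap-and-vote training described above can be sketched as follows (Python; for brevity each "decision tree" is reduced to a one-level decision stump, and the feature vectors are hypothetical — this illustrates the ensemble idea, not the exact model of the embodiment):

```python
import random

def train_stump(samples):
    # samples: list of (features, label) pairs. Choose the single
    # (feature index, threshold) split that classifies the most samples
    # correctly; a one-level "tree" keeps the sketch short.
    best = None  # (num_correct, feature, threshold)
    for f in range(len(samples[0][0])):
        for xi, _ in samples:
            thr = xi[f]
            correct = sum((x[f] >= thr) == bool(y) for x, y in samples)
            if best is None or correct > best[0]:
                best = (correct, f, thr)
    _, f, thr = best
    return lambda x: 1 if x[f] >= thr else 0

def train_forest(samples, T=25, seed=0):
    # Each tree is trained independently on its own bootstrap sample
    rnd = random.Random(seed)
    return [train_stump([rnd.choice(samples) for _ in samples])
            for _ in range(T)]

def word_quality_score(forest, features):
    # One sub-score per tree ("score" = 1, "no score" = 0), then summed
    return sum(tree(features) for tree in forest)

# Hypothetical feature vectors: [word frequency, TFIDF value, right entropy]
pos = [([0.6, 0.8, 0.90], 1), ([0.7, 0.9, 0.80], 1), ([0.8, 0.7, 0.95], 1)]
neg = [([0.1, 0.2, 0.30], 0), ([0.2, 0.1, 0.25], 0), ([0.05, 0.3, 0.20], 0)]
forest = train_forest(pos + neg, T=25)
score = word_quality_score(forest, [0.75, 0.85, 0.9])
```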
  • the text score estimation model to be trained may be a random forest model to be trained, a decision tree model, or other types of models.
  • FIG. 10 is a schematic flowchart of a training text score estimation model in the embodiment of the present application, as shown in the figure, specifically:
  • step B1 a statement is obtained from the domain corpus, wherein the statement may include S statements;
  • step B2 word segmentation is performed on the obtained statement to obtain a corresponding text sequence, wherein the text sequence may include S text sequences;
  • step B3 N-Gram is used to extract the domain candidate word for model training from the text sequence (that is, obtain the domain candidate word sample to be trained);
  • step B4 calculate the domain candidate word feature corresponding to the domain candidate word sample to be trained
  • step B5 use the general vocabulary database to classify candidate word samples in the field to be trained
  • step B6 if the candidate word sample of the field to be trained hits the general vocabulary database, the candidate word sample of the field to be trained is added to the positive sample pool;
  • step B7 if the candidate word sample of the field to be trained does not hit the general vocabulary database, the candidate word sample of the field to be trained is added to the negative sample pool;
  • step B8 the domain candidate words stored in the positive sample pool are used as the domain candidate word positive samples, and the domain candidate words stored in the negative sample pool are used as the domain candidate word negative samples; the domain candidate word positive samples and negative samples are jointly used for training to obtain a text score prediction model, for example, a random forest model is obtained by training.
  • a method for training a text score estimation model is provided.
  • positive and negative samples can be constructed using the accumulated general vocabulary and the domain corpus, and then a supervised machine learning text score prediction model can be trained to predict the word quality scores of the domain candidate words.
  • the selected text score prediction model can maximize the use of all the features of the domain candidate words and can adapt to domain candidate word positive and negative samples that are not entirely accurate; the above effects can be achieved by comprehensively considering a random forest model for learning.
  • if the to-be-trained domain candidate word sample hits the general vocabulary database, it is determined that the sample belongs to the domain candidate word positive samples, and the sample is added to the positive sample pool;
  • if the to-be-trained domain candidate word sample does not hit the general vocabulary database, it is determined that the sample belongs to the domain candidate word negative samples, and the sample is added to the negative sample pool.
  • a method for adding domain candidate word samples to the positive sample pool and the negative sample pool is introduced. Similar to the content introduced in the previous embodiment, in the process of training the text score estimation model, the sentence is obtained from the domain corpus, and the sentence here refers to one or more sentences, for example, S sentences (S is an integer greater than or equal to 1). Then, the sentences are segmented to obtain text sequences, and domain candidate word samples are extracted from the text sequences. It should be noted that the sentences used for training and the sentences used for prediction may be exactly the same, partially the same, or completely different, which is not limited here.
  • a domain candidate word sample is used as an example for introduction; other domain candidate word samples are added to the positive sample pool or the negative sample pool in a similar manner, which will not be repeated here.
  • the extracted domain candidate word samples need to be compared with the general word database. If a domain candidate word sample appears in the general word database, it is considered to be a high-quality word, and the sample that hits the general word database is added to the positive sample pool, that is, it is determined that the sample belongs to the domain candidate word positive samples.
  • the domain candidate word samples that do not hit the general word database are added to the negative sample pool, that is, it is determined that they belong to the domain candidate word negative samples. It is foreseeable that the number of domain candidate word negative samples stored in the negative sample pool is much larger than the number of domain candidate word positive samples stored in the positive sample pool.
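The division into positive and negative sample pools can be sketched as follows (Python; the general vocabulary and candidate words are hypothetical):

```python
def build_sample_pools(candidate_samples, general_vocabulary):
    # A candidate word sample that hits the general word database goes to
    # the positive pool; everything else goes to the negative pool.
    positive_pool, negative_pool = [], []
    for word in candidate_samples:
        if word in general_vocabulary:
            positive_pool.append(word)
        else:
            negative_pool.append(word)
    return positive_pool, negative_pool

# Hypothetical general vocabulary and extracted candidate word samples
general_vocabulary = {"computer", "network", "game"}
candidates = ["computer", "fuwa", "network", "zzxy", "eatchicken"]
pos_pool, neg_pool = build_sample_pools(candidates, general_vocabulary)
```

In practice most extracted N-grams miss the general vocabulary, which is why the negative pool ends up much larger than the positive pool.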
  • a method for adding domain candidate word samples to the positive sample pool and the negative sample pool is provided.
  • the domain candidate word samples can be more accurately divided into the positive sample pool or the negative sample pool by using the general word library, so as to facilitate subsequent training and improve the accuracy of training.
  • in addition, matching based on the general word database saves the process of manually dividing positive and negative samples and improves training efficiency.
  • FIG. 11 is an overall schematic flowchart of the text mining method in the embodiment of the present application. As shown in the figure, specifically:
  • step C1 a statement is obtained from the domain corpus, wherein the statement may include S statements;
  • step C2 word segmentation is performed on the acquired statement to obtain a corresponding text sequence, wherein the text sequence may include S text sequences;
  • step C3 N-Gram is used to extract the domain candidate word for model training from the text sequence (that is, obtain the domain candidate word sample to be trained);
  • step C4 calculate the domain candidate word feature corresponding to the domain candidate word sample to be trained
  • step C5 use the general vocabulary database to classify candidate word samples in the field to be trained
  • step C6 if the domain candidate word used for training hits the general word library, the domain candidate word sample to be trained is added to the positive sample pool;
  • step C7 if the candidate word sample of the field to be trained does not hit the general vocabulary database, the candidate word sample of the field to be trained is added to the negative sample pool;
  • step C8 the domain candidate words stored in the positive sample pool are used as the domain candidate word positive samples, and the domain candidate words stored in the negative sample pool are used as the domain candidate word negative samples; the domain candidate word positive samples and negative samples are jointly used for training to obtain a text score prediction model, for example, a random forest model is obtained by training;
  • step C9 N-Gram is used to extract domain candidate words from the text sequence
  • step C10 the domain candidate word feature of the domain candidate word is calculated, and then the domain candidate word feature is input into the trained text score prediction model (such as a random forest model) for prediction, and the text score prediction model (such as a random forest model) outputs the word quality score;
  • step C11 it is judged whether the word quality score of the domain candidate word is greater than or equal to the quality score threshold; if the word quality score is greater than or equal to the quality score threshold, step C12 is executed; if the word quality score is less than the quality score threshold, step C14 is executed;
  • step C12 it is judged whether the word frequency of the domain candidate word is greater than or equal to the first word frequency threshold; if the word frequency of the domain candidate word is greater than or equal to the first word frequency threshold, step C15 is executed; if the word frequency of the domain candidate word is less than the first word frequency threshold, step C14 is executed;
  • step C13 obtain the field seed word from the field seed vocabulary
  • step C14 it is determined that the field candidate words are meaningless words
  • step C15 it is determined that the field candidate word is a new word
  • step C16 new words are used to search for associated text
  • step C17: based on the retrieved associated text, the average word frequency (or maximum word frequency) of the domain seed words is calculated;
  • step C18: it is judged whether the average word frequency (or maximum word frequency) of a domain seed word is greater than or equal to the second word frequency threshold; if so, step C20 is executed; if it is less than the second word frequency threshold, step C19 is executed;
  • step C19: it is determined that the new word is not a domain new word;
  • step C20: it is determined that the new word is a domain new word.
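The branching in steps C9–C20 can be sketched as a single control flow. The sketch below is a minimal illustration under stated assumptions: `score_fn` stands in for the trained text score prediction model, `search_fn` for the search engine lookup, and `freq_fn` for the seed word frequency statistic; none of these helper names come from the patent.

```python
def mine_domain_new_words(candidates, seed_words, score_fn, search_fn, freq_fn,
                          quality_threshold, freq_threshold_1, freq_threshold_2):
    """Control flow of steps C9-C20: score each candidate, keep high-quality,
    high-frequency candidates as new words, then confirm domain seed words
    against the associated text retrieved for each new word."""
    domain_new_words = set()
    for cand, word_freq in candidates:            # output of steps C9/C10
        if score_fn(cand) < quality_threshold:    # step C11
            continue                              # step C14: meaningless word
        if word_freq < freq_threshold_1:          # step C12
            continue
        # step C15: cand is a new word; steps C16-C18 check the seed words
        associated_text = search_fn(cand)         # step C16
        for seed in seed_words:                   # step C13
            if freq_fn(seed, associated_text) >= freq_threshold_2:  # step C18
                domain_new_words.add(seed)        # step C20
    return domain_new_words
```

The per-candidate thresholds mirror steps C11 and C12; the per-seed-word frequency check mirrors steps C17 and C18.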
  • the text mining device in the present application will be described in detail below. Please refer to FIG. 12 , which is a schematic diagram of an embodiment of the text mining device in the embodiment of the present application.
  • the text mining device 20 includes:
  • An obtaining module 201 is used to obtain the domain candidate word feature corresponding to the domain candidate word;
  • the obtaining module 201 is further configured to obtain the word quality score corresponding to the domain candidate word according to the feature of the domain candidate word;
  • a determination module 202 configured to determine a new word from the domain candidate word according to the word quality score corresponding to the domain candidate word;
  • the obtaining module 201 is also used to obtain the associated text according to the new word;
  • the determining module 202 is further configured to determine that the domain seed word is a domain new word if the domain seed word determined according to the associated text satisfies the domain new word mining condition.
  • an artificial-intelligence-based text mining device, which first obtains the domain candidate word features corresponding to the domain candidate words, then obtains the word quality scores corresponding to the domain candidate words according to those features, determines new words according to the word quality scores, and obtains associated text according to the new words; if it is determined according to the associated text that a domain seed word satisfies the domain new word mining conditions, the domain seed word is determined to be a domain new word.
  • new words can be screened out from the domain candidate words based on machine learning algorithms, avoiding the process of manually setting a large number of feature thresholds, thereby reducing labor costs; the method can therefore adapt well to the specialized new words that emerge rapidly in the Internet age.
  • an acquisition module 201, specifically configured to acquire sentences from a domain corpus;
  • a text mining device based on artificial intelligence is provided.
  • the above device obtains sentences from a domain corpus and performs word segmentation on them to obtain the relevant domain candidate words, and further extracts the domain candidate word features corresponding to each domain candidate word, thereby improving the feasibility and operability of the solution.
  • the obtaining module 201 is specifically configured to obtain the domain candidate words corresponding to the text sequence according to a word number sampling threshold and a character number sampling threshold, wherein the word number sampling threshold represents the upper limit of the number of words in a domain candidate word, and the character number sampling threshold represents the upper limit of the number of characters in a domain candidate word.
  • a text mining device based on artificial intelligence is provided.
  • the N-Gram algorithm can be used to evaluate whether a sentence is reasonable or not, and can also be used to evaluate the degree of difference between two character strings.
  • the N-gram algorithm contains all the information that the previous words can provide. These words have a strong binding force on the appearance of the current word, which is conducive to extracting more accurate and richer domain candidate words.
  • the acquisition module 201 is specifically configured to acquire the domain candidate word features corresponding to the domain candidate words according to the text sequence, wherein the domain candidate word features include at least one of word frequency, term frequency–inverse document frequency (TFIDF) value, degree of freedom, degree of solidification, left information entropy, and right information entropy.
  • a text mining device based on artificial intelligence is provided.
  • the features of the domain candidate words can be quantified: relevant features are extracted along the dimensions of word weight, word position in the document, and word-related information, and together constitute the domain candidate word features.
  • the domain candidate word features can express the characteristics of the domain candidate words well and help to obtain more accurate evaluation results for the domain candidate words.
  • the text score estimation model is a random forest model, wherein the random forest model includes T decision trees, and T is an integer greater than 1;
  • the obtaining module 201 is specifically configured to obtain, based on the domain candidate word features, the sub-scores corresponding to the features through the decision trees included in the random forest model;
  • the word quality score corresponding to the domain candidate word is then obtained according to these sub-scores.
  • a text mining device based on artificial intelligence is provided.
  • the word quality score predicted by the random forest model has a high accuracy rate, and the importance of the features can be effectively evaluated through multiple decision trees.
  • the determining module 202 is specifically configured to determine that the domain candidate word is a new word if the word quality score corresponding to the domain candidate word is greater than or equal to the quality score threshold;
  • if the word quality score corresponding to the domain candidate word is less than the quality score threshold, it is determined that the domain candidate word is not a new word.
  • a text mining device based on artificial intelligence is provided.
  • a domain candidate word with a higher word quality score is used as a new word, which ensures to a certain extent that the new word is of higher quality and can serve as a candidate for domain new words, thereby improving the reliability and accuracy of the selected new words.
  • the determining module 202 is specifically configured to obtain the word frequency corresponding to the domain candidate word;
  • if the word quality score corresponding to the domain candidate word is greater than or equal to the quality score threshold, and the word frequency corresponding to the domain candidate word is greater than or equal to the first word frequency threshold, it is determined that the domain candidate word is a new word;
  • if the word quality score corresponding to the domain candidate word is less than the quality score threshold, or the word frequency corresponding to the domain candidate word is less than the first word frequency threshold, it is determined that the domain candidate word is not a new word.
  • a text mining device based on artificial intelligence is provided.
  • a domain candidate word with a higher word quality score is used as a new word, which ensures to a certain extent that the new word is of higher quality and can serve as a candidate for domain new words, thereby improving the reliability and accuracy of the selected new words.
  • selecting words with a higher word frequency as new words further ensures, to a certain extent, that the new words are widely spread and better match the definition of a new word.
  • the obtaining module 201 is specifically configured to obtain the search feedback result corresponding to the new word through the search engine, wherein the search feedback result includes at least one search result;
  • the first R search results with the highest relevance are determined from at least one search result as the associated text corresponding to the new word, where R is an integer greater than or equal to 1.
  • a text mining device based on artificial intelligence is provided.
  • the search feedback result of the search engine is used as a criterion for evaluating how often a new word is used, which is closer to the actual usage of the new word and helps to find associated texts that are relevant to the domain new word.
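A minimal sketch of the top-R selection described above, under stated assumptions: the `(text, relevance)` pairs are assumed to come from an unspecified search engine API, and the function name is hypothetical.

```python
def associated_text(search_results, r):
    """Keep the R search results with the highest relevance as the associated
    text for a new word. search_results is a list of (text, relevance) pairs,
    as they might be returned by a search engine lookup."""
    ranked = sorted(search_results, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:r]]
```

For example, with R = 2 the two most relevant results are retained and the rest discarded.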
  • the obtaining module 201 is further configured to obtain the domain seed word after obtaining the associated text according to the new word;
  • the determining module 202 is further configured to determine the average word frequency of the field seed words according to the associated text;
  • the determining module 202 is further configured to determine that the domain seed word satisfies the domain new word mining condition if the average word frequency is greater than or equal to the second word frequency threshold.
  • a text mining device based on artificial intelligence is provided. Using the above device, if the average word frequency reaches the word frequency threshold, the domain seed word is considered to be used frequently and can be determined to be a domain new word, improving the feasibility of the solution.
  • the obtaining module 201 is further configured to obtain the domain seed word after obtaining the associated text according to the new word;
  • the determining module 202 is further configured to determine the maximum word frequency of the field seed word according to the associated text;
  • the determining module 202 is further configured to determine that the domain seed word satisfies the domain new word mining condition if the maximum word frequency is greater than or equal to the second word frequency threshold.
  • a text mining device based on artificial intelligence is provided. Using the above device, if the maximum word frequency reaches the word frequency threshold, the domain seed word is considered to be used frequently and can be determined to be a domain new word, improving the feasibility of the solution.
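The maximum (or average) word frequency check described above can be sketched as follows; `satisfies_mining_condition` is a hypothetical name, and whitespace tokenization of the associated texts is a simplifying assumption.

```python
def satisfies_mining_condition(seed_word, associated_texts, freq_threshold,
                               use_max=True):
    """Compute the seed word's frequency in each associated text and compare
    either the maximum or the average frequency against the second word
    frequency threshold."""
    freqs = []
    for text in associated_texts:
        words = text.split()  # simplifying assumption: whitespace-delimited
        freqs.append(words.count(seed_word) / len(words) if words else 0.0)
    stat = max(freqs) if use_max else sum(freqs) / len(freqs)
    return stat >= freq_threshold
```

Passing `use_max=False` switches from the maximum word frequency variant to the average word frequency variant of the mining condition.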
  • the text mining apparatus 20 further includes a training module 203;
  • the obtaining module 201 is further configured to obtain K groups of domain candidate word samples, wherein each group of domain candidate word samples includes a domain candidate word positive sample and a domain candidate word negative sample, the domain candidate word positive samples come from a positive sample pool, the domain candidate word negative samples come from a negative sample pool, and K is an integer greater than or equal to 1;
  • the obtaining module 201 is further configured to obtain K groups of domain candidate word sample features according to the K groups of domain candidate word samples, wherein the domain candidate word sample features are in one-to-one correspondence with the domain candidate word samples, and each group of sample features includes the features corresponding to the domain candidate word positive sample and the features corresponding to the domain candidate word negative sample;
  • the acquisition module 201 is further configured to obtain K groups of prediction results through the to-be-trained text score estimation model based on the K groups of domain candidate word sample features, wherein the prediction results are in one-to-one correspondence with the domain candidate word sample features, and each group of prediction results includes the predicted label of the domain candidate word positive sample and the predicted label of the domain candidate word negative sample;
  • the training module 203 is used to train the text score prediction model to be trained according to the K groups of prediction results and the K groups of domain candidate word samples, until the model training conditions are met, and output the text score prediction model;
  • the obtaining module 201 is specifically configured to obtain the word quality score corresponding to the domain candidate word through the text score estimation model based on the feature of the domain candidate word.
  • a text mining device based on artificial intelligence is provided.
  • positive and negative samples can be constructed using the accumulated general vocabulary and the domain corpus, and a supervised text score prediction model can then be trained to predict the word quality scores of the domain candidate words.
  • the selected text score prediction model should make maximum use of all the features of the domain candidate words and tolerate the not-entirely-accurate domain candidate word positive and negative samples; considering these requirements comprehensively, learning with a random forest model achieves this effect.
  • the text mining apparatus 20 further includes a processing module 204;
  • the obtaining module 201 is further configured to obtain sentences from the domain corpus before obtaining the K groups of domain candidate word samples;
  • the processing module 204 is used to perform word segmentation processing on the sentence to obtain a text sequence
  • the obtaining module 201 is further configured to obtain candidate word samples in the field to be trained according to the text sequence;
  • the determining module 202 is further configured to determine that the candidate word sample of the field to be trained belongs to the positive sample of the field candidate word if the candidate word sample of the field to be trained hits the general vocabulary database, and add the candidate word sample of the field to be trained to the positive sample pool;
  • the determining module 202 is further configured to determine that the domain candidate word sample to be trained belongs to the domain candidate word negative sample if the to-be-trained domain candidate word sample does not hit the common word database, and add the to-be-trained domain candidate word sample to the negative sample pool.
  • a text mining device based on artificial intelligence is provided.
  • the domain candidate word samples can be divided more accurately into the positive sample pool or the negative sample pool by using the general vocabulary database, which facilitates subsequent training and helps to improve its accuracy.
  • matching against the general vocabulary database saves the process of manually dividing positive and negative samples and improves training efficiency.
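The pool construction described above can be sketched as follows, assuming the general vocabulary is available as a set of words; the function name is hypothetical.

```python
def build_sample_pools(candidate_samples, general_vocabulary):
    """Split domain candidate word samples: samples that hit the general
    vocabulary go to the positive sample pool, the rest to the negative
    sample pool."""
    positive_pool, negative_pool = [], []
    for sample in candidate_samples:
        if sample in general_vocabulary:
            positive_pool.append(sample)   # hits the general vocabulary
        else:
            negative_pool.append(sample)   # miss: treated as a negative sample
    return positive_pool, negative_pool
```

Note that, as the text states, the negative pool may still contain a few true words, which is why the chosen model must tolerate imperfect labels.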
  • FIG. 13 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 300 may vary greatly with configuration or performance, and may include one or more central processing units (CPUs) 322 (for example, one or more processors) and memory 332, and one or more storage media 330 (for example, one or more mass storage devices) that store applications 342 or data 344.
  • the memory 332 and the storage medium 330 may be short-term storage or persistent storage.
  • the program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processing unit 322 may be configured to communicate with the storage medium 330 to execute a series of instruction operations in the storage medium 330 on the server 300 .
  • Server 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input and output interfaces 358, and/or, one or more operating systems 341, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM and so on.
  • the steps performed by the server in the above embodiment may be based on the server structure shown in FIG. 13 .
  • the embodiments of the present application also provide another text mining apparatus, and the text mining apparatus can be deployed on a terminal device.
  • a terminal device can be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a sales terminal device (Point of Sales, POS), a vehicle-mounted computer, etc.
  • take the case where the terminal device is a mobile phone as an example:
  • FIG. 14 is a block diagram showing a partial structure of a mobile phone related to a terminal device provided by an embodiment of the present application.
  • the mobile phone includes: a radio frequency (RF) circuit 410, a memory 420, an input unit 430, a display unit 440, a sensor 450, an audio circuit 460, a wireless fidelity (WiFi) module 470, a processor 480, a power supply 490, and other components.
  • the input unit 430 may be used for receiving inputted numerical or character information, and generating key signal input related to user setting and function control of the mobile phone.
  • the input unit 430 may include a touch panel 431 and other input devices 432 .
  • the display unit 440 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
  • the display unit 440 may include a display panel 441 .
  • the audio circuit 460, the speaker 461, and the microphone 462 can provide an audio interface between the user and the mobile phone.
  • the processor 480 included in the terminal device also has the following functions: obtaining the domain candidate word features corresponding to the domain candidate words; obtaining the word quality scores corresponding to the domain candidate words according to those features; determining new words according to the word quality scores; obtaining associated text according to the new words; and, if a domain seed word determined according to the associated text satisfies the domain new word mining condition, determining the domain seed word to be a domain new word.
  • processor 480 is further configured to execute the methods described in the foregoing embodiments.
  • Embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when it runs on a computer, causes the computer to execute the methods described in the foregoing embodiments.
  • the embodiments of the present application also provide a computer program product including a program, which, when run on a computer, causes the computer to execute the methods described in the foregoing embodiments.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling, direct coupling, or communication connection may be implemented through some interfaces; the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes over the related technology, or all or part of the technical solution, can be embodied in the form of a software product, and the computer software product is stored in a storage medium.
  • a computer device which may be a personal computer, a server, or a network device, etc.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

This application discloses an artificial-intelligence-based text mining method, which may relate to the field of big data. The method includes: obtaining domain candidate word features corresponding to domain candidate words; obtaining, according to the domain candidate word features, word quality scores corresponding to the domain candidate words; determining new words according to the word quality scores corresponding to the domain candidate words; obtaining associated text according to the new words; and, if it is determined according to the associated text that a domain seed word satisfies a domain new-word mining condition, determining the domain seed word to be a domain new word. This application can automatically screen out new words from domain candidate words based on machine learning algorithms, avoiding the process of manually setting a large number of feature thresholds, thereby reducing labor costs and adapting well to the specialized new words that emerge rapidly in the Internet age.

Description

An artificial-intelligence-based text mining method, related apparatus, and device
This application claims priority to Chinese patent application No. 202011001027.4, entitled "An artificial-intelligence-based text mining method, related apparatus, and device" and filed with the China National Intellectual Property Administration on September 22, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of natural language processing (NLP) and the field of big data processing, and in particular to text mining.
Background
In NLP research, words have long been an important object of study. In Chinese, there are no explicit separator characters between words, and word segmentation relies mainly on existing lexicons and statistical rules. As society and social networks develop, people's habits of using language change, and new words emerge constantly. New word discovery has therefore become an important task in NLP.
At present, new words can be discovered based on statistical methods. Such a method first obtains candidate words and then derives a word-formation probability from the statistical feature values of the candidates. In practice, solidification and degree of freedom are usually combined as the statistical features of the candidates; that is, candidates whose solidification and degree of freedom exceed certain feature thresholds are selected as new words.
Summary
The embodiments of this application provide an artificial-intelligence-based text mining method, related apparatus, and device, which can screen out new words from domain candidate words using machine learning algorithms, avoiding the process of manually setting a large number of feature thresholds, thereby reducing labor costs and adapting well to the specialized new words that emerge rapidly in the Internet age.
In view of this, one aspect of this application provides an artificial-intelligence-based text mining method, including:
obtaining domain candidate word features corresponding to domain candidate words;
obtaining, according to the domain candidate word features, word quality scores corresponding to the domain candidate words;
determining new words according to the word quality scores corresponding to the domain candidate words;
obtaining associated text according to the new words;
if it is determined according to the associated text that a domain seed word satisfies a domain new-word mining condition, determining the domain seed word to be a domain new word.
Another aspect of this application provides a text mining apparatus, including:
an obtaining module, configured to obtain domain candidate word features corresponding to domain candidate words;
the obtaining module being further configured to obtain, according to the domain candidate word features, word quality scores corresponding to the domain candidate words;
a determining module, configured to determine new words according to the word quality scores corresponding to the domain candidate words;
the obtaining module being further configured to obtain associated text according to the new words;
the determining module being further configured to determine, if it is determined according to the associated text that a domain seed word satisfies a domain new-word mining condition, that the domain seed word is a domain new word.
Another aspect of this application provides a computer device, including a memory, a transceiver, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, including performing the methods of the above aspects;
the bus system is configured to connect the memory and the processor so that the memory and the processor communicate.
Another aspect of this application provides a computer-readable storage medium storing a computer program, the computer program being configured to perform the methods of the above aspects.
In another aspect of this application, a computer program product or computer program is provided, including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the methods provided in the above aspects.
In yet another aspect of this application, a computer program product including instructions is provided, which, when run on a computer, causes the computer to perform the methods provided in the above aspects.
As can be seen from the above technical solutions, the embodiments of this application have the following advantages:
The embodiments of this application provide an artificial-intelligence-based text mining method. The method first obtains the domain candidate word features corresponding to domain candidate words, then obtains the word quality scores corresponding to the domain candidate words according to those features, determines new words according to the word quality scores, and obtains associated text according to the new words. If it is determined according to the associated text that a domain seed word satisfies the domain new-word mining condition, the domain seed word is determined to be a domain new word. In this way, new words can be screened out from domain candidate words based on machine learning algorithms, avoiding the process of manually setting a large number of feature thresholds, thereby reducing labor costs and adapting well to the specialized new words that emerge rapidly in the Internet age.
Brief Description of the Drawings
FIG. 1 is a schematic architecture diagram of a text mining system in an embodiment of this application;
FIG. 2 is a schematic diagram of generating a domain corpus based on big data in an embodiment of this application;
FIG. 3 is a schematic diagram of an embodiment of an artificial-intelligence-based text mining method in an embodiment of this application;
FIG. 4 is a schematic structural diagram of generating a sub-score based on a decision tree in an embodiment of this application;
FIG. 5 is a schematic diagram of generating a word quality score based on a random forest model in an embodiment of this application;
FIG. 6 is a schematic interface diagram of search feedback results displayed by a search engine in an embodiment of this application;
FIG. 7 is a schematic interface diagram of manually entering domain seed words in an embodiment of this application;
FIG. 8 is a schematic flowchart of mining domain new words in an embodiment of this application;
FIG. 9 is a schematic diagram of a training framework of a random forest model in an embodiment of this application;
FIG. 10 is a schematic flowchart of training a text score estimation model in an embodiment of this application;
FIG. 11 is a schematic overall flowchart of a text mining method in an embodiment of this application;
FIG. 12 is a schematic diagram of an embodiment of a text mining apparatus in an embodiment of this application;
FIG. 13 is a schematic structural diagram of a server in an embodiment of this application;
FIG. 14 is a schematic structural diagram of a terminal device in an embodiment of this application.
Detailed Description
The embodiments of this application provide an artificial-intelligence-based text mining method, related apparatus, and device, which can screen out new words from domain candidate words using machine learning algorithms, avoiding the process of manually setting a large number of feature thresholds, thereby reducing labor costs and adapting well to the specialized new words that emerge rapidly in the Internet age.
With the rise of social network media and platforms of various forms such as Weibo, each day's hot events are more focused, and the text they contain has gradually become more colloquial. As a result, a large number of new words that never appeared before have been produced; some are entirely new character combinations, and some are old words that have taken on new meanings. New word discovery has therefore become an important task in NLP. Discovering these new words promptly and accurately is of great significance for tracking real-time hot topics and improving word segmentation and indexing. Based on this, this application provides an artificial-intelligence-based text mining method for discovering new words, which can further discover domain new words.
It should be understood that the text mining method provided in this application is applied in the field of artificial intelligence (AI), and specifically involves natural language processing technology and machine learning (ML).
To mine new words and domain new words, this application proposes an artificial-intelligence-based text mining method applied to the text mining system shown in FIG. 1. As shown in the figure, the text mining system includes a server and a terminal device; a client is deployed on the terminal device, and a text mining platform is deployed on the server acting as the text mining device.
It should be noted that the server involved in this application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms. The terminal device may be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a palmtop computer, a personal computer, a smart TV, or a smart watch. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application, nor are the numbers of servers and terminal devices.
Further, please refer to FIG. 2, which is a schematic diagram of generating a domain corpus based on big data in an embodiment of this application. As shown in the figure, the text mining platform accesses data in a data platform to obtain files, where a file may be a web file that a user has visited through a browser, or a web file continuously crawled from various websites using web crawler technology. The collected files are sorted by collection time, the content of each file is parsed, and qualifying text is extracted and added to the domain corpus. Word segmentation, domain candidate word extraction, sentiment analysis, and other processing can also be performed on the text in the domain corpus, thereby enabling operations such as new word discovery, content matching, and lexicon matching. The discovered new words can be added to lexicons, including but not limited to industry lexicons, sentiment lexicons, and spam lexicons. Based on the results of new word discovery, topic statistics, hot word statistics, sentiment analysis, content classification, and so on can also be performed, finally realizing the application of the data.
The data platform can provide big data, which is a branch of cloud technology. Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software, and networks within a wide area network or local area network to realize the computation, storage, processing, and sharing of data.
Before introducing the artificial-intelligence-based text mining method provided in this application, some techniques and related terms involved in this application are briefly introduced:
1. Domain new word: a specialized or commonly used word applied mainly within a certain domain. For example, "王者荣耀" (Honor of Kings) and "吃鸡" ("chicken dinner") are new words in the game domain. As another example, if a company releases a new game named "救地鼠" that did not exist before, then "救地鼠" is a domain new word.
2. Domain seed word: a word that frequently appears in domain text and can, to a certain extent, represent the meaning of the domain. For example, for text in the mobile game domain, domain seed words may be "手机" (mobile phone), "游戏" (game), "手游" (mobile game), "游戏应用" (game application), and so on.
3. Word segmentation: the process of recombining a continuous character sequence into a word sequence according to certain rules. Existing word segmentation algorithms fall into three broad classes: string-matching-based methods, understanding-based methods, and statistics-based methods. Depending on whether they are combined with part-of-speech tagging, they can also be divided into pure segmentation methods and integrated methods that combine segmentation and tagging.
4. Distant supervision: in this application, using the lexicon of a general domain or of one domain to guide new word mining and discovery in other domains.
5. Language model (N-Gram): an algorithm based on statistical language models. Its basic idea is to slide a window of size N over the content of a text, character by character or word by word, forming character sequences of length N. In this application, the N-Gram algorithm can be used to generate domain candidate words.
6. Random forest: an ensemble learning algorithm composed of decision trees. Each decision tree makes an independent prediction about an event, and the final result is determined by weighting all the predictions.
7. Positive sample pool: composed of domain candidate word positive samples. During model training, the training data draws domain candidate word positive samples from this pool.
8. Negative sample pool: composed of a large number of domain candidate word negative samples and very few possible domain candidate word positive samples. During model training, the training data draws domain candidate word negative samples from this pool.
9. Solidification: indicates how tightly the characters making up a domain candidate word are bound together, generally measured by the posterior probability of the fixed character collocation. For example, consider how often "怕" and "上火" are used together: if "怕" is only ever used together with "上火", their solidification is high, meaning they form one word. Computing solidification first requires the probabilities P("怕上火"), P("怕"), and P("上火"), which are occurrence probabilities among the domain candidate words. Solidification("怕" and "上火") = P("怕上火") / (P("怕") × P("上火")). If "怕上火" is the only combination, the probabilities P("怕上火"), P("怕"), and P("上火") should be the same, and the solidification equals 1. If, besides "怕上火", there are also combinations such as "怕蟑螂", the solidification decreases.
10. Degree of freedom: indicates the extent to which a domain candidate word can be used independently and freely, generally measured by the left and right information entropy of the word. For example, "巧克" inside "巧克力" (chocolate) has a solidification as high as that of "巧克力" itself, but the extent to which it can be used freely is almost zero, so "巧克" cannot form a word on its own.
11. Term frequency (TF): the frequency with which a given domain candidate word appears in the text, that is, the ratio of the total number of occurrences of the domain candidate word in the text to the total number of domain candidate words contained in the text.
12. Inverse document frequency (IDF): a measure of the importance of a given domain candidate word. First compute the ratio of the total number of sentence entries to the number of sentence entries containing the domain candidate word, then take the base-10 logarithm of that ratio to obtain the inverse document frequency.
13. Term frequency–inverse document frequency (TFIDF) value: a common weighting technique for information retrieval and data mining, equal to the product of the term frequency (TF) and the inverse document frequency (IDF). The TFIDF value can be used to evaluate how important a word is to a document or to a set of domain documents in a corpus.
14. Left information entropy: measures the richness of the collocations to the left of a domain candidate word, computed with the entropy formula below, where x ranges over all possible left-side collocations (a random variable). The possible left-side collocations are all the words that have appeared immediately to the left of the domain candidate word in the content being analyzed. For example, given "你好，小白兔", "哈哈，你好，小白兔", and "在干什么，小白兔", all possible collocations to the left of "小白兔" are "你好" and "在干什么". The information entropy is computed as:
H(x) = -∑_{i=1}^{n} p(x_i) log p(x_i)
where H(x) is the information entropy of the random variable x, p(x_i) is the probability of the i-th random event, and n is the total number of random events.
15. Right information entropy: measures the richness of the collocations to the right of a domain candidate word, computed with the same entropy formula, where x ranges over all possible right-side collocations (a random variable). The possible right-side collocations are all the words that have appeared immediately to the right of the domain candidate word in the content being analyzed.
With the above introduction in mind, the artificial-intelligence-based text mining method in this application will now be described. Please refer to FIG. 3. This embodiment may be performed by a text mining device. One embodiment of the artificial-intelligence text mining method in the embodiments of this application includes:
101. Obtain the domain candidate word features corresponding to domain candidate words.
In this embodiment, the text mining device first obtains a large number of sentences (for example, one hundred thousand sentences) from the domain corpus and then segments each sentence to obtain domain candidate words. The domain candidate words here refer to one or more domain candidate words, for example, P domain candidate words (P being an integer greater than or equal to 1). The domain candidate words do not repeat, and one corresponding domain candidate word feature can be extracted for each domain candidate word.
It should be noted that the text mining device may be a server or a terminal device, which is not limited in this application. It should be understood that a domain candidate word involved in this application may be a single word or a set including at least two words.
102. Obtain, according to the domain candidate word features, the word quality scores corresponding to the domain candidate words.
In this embodiment, the text mining device uses the domain candidate word features as the input of a text score estimation model, and the model outputs the word quality scores corresponding to the features; that is, the word quality scores also correspond to the domain candidate words. A higher word quality score indicates a higher likelihood that the domain candidate word is a high-quality word, where a high-quality word is one with reasonable semantics. For example, "巧克力" (chocolate) is a high-quality word, while "克力" has no complete and reasonable semantics and is therefore not a high-quality word.
103. Determine new words from the domain candidate words according to their word quality scores.
In this embodiment, the text mining device can screen new words out of the domain candidate words according to their word quality scores. The new words here refer to one or more new words, for example, Q new words (Q being an integer greater than or equal to 1). For example, if the word quality score of the domain candidate word "打王者" reaches the quality score threshold, it can be taken as a new word. In addition, to ensure that a new word has a certain degree of currency, it is also necessary to judge whether the domain candidate word appears frequently enough. If the occurrence frequency of the domain candidate word reaches a threshold, the word is considered to have reached a certain degree of currency, so it is determined to be a new word. Conversely, if the occurrence frequency does not reach the threshold, the domain candidate word may not be a commonly used word, and it is determined not to be a new word.
104. Obtain associated text according to the new words.
In this embodiment, the text mining device crawls related associated text from a search engine according to the new words. It should be noted that the associated text can be understood as one group of text or as a set including at least two groups of text, and one associated text can be crawled for each new word. The associated text can take the form of documents, each recording multiple sentences.
105. If it is determined according to the associated text that a domain seed word satisfies the domain new-word mining condition, determine the domain seed word to be a domain new word.
In this embodiment, the text mining device also needs to obtain domain seed words from the domain seed lexicon and then compute the occurrence probability of each domain seed word in the associated text. If the occurrence probability reaches a threshold, the domain new-word mining condition is satisfied, and the domain seed word can be marked as a domain new word; conversely, if the occurrence probability does not reach the threshold, the condition is not satisfied, and the domain seed word is considered not to be a domain new word.
On this basis, suppose the domain seed lexicon contains 5000 domain seed words. The occurrence probability of each of these 5000 domain seed words in the associated text can be computed separately to judge whether each satisfies the domain new-word mining condition; any seed word that does can be determined to be a domain new word.
The text mining method provided in this application can be applied to new word discovery on short texts such as social network group names. In tests, the accuracy of the top 100 new words reached 92.7%, and the accuracy of the corresponding domain new words reached 82.4%; the overall new word accuracy reached 84.5%, and the overall domain new word accuracy reached 67.2%. It can be seen that the artificial-intelligence-based text mining method provided in this application can mine domain new words better.
The embodiments of this application provide an artificial-intelligence-based text mining method: first obtain the domain candidate word features corresponding to domain candidate words, then obtain the word quality scores corresponding to the domain candidate words according to those features, determine new words according to the word quality scores, and obtain associated text according to the new words. If it is determined according to the associated text that a domain seed word satisfies the domain new-word mining condition, the domain seed word is determined to be a domain new word. In this way, new words can be screened out from domain candidate words based on machine learning algorithms, avoiding the process of manually setting a large number of feature thresholds, thereby reducing labor costs and adapting well to the specialized new words that emerge rapidly in the Internet age.
Optionally, on the basis of the embodiments corresponding to FIG. 3 above, in an optional embodiment of the text mining method provided by the embodiments of this application, before obtaining the domain candidate word features corresponding to the domain candidate words, the method further includes the following steps:
obtaining sentences from the domain corpus;
performing word segmentation on each of the sentences to obtain text sequences;
obtaining the domain candidate words according to the text sequences.
In this embodiment, a method of determining domain candidate words is introduced. The text mining device obtains sentences from the domain corpus; the sentences here refer to one or more sentences, for example, M sentences (M being an integer greater than or equal to 1). A corpus stores language material that has actually appeared in real language use; it is a basic computer-borne resource carrying linguistic knowledge, and real corpus material must be analyzed and processed before it becomes a useful resource. A domain corpus is a corpus for a certain domain, for example, a corpus for the game domain or for the medical domain; this application does not limit the type of the domain corpus.
The text mining device performs word segmentation on the sentences from the domain corpus to obtain the corresponding text sequences. Chinese word segmentation can be implemented with dictionary-based algorithms or machine-learning-based algorithms. Dictionary-based segmentation algorithms include the forward maximum matching method, the backward maximum matching method, and the bidirectional matching method. Machine-learning-based algorithms include conditional random fields (CRF), hidden Markov models (HMM), and support vector machines (SVM).
For example, for the sentence "北京福娃是奥运会吉祥物" ("The Beijing Fuwa are the Olympic mascots"), segmentation yields the text sequence "北京/福娃/是/奥运会/吉祥物", where "/" denotes the separator between words. On this basis, at least one domain candidate word can be extracted from the text sequence; for "北京/福娃/是/奥运会/吉祥物", the extractable domain candidate words include "北京", "福娃", "是", "奥运会", and "吉祥物". It should be noted that domain candidate words can also be extracted from the text sequence using the N-Gram algorithm, or using supervised, semi-supervised, or unsupervised algorithms, which is not limited here.
On this basis, indicators such as the term frequency, TFIDF value, solidification, degree of freedom, left information entropy, right information entropy, word length, mutual information, position information, and word span of the domain candidate words in the sentences can be computed, and one or more of these indicators can serve as the domain candidate word features.
Secondly, the embodiments of this application provide a method of extracting domain candidate word features. In the above manner, sentences are obtained from the domain corpus and then segmented; the segmented text sequences serve as the source of the domain candidate words, from which the relevant domain candidate words are obtained, and the domain candidate word features corresponding to each domain candidate word are then extracted, improving the feasibility and operability of the solution.
Optionally, on the basis of the embodiments corresponding to FIG. 3 above, in another optional embodiment of the text mining method provided by the embodiments of this application, obtaining the domain candidate words according to the text sequences specifically includes the following step:
obtaining the domain candidate words corresponding to the text sequence according to a word number sampling threshold and a character number sampling threshold, where the word number sampling threshold represents the upper limit of the number of words in a domain candidate word, and the character number sampling threshold represents the upper limit of the number of characters in a domain candidate word.
In this embodiment, a method of obtaining domain candidate words based on the N-Gram algorithm is introduced: the N-Gram algorithm samples over the text sequence to obtain the domain candidate words. The N-Gram algorithm involves two hyperparameters, the word number sampling threshold (N) and the character number sampling threshold (maxLen). The word number sampling threshold controls how many words at most can be selected for combination, that is, the upper limit of the number of words in a domain candidate word. The character number sampling threshold controls the maximum length of a domain candidate word, that is, the upper limit of the number of characters in it. For example, with N = 3 and maxLen = 10, a domain candidate word is limited to at most 10 characters and consists of 1 word, 2 consecutive words, or 3 consecutive words.
Specifically, taking the text sequence "北京/福娃/是/奥运会/吉祥物" as an example, with word number sampling threshold N = 3 and character number sampling threshold maxLen = 6, the following domain candidate words are obtained:
{北京}, {福娃}, {是}, {奥运会}, {吉祥物}, {北京福娃}, {福娃是}, {是奥运会}, {奥运会吉祥物}, {北京福娃是}, {福娃是奥运会}.
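The sampling above can be sketched as follows; `extract_candidates` is a hypothetical name, and the example reproduces the candidate list given in the text.

```python
def extract_candidates(tokens, n=3, max_len=6):
    """N-Gram sampling over a segmented text sequence: every contiguous run of
    1..n tokens whose joined character length is at most max_len becomes a
    domain candidate word."""
    candidates = []
    for start in range(len(tokens)):
        for size in range(1, n + 1):
            if start + size > len(tokens):
                break
            word = "".join(tokens[start:start + size])
            if len(word) <= max_len:
                candidates.append(word)
    return candidates

tokens = "北京/福娃/是/奥运会/吉祥物".split("/")
print(extract_candidates(tokens, n=3, max_len=6))
```

With N = 3 and maxLen = 6 this yields exactly the 11 candidates listed above; "是奥运会吉祥物" is excluded because its 7 characters exceed maxLen.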
Again, the embodiments of this application provide a method of obtaining domain candidate words based on the N-Gram algorithm. In the above manner, the N-Gram algorithm can be used both to evaluate whether a sentence is reasonable and to evaluate the degree of difference between two character strings. The N-Gram algorithm incorporates all the information that the preceding words can provide; these words strongly constrain the appearance of the current word, which helps extract more accurate and richer domain candidate words.
Optionally, on the basis of the embodiments corresponding to FIG. 3 above, in another optional embodiment of the text mining method provided by the embodiments of this application, obtaining the domain candidate word features according to the domain candidate words specifically includes the following step:
obtaining, according to the text sequence, the domain candidate word features corresponding to the domain candidate words, where the domain candidate word features include at least one of term frequency, term frequency–inverse document frequency (TFIDF) value, degree of freedom, solidification, left information entropy, and right information entropy.
In this embodiment, a method of extracting domain candidate word features is introduced. For each domain candidate word, its term frequency, TFIDF value, degree of freedom, solidification, left information entropy, right information entropy, and so on can be extracted. Taking the domain candidate word "福娃" as an example, the ways of obtaining domain candidate word features are introduced below.
1. Computing the term frequency.
The term frequency of the domain candidate word "福娃" represents the probability of its appearing in the sentences (or text sequences). Generally, the more frequently a word appears in a text, the more likely it is a core word. Suppose "福娃" appears m times in the sentences (or text sequences), and the total number of words in them is n. The term frequency of "福娃" is computed as:
TF_w = m / n
where w denotes the domain candidate word "福娃", TF_w its term frequency, m the number of times it appears in the sentences (or text sequences), and n the total number of words in the sentences (or text sequences).
2. Computing the TFIDF value.
The TFIDF value of the domain candidate word "福娃" is computed from two parts, the term frequency and the inverse document frequency. The inverse document frequency of "福娃" reflects how widely it appears across the domain corpus. Suppose X sentences in the domain corpus contain "福娃" and the total number of sentences in the domain corpus is Y. The inverse document frequency of "福娃" is computed as:
IDF_w = log10(Y / X)
where w denotes the domain candidate word "福娃", IDF_w its inverse document frequency, X the number of sentences in the domain corpus containing "福娃", and Y the total number of sentences in the domain corpus.
The TFIDF value of "福娃" is then computed as:
TFIDF_w = TF_w × IDF_w
where TF_w is the term frequency of the domain candidate word "福娃" and IDF_w its inverse document frequency.
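Under the definitions above, TF, IDF, and TFIDF can be sketched as follows (the function names are hypothetical):

```python
import math

def tf(word_count, total_words):
    """TF_w = m / n: occurrences of the candidate over total words."""
    return word_count / total_words

def idf(total_sentences, sentences_with_word):
    """IDF_w = log10(Y / X): base-10 log of total sentences over sentences
    containing the candidate."""
    return math.log10(total_sentences / sentences_with_word)

def tfidf(word_count, total_words, total_sentences, sentences_with_word):
    """TFIDF_w = TF_w * IDF_w."""
    return (tf(word_count, total_words)
            * idf(total_sentences, sentences_with_word))
```

For instance, a candidate appearing 5 times among 100 words (TF = 0.05), in 10 of 1000 sentences (IDF = 2.0), gets a TFIDF value of 0.1.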
3. Computing the degree of freedom.
The degree of freedom of the domain candidate word "福娃" can be measured with entropy. Suppose "福娃" appears m times in total, and d distinct characters have appeared to its left, occurring d1, d2, ..., dd times respectively, so that m = d1 + d2 + ... + dd. The probability of each character appearing to the left of "福娃" can therefore be computed, and the left information entropy follows from the entropy formula. Similarly, the probability of each character appearing to the right of "福娃" can be computed, and the right information entropy follows from the entropy formula. The smaller of a word's left-neighbor entropy and right-neighbor entropy is taken as the final degree of freedom; the smaller the entropy, the lower the degree of freedom.
4. Computing the solidification.
To compute the solidification of the domain candidate word "福娃", the probabilities of "福", "娃", and "福娃" are needed first, giving P("福"), P("娃"), and P("福娃"). These are occurrence probabilities among the domain candidate words. The solidification of "福娃" is then computed as: solidification("福" and "娃") = P("福娃") / (P("福") × P("娃")).
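The solidification formula above can be sketched from raw occurrence counts as follows; the function name and the count-based probability estimate are assumptions of this sketch.

```python
def solidification(count_whole, count_left, count_right, total):
    """Solidification of a two-part candidate, e.g. "福娃" from "福" and "娃":
    P(whole) / (P(left) * P(right)), with each probability estimated as an
    occurrence count over the total number of candidate occurrences."""
    p_whole = count_whole / total
    p_left = count_left / total
    p_right = count_right / total
    return p_whole / (p_left * p_right)
```

When the whole is the only combination in which its parts ever occur (all counts equal the total), the ratio is 1; additional collocations such as "怕蟑螂" alongside "怕上火" drive the solidification down.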
五、左侧信息熵和右侧信息熵的计算方式;
领域候选词“福娃”的左侧信息熵或右侧信息熵的计算方式为:
Figure PCTCN2021102745-appb-000004
其中,H(w)表示领域候选词“福娃”的信息熵,p(w i)表示领域候选词“福娃”某一侧第i种相邻字符出现的概率,C表示该侧相邻字符种类的总数,即随机事件的总数。
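下面给出一段示意性的Python代码(非本申请原文,仅为按上述熵公式计算左/右信息熵并取较小者作为自由度的一种可能实现,示例中的相邻字符为假设值):

```python
import math
from collections import Counter

def entropy(neighbors):
    """neighbors: 候选词某一侧出现过的相邻字符列表,按熵公式 H = -Σ p_i·log(p_i) 计算。"""
    total = len(neighbors)
    counts = Counter(neighbors)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def freedom(left_neighbors, right_neighbors):
    """自由度取左邻熵和右邻熵中的较小者。"""
    return min(entropy(left_neighbors), entropy(right_neighbors))

left = ["京", "看", "迎", "京"]   # 假设“福娃”左侧出现过的汉字
right = ["是", "的", "是", "在"]  # 假设“福娃”右侧出现过的汉字
print(freedom(left, right))
```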
再次,本申请实施例中,提供了一种提取领域候选词特征的方法,通过上述方式,可以对领域候选词进行特征量化,从词语权重,词语在文档中的位置以及词语的关联信息等维度,提取领域候选词的相关特征,由此构成领域候选词特征,该领域候选词特征能够很好地表达领域候选词的特点,有助于得到更准确的领域候选词评价结果。
可选地,在上述图3对应的各个实施例的基础上,在本申请实施例提供的文本挖掘方法的另一个可选实施例中,根据领域候选词特征,获取领域候选词所对应的词质量分值,具体包括如下步骤:
基于领域候选词特征,通过随机森林模型所包括的决策树,获取领域候选词特征所对应的子分值;
根据领域候选词特征所对应的子分值,获取领域候选词所对应的词质量分值。
本实施例中,介绍了一种利用随机森林模型输出词质量分值的方法。文本分值预估模型可以为决策树模型、梯度提升决策树(Gradient Boosting Decision Tree,GBDT)、极端梯度提升(eXtreme Gradient Boosting,XGBoost)算法或者随机森林(Random Forest,RF)模型等,本申请以文本分值预估模型为随机森林模型作为示例进行说明。
具体地,随机森林模型由T个决策树组成,每个决策树之间是没有关联的,在得到随机森林模型之后,当领域候选词对应的领域候选词特征输入时,由随机森林模型中的每个决策树进行判断,即判断该领域候选词是否属于高质量词,如果领域候选词属于高质量词,则该决策树为领域候选词记为“得分”,如果领域候选词不属于高质量词,则该决策树为领域候选词记为“不得分”。为了便于理解,请参阅图4,图4为本申请实施例中基于决策树生成子分值的一个结构示意图,如图所示,假设将领域候选词“福娃”对应的领域候选词特征输入至其中一个决策树,该决策树先基于领域候选词特征所包括的词频,判断下一个分支,假设领域候选词特征所包括的词频为0.2,则继续判断领域候选词特征所包括的TFIDF值是否大于0.5。假设领域候选词特征所包括的TFIDF值为0.8,则继续判断领域候选词特征所包括的右侧信息熵是否大于0.8。假设领域候选词特征所包括的右侧信息熵为0.9,则确定领域候选词“福娃”得1分,即该决策树输出的子分值为1。
构建大量的决策树组成随机森林模型能够防止过拟合;虽然单个决策树可能存在过拟合,但通过决策树数量的增加可以在很大程度上缓解过拟合现象。随机森林模型包括的T个决策树,采用投票选举的原则计算词质量分值。请参阅图5,图5为本申请实施例中基于随机森林模型生成词质量分值的一个示意图,如图所示,以T等于100为例,可以得到100个子分值,即词质量分值的满分为100。基于此,假设将领域候选词“福娃”对应的领域候选词特征输入至决策树1,由决策树1输出子分值为“1”,将领域候选词“福娃”对应的领域候选词特征输入至决策树2,由决策树2输出子分值为“0”,以此类推,如果100个子分值中有80个子分值为“1”,剩下20个子分值为“0”,那么最终得到的词质量分值即为“80”。
可以理解的是,还可以对不同的决策树赋予不同的权重值,例如,决策树1至决策树10的权重值为1,决策树11至决策树100的权重值为0.5,不同的权重值与对应的子分值相乘后进行累加,得到最终的词质量分值。
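下面给出一段示意性的Python代码(非本申请原文,仅为上述投票与加权投票汇总方式的一种可能实现;各决策树的0/1子分值在此以列表直接给出,实际中由训练好的决策树输出):

```python
def word_quality_score(sub_scores, weights=None):
    """sub_scores: 各决策树输出的0/1子分值列表; weights: 可选的每棵树权重,缺省为等权投票。"""
    if weights is None:
        weights = [1.0] * len(sub_scores)
    return sum(s * w for s, w in zip(sub_scores, weights))

# 100棵树中80棵输出“1”,等权投票得到词质量分值80
sub_scores = [1] * 80 + [0] * 20
print(word_quality_score(sub_scores))  # 80.0

# 前10棵树权重为1、后90棵树权重为0.5的加权投票
weights = [1.0] * 10 + [0.5] * 90
print(word_quality_score(sub_scores, weights))
```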
其次,本申请实施例中,提供了一种利用随机森林模型输出词质量分值的方法,通过上述方式,采用随机森林模型预测得到的词质量分值具有较高的准确率,而且通过多个决策树能够有效地评估领域候选词特征在分类问题上的重要性。此外,对于领域候选词特征不需要进行降维,也无需进行特征选择,使得获取词质量分值的效率更高。
可选地,在上述图3对应的各个实施例的基础上,在本申请实施例提供的文本挖掘方法的另一个可选实施例中,根据领域候选词所对应的词质量分值确定新词,具体包括如下步骤:
若领域候选词所对应的词质量分值大于或等于质量分阈值,则确定领域候选词属于新词;
若领域候选词所对应的词质量分值小于质量分阈值,则确定领域候选词不属于新词。
本实施例中,介绍了一种基于词质量分值判断新词的方法。为了便于说明,在本实施例中,以一个领域候选词为例进行介绍,其他领域候选词也采用类似方式判断是否属于新词,此处不做赘述。
具体地,以质量分阈值等于60为例。第一种情况为,假设领域候选词的词质量分值为80,即该领域候选词所对应的词质量分值80大于质量分阈值60,由此,可以确定该领域候选词属于新词。第二种情况为,假设领域候选词的词质量分值为50,即该领域候选词所对应的词质量分值50小于质量分阈值60,于是可以确定该领域候选词不属于新词。
其次,本申请实施例中,提供了一种基于词质量分值判断新词的方法,通过上述方式,将词质量分值较高的领域候选词作为新词,这样能够在一定程度上保证新词具有较高的质量,可作为领域新词的待选项,从而提升所选新词的可靠性和准确性。
可选地,在上述图3对应的各个实施例的基础上,在本申请实施例提供的文本挖掘方法的另一个可选实施例中,根据领域候选词所对应的词质量分值确定新词,具体包括如下步骤:
获取领域候选词所对应的词频;
若领域候选词所对应的词质量分值大于或等于质量分阈值,且,领域候选词所对应的词频大于或等于第一词频阈值,则确定领域候选词属于新词;
若领域候选词所对应的词质量分值小于质量分阈值,或,领域候选词所对应的词频小于第一词频阈值,则确定领域候选词不属于新词。
本实施例中,介绍了一种基于词质量分值和词频共同判断新词的方法。为了便于说明,在本实施例中,以一个领域候选词为例进行介绍,其他领域候选词也采用类似方式判断是否属于新词,此处不做赘述。
具体地,以质量分阈值等于60,第一词频阈值等于0.2为例。第一种情况为,假设领域候选词的词质量分值为80,领域候选词所对应的词频为0.5,即该领域候选词所对应的词质量分值80大于质量分阈值60,且领域候选词所对应的词频0.5大于第一词频阈值0.2,由此,可以确定该领域候选词属于新词。第二种情况为,假设领域候选词的词质量分值为50,领域候选词所对应的词频为0.5,即该领域候选词所对应的词质量分值50小于质量分阈值60,尽管领域候选词所对应的词频0.5大于第一词频阈值0.2,仍可以确定该领域候选词不属于新词。第三种情况为,假设领域候选词的词质量分值为80,领域候选词所对应的词频为0.1,即该领域候选词所对应的词质量分值80大于质量分阈值60,但领域候选词所对应的词频0.1小于第一词频阈值0.2,于是可以确定该领域候选词不属于新词。第四种情况为,假设领域候选词的词质量分值为50,领域候选词所对应的词频为0.1,即该领域候选词所对应的词质量分值50小于质量分阈值60,且领域候选词所对应的词频0.1小于第一词频阈值0.2,于是可以确定该领域候选词不属于新词。
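上述四种情况的判定逻辑可以用一段示意性的Python代码表示(非本申请原文,仅为一种可能实现,阈值取正文示例中的数值):

```python
def is_new_word(quality_score, term_freq, quality_threshold=60, freq_threshold=0.2):
    """词质量分值与词频双阈值共同判定:两项均达标才确定为新词。"""
    return quality_score >= quality_threshold and term_freq >= freq_threshold

print(is_new_word(80, 0.5))  # True:两项均达标
print(is_new_word(50, 0.5))  # False:质量分不达标
print(is_new_word(80, 0.1))  # False:词频不达标
print(is_new_word(50, 0.1))  # False:两项均不达标
```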
其次,本申请实施例中,提供了一种基于词质量分值和词频共同判断新词的方法,通过上述方式,将词质量分值较高的领域候选词作为新词,这样能够在一定程度上保证新词具有较高的质量,可作为领域新词的待选项,从而提升所选新词的可靠性和准确性。与此同时,还考虑到该领域候选词的词频,选择词频较高的词语作为新词,能够一定程度上保证新词具有较高的传播率,更符合新词的定义。
可选地,在上述图3对应的各个实施例的基础上,在本申请实施例提供的文本挖掘方法的另一个可选实施例中,根据新词获取关联文本,具体包括如下步骤:
通过搜索引擎获取新词所对应的搜索反馈结果,其中,搜索反馈结果包括至少一条搜索结果;
根据新词对应的搜索反馈结果,从至少一条搜索结果中将相关度最高的前R条搜索结果确定为新词所对应的关联文本,其中,R为大于或等于1的整数。
本实施例中,介绍了一种获取关联文本的方法。在获取到新词后,需要对新词进行搜索,为了便于说明,在本实施例中,以一个新词为例进行介绍,其他新词也采用类似方式得到关联文本,此处不做赘述。
具体地,以新词“打王者”为例进行介绍,将该领域候选词输入至搜索引擎之后可得到搜索反馈结果,搜索反馈结果包括至少一条搜索结果。为了便于理解,请参阅图6,图6为本申请实施例中通过搜索引擎展示搜索反馈结果的一个界面示意图,如图所示,输入领域候选词“打王者”之后得到搜索反馈结果,搜索反馈结果包括10条搜索结果,将10条搜索结果按照相关度从高到低排序后得到如表1所示的结果。
表1
相关度 搜索反馈结果
第一 打王者用什么手机好
第二 打王者卡顿怎么办
第三 打王者费流量吗
第四 打王者听什么歌有节奏
第五 打王者能赚钱的软件
第六 打王者手出汗影响操作该怎么办
第七 打王者荣耀
第八 打王者为什么网络特别卡
第九 打王者表情包
第十 打王者时最佳的游戏配置模式
由表1可知,可基于搜索反馈结果,将相关度最高的前R条搜索结果作为新词“打王者”的关联文本,假设R等于5,那么关联文本包括5条搜索结果,分别为“打王者用什么手机好”,“打王者卡顿怎么办”,“打王者费流量吗”,“打王者听什么歌有节奏”和“打王者能赚钱的软件”。
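搜索引擎的接口因具体实现而异,下面用一段示意性的Python代码(非本申请原文,仅为一种可能实现)以(相关度, 文本)元组列表模拟搜索反馈结果,并取相关度最高的前R条作为关联文本:

```python
def top_r_related_texts(search_results, r=5):
    """search_results: [(relevance, text), ...];返回相关度最高的前r条文本作为关联文本。"""
    ranked = sorted(search_results, key=lambda item: item[0], reverse=True)
    return [text for _, text in ranked[:r]]

# 假设的相关度数值,仅用于演示排序与截取
results = [(0.9, "打王者用什么手机好"), (0.7, "打王者费流量吗"),
           (0.8, "打王者卡顿怎么办"), (0.3, "打王者表情包")]
print(top_r_related_texts(results, r=2))
```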
其次,本申请实施例中,提供了一种获取关联文本的方法,通过上述方式,以搜索引擎的搜索反馈结果作为评价新词使用频率的一个标准,能够更贴近新词使用的实际情况,有利于找出与新词相关领域的关联文本。
可选地,在上述图3对应的各个实施例的基础上,在本申请实施例提供的文本挖掘方法的另一个可选实施例中,根据新词获取关联文本之后,还包括如下步骤:
获取领域种子词;
根据关联文本,确定领域种子词的平均词频;
若平均词频大于或等于第二词频阈值,则确定领域种子词满足领域新词挖掘条件。
本实施例中,介绍了一种基于平均词频判定领域种子词是否满足领域新词挖掘条件的方法。首先需要获取领域种子词,再基于关联文本判断该领域种子词是否属于领域新词,其中,领域种子词通常为人工录入的词语,为了便于理解,请参阅图7,图7为本申请实施例中人工录入领域种子词的一个界面示意图,如图所示,用户可以通过人工录入领域种子词的界面输入新的领域种子词或者删去已有的领域种子词,每个领域种子词对应于一个词语标识,且每个领域种子词需要标注其对应的领域,例如,在“手机游戏”领域中可以包括领域种子词“打王者”、“吃鸡”和“上分”,如果还需要添加新的领域种子词,即点击“+”,并输入相关信息即可。
具体地,以领域种子词为“吃鸡”为例,基于关联文本计算领域种子词“吃鸡”的待处理词频,为了便于理解,请参阅表2,表2为领域种子词在关联文本中待处理词频的一个示意。这里的关联文本是指一个或多个关联文本,例如,包括Q个关联文本(Q为大于或等于1的整数),即关联文本与新词具有一一对应的关系,每个关联文本标识用于指示一个新词所对应的关联文本。
表2
关联文本标识 待处理词频
关联文本1 0.1
关联文本2 0.5
关联文本3 0.2
关联文本4 0
关联文本5 0.3
由表2可知,领域种子词“吃鸡”的平均词频为(0.1+0.5+0.2+0+0.3)/5=0.22。假设第二词频阈值为0.1,则该领域种子词“吃鸡”的平均词频0.22大于第二词频阈值0.1,由此,可以将该领域种子词“吃鸡”确定为满足领域新词挖掘条件的领域新词。
再次,本申请实施例中,提供了一种基于平均词频判定领域种子词是否满足领域新词挖掘条件的方法,通过上述方式,如果平均词频达到词频阈值,即认为该领域种子词的使用频率较高,由此可以将领域种子词确定为领域新词,从而提升方案的可行性。
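平均词频(以及下文的最大词频)与第二词频阈值的比较,可以用一段示意性的Python代码表示(非本申请原文,仅为一种可能实现,词频数值取正文示例):

```python
def meets_mining_condition(freqs, threshold, mode="mean"):
    """freqs: 领域种子词在各关联文本中的待处理词频列表;
    mode="mean"时以平均词频、mode="max"时以最大词频与第二词频阈值比较。"""
    stat = sum(freqs) / len(freqs) if mode == "mean" else max(freqs)
    return stat >= threshold

freqs = [0.1, 0.5, 0.2, 0.0, 0.3]
print(meets_mining_condition(freqs, 0.1))          # 平均词频0.22 ≥ 0.1 → True
print(meets_mining_condition(freqs, 0.7, "max"))   # 最大词频0.5 < 0.7 → False
```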
可选地,在上述图3对应的各个实施例的基础上,在本申请实施例提供的文本挖掘方法的另一个可选实施例中,根据新词获取关联文本之后,还包括如下步骤:
获取领域种子词;
根据关联文本,确定领域种子词的最大词频;
若最大词频大于或等于第二词频阈值,则确定领域种子词满足领域新词挖掘条件。
本实施例中,提供了一种基于最大词频判定领域种子词是否满足领域新词挖掘条件的方法。首先需要获取领域种子词,再基于关联文本判断该领域种子词是否属于领域新词,其中,领域种子词通常为人工录入的词语,具体的录入方式可参阅前述实施例,此处不做赘述。
具体地,以领域种子词为“吃鸡”为例,基于关联文本计算领域种子词“吃鸡”的待处理词频,为了便于理解,请参阅表3,表3为领域种子词在关联文本中的待处理词频的另一个示意。这里的关联文本是指一个或多个关联文本,例如,包括Q个关联文本(Q为大于或等于1的整数),即关联文本与新词具有一一对应的关系,每个关联文本标识用于指示一个新词所对应的关联文本。
表3
(表3原文为图片,示出领域种子词“吃鸡”在Q个关联文本中的待处理词频,其中的最大值为0.8)
由表3可知,领域种子词“吃鸡”的最大词频为0.8,假设第二词频阈值为0.7,则该领域种子词“吃鸡”的最大词频0.8大于第二词频阈值0.7,由此,可以将该领域种子词“吃鸡”确定为满足领域新词挖掘条件的领域新词。
再次,本申请实施例中,提供了一种基于最大词频判定领域种子词是否满足领域新词挖掘条件的方法,通过上述方式,如果最大词频达到词频阈值,即认为该领域种子词的使用频率较高,由此可以将领域种子词确定为领域新词,从而提升方案的可行性。
基于上述介绍,请参阅图8,图8为本申请实施例中挖掘领域新词的一个流程示意图,如图所示,具体地:
在步骤A1中,从领域语料库中获取语句,其中,该语句可以包括M条语句;
在步骤A2中,对获取到的语句进行分词处理,得到对应的文本序列,其中,该文本序列可以包括M个文本序列;
在步骤A3中,采用N-Gram从文本序列中提取领域候选词;
在步骤A4中,计算领域候选词的领域候选词特征;
在步骤A5中,将该领域候选词特征输入至训练好的随机森林模型进行预测,由随机森林模型输出词质量分值;
在步骤A6中,判断领域候选词的词质量分值是否大于或等于质量分阈值,若词质量分值大于或等于质量分阈值,则执行步骤A7,若词质量分值小于质量分阈值,则执行步骤A8;
在步骤A7中,判断该领域候选词的词频是否大于或等于第一词频阈值,若领域候选词的词频大于或等于第一词频阈值,则执行步骤A9,若领域候选词的词频小于第一词频阈值,则执行步骤A8;
在步骤A8中,确定该领域候选词为无意义的词语;
在步骤A9中,确定该领域候选词为新词;
在步骤A10中,从领域种子词库中获取领域种子词;
在步骤A11中,采用新词搜索关联文本;
在步骤A12中,基于搜索到的关联文本,可以计算得到该领域种子词的平均词频(或最大词频);
在步骤A13中,判断该领域种子词的平均词频(或最大词频)是否大于或等于第二词频阈值,若领域种子词的平均词频(或最大词频)大于或等于第二词频阈值,则执行步骤A15,若领域种子词的平均词频(或最大词频)小于第二词频阈值,则执行步骤A14;
在步骤A14中,确定该新词不是领域新词;
在步骤A15中,确定该新词为领域新词。
可选地,在上述图3对应的各个实施例的基础上,在本申请实施例提供的文本挖掘方法的另一个可选实施例中,还包括如下步骤:
获取K组领域候选词样本,其中,每组领域候选词样本包括领域候选词正样本以及领域候选词负样本,领域候选词正样本来源于正样本池,领域候选词负样本来源于负样本池,K为大于或等于1的整数;
根据K组领域候选词样本获取K组领域候选词样本特征,其中,领域候选词样本特征与领域候选词样本具有一一对应的关系,每个领域候选词样本特征包括领域候选词正样本所对应的领域候选词样本特征以及领域候选词负样本所对应的领域候选词样本特征;
基于K组领域候选词样本特征,通过待训练文本分值预估模型获取K组预测结果,其中,预测结果与领域候选词样本特征具有一一对应的关系,每组预测结果中包括领域候选词正样本的预测标签以及领域候选词负样本的预测标签;
根据K组预测结果以及K组领域候选词样本,对待训练文本分值预估模型进行训练,直至满足模型训练条件,输出文本分值预估模型;
根据领域候选词特征,获取领域候选词所对应的词质量分值,具体包括如下步骤:
基于领域候选词特征,通过文本分值预估模型获取领域候选词所对应的词质量分值。
本实施例中,介绍了一种训练文本分值预估模型的方法。假设待训练文本分值预估模型为决策树模型,那么K等于1;假设待训练文本分值预估模型为随机森林模型,则K等于T,且K为大于1的整数。
具体地,以待训练文本分值预估模型为待训练的随机森林模型为例,K组领域候选词样本中的每组领域候选词样本用于训练一个决策树。其中,每组领域候选词样本包括领域候选词正样本以及领域候选词负样本,领域候选词正样本的数量可以等于领域候选词负样本的数量。类似地,基于每组领域候选词样本中的每个领域候选词样本,提取其对应的领域候选词样本特征,即可得到K组领域候选词样本特征。每个领域候选词样本特征包括领域候选词正样本所对应的领域候选词样本特征以及领域候选词负样本所对应的领域候选词样本特征。
为了便于理解,请参阅图9,图9为本申请实施例中随机森林模型的一个训练框架示意图,如图所示,以待训练文本分值预估模型为随机森林模型为例,即K等于T,将T组领域候选词样本划分为领域候选词样本1至领域候选词样本T,再分别获取每组领域候选词样本所对应的领域候选词样本特征,即得到领域候选词样本特征1至领域候选词样本特征T。将每组领域候选词样本特征输入至待训练随机森林模型中的决策树,由每个决策树分别进行独立训练,每个决策树输出对应的预测结果。当满足模型训练条件,输出T个决策树,即得到随机森林模型。
需要说明的是,待训练文本分值预估模型可以是待训练的随机森林模型,或者决策树模型,又或者是其他类型的模型。
可以理解的是,当一个决策树的迭代次数达到阈值,或者,损失值收敛,又或者,损失值为0时,均可以认为满足模型训练条件,此处可输出文本分值预估模型。
为了便于理解,请参阅图10,图10为本申请实施例中训练文本分值预估模型的一个流程示意图,如图所示,具体地:
在步骤B1中,从领域语料库中获取语句,其中,该语句可以包括S条语句;
在步骤B2中,对获取到的语句进行分词处理,得到对应的文本序列,其中,该文本序列可以包括S个文本序列;
在步骤B3中,采用N-Gram从文本序列中提取用于模型训练的领域候选词(即得到待训练领域候选词样本);
在步骤B4中,计算待训练领域候选词样本所对应的领域候选词特征;
在步骤B5中,使用通用词语库对待训练领域候选词样本进行分类;
在步骤B6中,如果待训练领域候选词样本命中通用词语库,就将该待训练领域候选词样本加入至正样本池;
在步骤B7中,如果待训练领域候选词样本未命中通用词语库,就将该待训练领域候选词样本加入至负样本池;
在步骤B8中,将正样本池中存储的领域候选词作为领域候选词正样本,并将负样本池中存储的领域候选词作为领域候选词负样本,使用领域候选词正样本和领域候选词负样本共同训练得到文本分值预估模型,例如,训练得到随机森林模型。
再次,本申请实施例中,提供了一种训练文本分值预估模型的方法,通过上述方式,可以使用已积累的通用词语库和领域语料库构建正负样本,然后训练具有监督机器学习的文本分值预估模型来预测领域候选词的词质量分值,选择的文本分值预估模型能够最大化地利用领域候选词的所有特征,并可以适应并非十分准确的领域候选词正样本和领域候选词负样本,综合考量使用随机森林模型进行学习,可达到上述效果。
可选地,在上述图3对应的各个实施例的基础上,在本申请实施例提供的文本挖掘方法的另一个可选实施例中,获取K组领域候选词样本之前,还包括如下步骤:
从领域语料库中获取语句;
对语句进行分词处理,得到文本序列;
根据文本序列获取待训练领域候选词样本;
若待训练领域候选词样本命中通用词语库,则确定待训练领域候选词样本属于领域候选词正样本,并将待训练领域候选词样本添加至正样本池;
若待训练领域候选词样本未命中通用词语库,则确定待训练领域候选词样本属于领域候选词负样本,并将待训练领域候选词样本添加至负样本池。
本实施例中,介绍了一种将领域候选词样本添加至正样本池和负样本池的方法。与前述实施例介绍的内容类似,在训练文本分值预估模型的过程中,从领域语料库中获取语句,这里的语句是指一条或多条语句,例如,包括S条语句(S为大于或等于1的整数)。然后对语句进行分词处理,得到文本序列,再从文本序列中提取领域候选词样本。需要说明的是,训练所使用的语句与预测所使用的语句可能完全相同,也可能部分相同,也可能完全不同,此处不做限定。
为了便于说明,在本实施例中,将以一个领域候选词样本为例进行介绍,其他领域候选词样本也采用类似方式判定加入正样本池还是负样本池,此处不做赘述。
具体地,提取到的领域候选词样本需要与通用词语库进行比对,如果领域候选词样本在通用词语库中出现,则认为该领域候选词样本属于高质量词语,将命中通用词语库的领域候选词样本加入至正样本池,即确定该领域候选词样本属于领域候选词正样本。将未命中通用词语库的领域候选词样本加入至负样本池,即确定该领域候选词样本属于领域候选词负样本。可以预见的是,负样本池中存储的领域候选词负样本数量远大于正样本池中存储的领域候选词正样本数量。
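正负样本池的划分逻辑可以用一段示意性的Python代码表示(非本申请原文,仅为一种可能实现;示例中的通用词语库与候选词样本均为假设值,后续可将两个样本池中的样本特征交给随机森林等模型训练):

```python
def split_pools(candidate_samples, common_lexicon):
    """按是否命中通用词语库,把待训练领域候选词样本划入正样本池/负样本池。"""
    positive_pool, negative_pool = [], []
    for word in candidate_samples:
        if word in common_lexicon:
            positive_pool.append(word)   # 命中通用词语库 → 领域候选词正样本
        else:
            negative_pool.append(word)   # 未命中通用词语库 → 领域候选词负样本
    return positive_pool, negative_pool

common_lexicon = {"北京", "奥运会", "吉祥物"}
candidates = ["北京", "福娃", "奥运会吉祥物", "吉祥物"]
pos, neg = split_pools(candidates, common_lexicon)
print(pos)  # ['北京', '吉祥物']
print(neg)  # ['福娃', '奥运会吉祥物']
```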
进一步地,本申请实施例中,提供了一种将领域候选词样本添加至正样本池和负样本池的方法,通过上述方式,利用通用词语库能够较为准确地将领域候选词样本划分至正样本池或负样本池,从而便于后续训练,并且有利于提升训练的准确性。此外,基于通用词语库进行匹配,节省了人工划分正负样本的过程,提升训练效率。
为了便于理解,请参阅图11,图11为本申请实施例中文本挖掘方法的一个整体流程示意图,如图所示,具体地:
在步骤C1中,从领域语料库中获取语句,其中,该语句可以包括S条语句;
在步骤C2中,对获取到的语句进行分词处理,得到对应的文本序列,其中,该文本序列可以包括S个文本序列;
在步骤C3中,采用N-Gram从文本序列中提取用于模型训练的领域候选词(即得到待训练领域候选词样本);
在步骤C4中,计算待训练领域候选词样本所对应的领域候选词特征;
在步骤C5中,使用通用词语库对待训练领域候选词样本进行分类;
在步骤C6中,如果待训练领域候选词样本命中通用词语库,就将该待训练领域候选词样本加入至正样本池;
在步骤C7中,如果待训练领域候选词样本未命中通用词语库,就将该待训练领域候选词样本加入至负样本池;
在步骤C8中,将正样本池中存储的领域候选词作为领域候选词正样本,并将负样本池中存储的领域候选词作为领域候选词负样本,使用领域候选词正样本和领域候选词负样本共同训练得到文本分值预估模型,例如,训练得到随机森林模型;
在步骤C9中,采用N-Gram从文本序列中提取领域候选词;
在步骤C10中,计算该领域候选词的领域候选词特征,再将该领域候选词特征输入至训练好的文本分值预估模型(如随机森林模型)进行预测,由文本分值预估模型(如随机森林模型)输出词质量分值;
在步骤C11中,判断领域候选词的词质量分值是否大于或等于质量分阈值,若词质量分值大于或等于质量分阈值,则执行步骤C12,若词质量分值小于质量分阈值,则执行步骤C14;
在步骤C12中,判断该领域候选词的词频是否大于或等于第一词频阈值,若领域候选词的词频大于或等于第一词频阈值,则执行步骤C15,若领域候选词的词频小于第一词频阈值,则执行步骤C14;
在步骤C13中,从领域种子词库中获取领域种子词;
在步骤C14中,确定该领域候选词为无意义的词语;
在步骤C15中,确定该领域候选词为新词;
在步骤C16中,采用新词搜索关联文本;
在步骤C17中,基于搜索到的关联文本,可以计算得到该领域种子词的平均词频(或最大词频);
在步骤C18中,判断该领域种子词的平均词频(或最大词频)是否大于或等于第二词频阈值,若领域种子词的平均词频(或最大词频)大于或等于第二词频阈值,则执行步骤C20,若领域种子词的平均词频(或最大词频)小于第二词频阈值,则执行步骤C19;
在步骤C19中,确定该新词不是领域新词;
在步骤C20中,确定该新词为领域新词。
下面对本申请中的文本挖掘装置进行详细描述,请参阅图12,图12为本申请实施例中文本挖掘装置的一个实施例示意图,文本挖掘装置20包括:
获取模块201,用于获取领域候选词所对应的领域候选词特征;
获取模块201,还用于根据领域候选词特征,获取领域候选词所对应的词质量分值;
确定模块202,用于根据领域候选词所对应的词质量分值,从所述领域候选词中确定新词;
获取模块201,还用于根据新词获取关联文本;
确定模块202,还用于若根据关联文本确定领域种子词满足领域新词挖掘条件,则确定领域种子词为领域新词。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,首先获取领域候选词所对应的领域候选词特征,然后根据领域候选词特征,获取领域候选词所对应的词质量分值,再根据领域候选词所对应的词质量分值确定新词,根据新词获取关联文本。如果根据关联文本确定领域种子词满足领域新词挖掘条件,则确定领域种子词为领域新词。通过上述方式,可以基于机器学习算法通过领域候选词筛选出新词,避免了人工设定大量特征阈值的过程,从而降低了人工成本,由此,能够很好地适应互联网时代快速出现的特异化新词。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,
获取模块201,具体用于从领域语料库中获取语句;
对语句中的每个语句进行分词处理,得到文本序列;
根据文本序列获取领域候选词。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,从领域语料库中获取语句,然后对语句进行分词处理,这些分词后的文本序列作为领域候选词的来源,由此获取相关的领域候选词,再进一步提取每个领域候选词所对应的领域候选词特征,从而提升方案的可行性和可操作性。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,
获取模块201,具体用于根据词数采样阈值以及字符数采样阈值,获取文本序列所对应的领域候选词,其中,词数采样阈值表示所述领域候选词中词语数量的上限值,字符数采样阈值表示所述领域候选词中字符数量的上限值。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,既可以利用N-Gram算法来评估一个语句是否合理,又可以用于评估两个字符串之间的差异程度,N-gram算法包含了前若干个词语所能提供的全部信息,这些词语对于当前词语的出现具有很强的约束力,有利于提取更准确且更丰富的领域候选词。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,
获取模块201,具体用于根据文本序列获取领域候选词所对应的领域候选词特征,其中,领域候选词特征包括词频、词频逆文档频率TFIDF值、自由度、凝固度、左侧信息熵以及右侧信息熵中的至少一项。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,可以对领域候选词进行特征量化,从词语权重,词语在文档中的位置以及词语的关联信息等维度,提取领域候选词的相关特征,由此构成领域候选词特征,该领域候选词特征能够很好地表达领域候选词的特点,有助于得到更准确的领域候选词评价结果。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,文本分值预估模型为随机森林模型,其中,随机森林模型包括T个决策树,T为大于1的整数;
获取模块201,具体用于基于领域候选词特征,通过随机森林模型所包括的决策树,获取领域候选词特征所对应的子分值;
根据领域候选词特征所对应的子分值,获取领域候选词所对应的词质量分值。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,采用随机森林模型预测得到的词质量分值具有较高的准确率,而且通过多个决策树能够有效地评估领域候选词特征在分类问题上的重要性。此外,对于领域候选词特征不需要进行降维,也无需进行特征选择,使得获取词质量分值的效率更高。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,
确定模块202,具体用于若领域候选词所对应的词质量分值大于或等于质量分阈值,则确定领域候选词属于新词;
若领域候选词所对应的词质量分值小于质量分阈值,则确定领域候选词不属于新词。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,将词质量分值较高的领域候选词作为新词,这样能够在一定程度上保证新词具有较高的质量,可作为领域新词的待选项,从而提升所选新词的可靠性和准确性。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,
确定模块202,具体用于获取领域候选词所对应的词频;
若领域候选词所对应的词质量分值大于或等于质量分阈值,且,领域候选词所对应的词频大于或等于第一词频阈值,则确定领域候选词属于新词;
若领域候选词所对应的词质量分值小于质量分阈值,或,领域候选词所对应的词频小于第一词频阈值,则确定领域候选词不属于新词。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,将词质量分值较高的领域候选词作为新词,这样能够在一定程度上保证新词具有较高的质量,可作为领域新词的待选项,从而提升所选新词的可靠性和准确性。与此同时,还考虑到该领域候选词的词频,选择词频较高的词语作为新词,能够一定程度上保证新词具有较高的传播率,更符合新词的定义。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,
获取模块201,具体用于通过搜索引擎获取新词所对应的搜索反馈结果,其中,搜索反馈结果包括至少一条搜索结果;
根据新词对应的搜索反馈结果,从至少一条搜索结果中将相关度最高的前R条搜索结果确定为新词所对应的关联文本,其中,R为大于或等于1的整数。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,以搜索引擎的搜索反馈结果作为评价新词使用频率的一个标准,能够更贴近新词使用的实际情况,有利于找出与新词相关领域的关联文本。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,
获取模块201,还用于在根据新词获取关联文本之后,获取领域种子词;
确定模块202,还用于根据关联文本,确定领域种子词的平均词频;
确定模块202,还用于若平均词频大于或等于第二词频阈值,则确定领域种子词满足领域新词挖掘条件。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,如果平均词频达到词频阈值,即认为该领域种子词的使用频率较高,由此可以将领域种子词确定为领域新词,从而提升方案的可行性。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,
获取模块201,还用于在根据新词获取关联文本之后,获取领域种子词;
确定模块202,还用于根据关联文本,确定领域种子词的最大词频;
确定模块202,还用于若最大词频大于或等于第二词频阈值,则确定领域种子词满足领域新词挖掘条件。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,如果最大词频达到词频阈值,即认为该领域种子词的使用频率较高,由此可以将领域种子词确定为领域新词,从而提升方案的可行性。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,文本挖掘装置20还包括训练模块203;
获取模块201,还用于获取K组领域候选词样本,其中,每组领域候选词样本包括领域候选词正样本以及领域候选词负样本,领域候选词正样本来源于正样本池,领域候选词负样本来源于负样本池,K为大于或等于1的整数;
获取模块201,还用于根据K组领域候选词样本获取K组领域候选词样本特征,其中,领域候选词样本特征与领域候选词样本具有一一对应的关系,每个领域候选词样本特征包括领域候选词正样本所对应的领域候选词样本特征以及领域候选词负样本所对应的领域候选词样本特征;
获取模块201,还用于基于K组领域候选词样本特征,通过待训练文本分值预估模型获取K组预测结果,其中,预测结果与领域候选词样本特征具有一一对应的关系,每组预测结果中包括领域候选词正样本的预测标签以及领域候选词负样本的预测标签;
训练模块203,用于根据K组预测结果以及K组领域候选词样本,对待训练文本分值预估模型进行训练,直至满足模型训练条件,输出文本分值预估模型;
获取模块201,具体用于基于领域候选词特征,通过文本分值预估模型获取领域候选词所对应的词质量分值。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,可以使用已积累的通用词语库和领域语料库构建正负样本,然后训练具有监督机器学习的文本分值预估模型来预测领域候选词的词质量分值,选择的文本分值预估模型能够最大化的利用领域候选词的所有特征,并可以适应并非十分准确的领域候选词正样本和领域候选词负样本,综合考量使用随机森林模型进行学习,可达到上述效果。
可选地,在上述图12所对应的实施例的基础上,本申请实施例提供的文本挖掘装置20的另一实施例中,文本挖掘装置20还包括处理模块204;
获取模块201,还用于获取K组领域候选词样本之前,从领域语料库中获取语句;
处理模块204,用于对语句进行分词处理,得到文本序列;
获取模块201,还用于根据文本序列获取待训练领域候选词样本;
确定模块202,还用于若待训练领域候选词样本命中通用词语库,则确定待训练领域候选词样本属于领域候选词正样本,并将待训练领域候选词样本添加至正样本池;
确定模块202,还用于若待训练领域候选词样本未命中通用词语库,则确定待训练领域候选词样本属于领域候选词负样本,并将待训练领域候选词样本添加至负样本池。
本申请实施例中,提供了一种基于人工智能的文本挖掘装置,采用上述装置,利用通用词语库能够较为准确地将领域候选词样本划分至正样本池或负样本池,从而便于后续训练,并且有利于提升训练的准确性。此外,基于通用词语库进行匹配,节省了人工划分正负样本的过程,提升训练效率。
本申请实施例还提供了另一种文本挖掘装置,该文本挖掘装置可以部署于服务器上。图13是本申请实施例提供的一种服务器结构示意图,该服务器300可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)322(例如,一个或一个以上处理器)和存储器332,一个或一个以上存储应用程序342或数据344的存储介质330(例如一个或一个以上海量存储设备)。其中,存储器332和存储介质330可以是短暂存储或持久存储。存储在存储介质330的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器322可以设置为与存储介质330通信,在服务器300上执行存储介质330中的一系列指令操作。
服务器300还可以包括一个或一个以上电源326,一个或一个以上有线或无线网络接口350,一个或一个以上输入输出接口358,和/或,一个或一个以上操作系统341,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
上述实施例中由服务器所执行的步骤可以基于该图13所示的服务器结构。
本申请实施例还提供了另一种文本挖掘装置,该文本挖掘装置可以部署于终端设备上。如图14所示,为了便于说明,仅示出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请实施例方法部分。该终端设备可以为包括手机、平板电脑、个人数字助理(Personal Digital Assistant,PDA)、销售终端设备(Point of Sales,POS)、车载电脑等任意终端设备,以终端设备为手机为例:
图14示出的是与本申请实施例提供的终端设备相关的手机的部分结构的框图。参考图14,手机包括:射频(Radio Frequency,RF)电路410、存储器420、输入单元430、显示单元440、传感器450、音频电路460、无线保真(wireless fidelity,WiFi)模块470、处理器480、以及电源490等部件。本领域技术人员可以理解,图14中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图14对手机的各个构成部件进行具体的介绍:
输入单元430可用于接收输入的数字或字符信息,以及产生与手机的用户设置以及功能控制有关的键信号输入。具体地,输入单元430可包括触控面板431以及其他输入设备432。显示单元440可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单。显示单元440可包括显示面板441。
音频电路460、扬声器461,传声器462可提供用户与手机之间的音频接口。
在本申请实施例中,该终端设备所包括的处理器480还具有以下功能:
获取领域候选词所对应的领域候选词特征;
根据领域候选词特征,获取领域候选词所对应的词质量分值;
根据领域候选词所对应的词质量分值确定新词;
根据新词获取关联文本;
若根据关联文本确定领域种子词满足领域新词挖掘条件,则确定领域种子词为领域新词。
可选的,处理器480还用于执行如前述各个实施例描述的方法。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行如前述各个实施例描述的方法。
本申请实施例中还提供一种包括程序的计算机程序产品,当其在计算机上运行时,使得计算机执行前述各个实施例描述的方法。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对相关技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (16)

  1. 一种基于人工智能的文本挖掘方法,所述方法由文本挖掘设备执行,所述方法包括:
    获取领域候选词所对应的领域候选词特征;
    根据所述领域候选词特征,获取所述领域候选词所对应的词质量分值;
    根据所述领域候选词所对应的词质量分值,从所述领域候选词中确定新词;
    根据所述新词获取关联文本;
    若根据所述关联文本确定领域种子词满足领域新词挖掘条件,则确定所述领域种子词为领域新词。
  2. 根据权利要求1所述的文本挖掘方法,所述方法还包括:
    从领域语料库中获取语句;
    对所述语句中的每个语句进行分词处理,得到文本序列;
    根据所述文本序列获取所述领域候选词。
  3. 根据权利要求2所述的文本挖掘方法,所述根据所述文本序列获取所述领域候选词,包括:
    根据词数采样阈值以及字符数采样阈值,获取所述文本序列所对应的领域候选词,其中,所述词数采样阈值表示所述领域候选词中词语数量的上限值,所述字符数采样阈值表示所述领域候选词中字符数量的上限值。
  4. 根据权利要求2所述的文本挖掘方法,所述获取领域候选词所对应的领域候选词特征,包括:
    根据所述文本序列获取所述领域候选词所对应的领域候选词特征,其中,所述领域候选词特征包括词频、词频逆文档频率TFIDF值、自由度、凝固度、左侧信息熵以及右侧信息熵中的至少一项。
  5. 根据权利要求1所述的文本挖掘方法,所述根据所述领域候选词特征,获取所述领域候选词所对应的词质量分值,包括:
    基于所述领域候选词特征,通过随机森林模型所包括的决策树,获取所述领域候选词特征所对应的子分值;
    根据所述领域候选词特征所对应的子分值,获取所述领域候选词所对应的词质量分值。
  6. 根据权利要求1所述的文本挖掘方法,所述根据所述领域候选词所对应的词质量分值确定新词,包括:
    若所述领域候选词所对应的词质量分值大于或等于质量分阈值,则确定所述领域候选词属于新词;
    若所述领域候选词所对应的词质量分值小于所述质量分阈值,则确定所述领域候选词不属于新词。
  7. 根据权利要求1所述的文本挖掘方法,所述根据所述领域候选词所对应的词质量分值确定新词,包括:
    获取所述领域候选词所对应的词频;
    若所述领域候选词所对应的词质量分值大于或等于质量分阈值,且,所述领域候选词所对应的词频大于或等于第一词频阈值,则确定所述领域候选词属于新词;
    若所述领域候选词所对应的词质量分值小于所述质量分阈值,或,所述领域候选词所对应的词频小于所述第一词频阈值,则确定所述领域候选词不属于新词。
  8. 根据权利要求1所述的文本挖掘方法,所述根据所述新词获取关联文本,包括:
    通过搜索引擎获取所述新词所对应的搜索反馈结果,其中,所述搜索反馈结果包括至少一条搜索结果;
    根据所述新词对应的搜索反馈结果,从所述至少一条搜索结果中将相关度最高的前R条搜索结果确定为所述新词所对应的关联文本,其中,所述R为大于或等于1的整数。
  9. 根据权利要求1至8中任一项所述的文本挖掘方法,所述方法还包括:
    获取所述领域种子词;
    根据所述关联文本,确定所述领域种子词的平均词频;
    若所述平均词频大于或等于第二词频阈值,则确定所述领域种子词满足所述领域新词挖掘条件。
  10. 根据权利要求1至8中任一项所述的文本挖掘方法,所述方法还包括:
    获取所述领域种子词;
    根据所述关联文本,确定所述领域种子词的最大词频;
    若所述最大词频大于或等于第二词频阈值,则确定所述领域种子词满足所述领域新词挖掘条件。
  11. 根据权利要求1所述的文本挖掘方法,所述方法还包括:
    获取K组领域候选词样本,其中,每组领域候选词样本包括领域候选词正样本以及领域候选词负样本,所述领域候选词正样本来源于正样本池,所述领域候选词负样本来源于负样本池,所述K为大于或等于1的整数;
    根据所述K组领域候选词样本获取K组领域候选词样本特征,其中,所述领域候选词样本特征与所述领域候选词样本具有一一对应的关系,每个领域候选词样本特征包括所述领域候选词正样本所对应的领域候选词样本特征以及所述领域候选词负样本所对应的领域候选词样本特征;
    基于所述K组领域候选词样本特征,通过待训练文本分值预估模型获取K组预测结果,其中,所述预测结果与所述领域候选词样本特征具有一一对应的关系,每组预测结果中包括所述领域候选词正样本的预测标签以及所述领域候选词负样本的预测标签;
    根据所述K组预测结果以及所述K组领域候选词样本,对所述待训练文本分值预估模型进行训练,直至满足模型训练条件,输出文本分值预估模型;
    所述根据所述领域候选词特征,获取所述领域候选词所对应的词质量分值,包括:
    基于所述领域候选词特征,通过所述文本分值预估模型获取所述领域候选词所对应的词质量分值。
  12. 根据权利要求11所述的文本挖掘方法,所述获取K组领域候选词样本之前,所述方法还包括:
    从领域语料库中获取语句;
    对所述语句进行分词处理,得到文本序列;
    根据所述文本序列获取待训练领域候选词样本;
    若所述待训练领域候选词样本命中通用词语库,则确定所述待训练领域候选词样本属于领域候选词正样本,并将所述待训练领域候选词样本添加至所述正样本池;
    若所述待训练领域候选词样本未命中所述通用词语库,则确定所述待训练领域候选词样本属于领域候选词负样本,并将所述待训练领域候选词样本添加至所述负样本池。
  13. 一种文本挖掘装置,包括:
    获取模块,用于获取领域候选词所对应的领域候选词特征;
    所述获取模块,还用于根据所述领域候选词特征,获取所述领域候选词所对应的词质量分值;
    确定模块,用于根据所述领域候选词所对应的词质量分值确定新词;
    所述获取模块,还用于根据所述新词获取关联文本;
    所述确定模块,还用于若根据所述关联文本确定领域种子词满足领域新词挖掘条件,则确定所述领域种子词为领域新词。
  14. 一种计算机设备,包括:存储器、收发器、处理器以及总线系统;
    其中,所述存储器用于存储程序;
    所述处理器用于执行所述存储器中的程序,包括执行如权利要求1至12中任一项所述的文本挖掘方法;
    所述总线系统用于连接所述存储器以及所述处理器,以使所述存储器以及所述处理器进行通信。
  15. 一种计算机可读存储介质,所述存储介质用于存储计算机程序,所述计算机程序用于执行如权利要求1至12中任一项所述的文本挖掘方法。
  16. 一种包括指令的计算机程序产品,当其在计算机上运行时,使得所述计算机执行权利要求1至12中任一项所述的文本挖掘方法。