CN110457708A

CN110457708A - Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence

Info

Publication number: CN110457708A
Application number: CN201910760785.5A
Authority: CN
Inventors: 王朔遥
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-08-16
Filing date: 2019-08-16
Publication date: 2019-11-15
Anticipated expiration: 2039-08-16
Also published as: CN110457708B

Abstract

The present invention provides an artificial-intelligence-based vocabulary mining method, apparatus, server and storage medium, and belongs to the field of artificial intelligence. The method includes: obtaining a first text sample, a second text sample and a theme dictionary; determining at least one first word according to the first text sample, the second text sample and a first text identification model; inputting the first text sample and the second text sample into a second text identification model, and determining at least one second word according to the output of the second text identification model; and determining at least one new word based on the first text sample, the first word, the second word and the theme dictionary. Because words are mined from several angles, including word frequency, solidification degree, freedom degree and word co-occurrence, the new words determined are accurate and the accuracy is stable. Because supervised training on a large amount of labeled data is not required, manpower and material resources are saved.

Description

Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence

Technical field

The present invention relates to the field of artificial intelligence, and in particular to an artificial-intelligence-based vocabulary mining method, apparatus, server and storage medium.

Background art

In the field of artificial intelligence, natural language processing is an important research direction that studies the theories and methods enabling effective communication between humans and computers in natural language. The key difficulty of natural language processing lies in achieving semantic understanding and text analysis, and the foundation of both is the construction of a theme dictionary. Building a theme dictionary is, in turn, the process of discovering new words through vocabulary mining.

At present, with the continued development of machine learning, a variety of deep-learning-based techniques for mining vocabulary to obtain new words exist in the field of natural language processing. Deep-learning-based vocabulary mining typically obtains a large amount of sample data and analyzes it comprehensively under the guidance of a supervised algorithm, in order to determine the rules by which new words arise and change. For example, indicators such as changes in word combination, the consistency of the co-occurring word distribution, and sentiment orientation are examined to track how new Internet words change.

Deep-learning-based vocabulary mining requires a large amount of labeled data for supervised training, and data labeling consumes substantial manpower and material resources. In addition, the accuracy of the mining result is positively correlated with the accuracy of the data labels. As a result, the technique depends heavily on data labeling and its accuracy is unstable.

Summary of the invention

Embodiments of the present invention provide an artificial-intelligence-based vocabulary mining method, apparatus, server and storage medium, to solve the problem that current deep-learning-based vocabulary mining requires a large amount of labeled data for supervised training, that data labeling consumes substantial manpower and material resources, and that the accuracy of the mining result is positively correlated with the accuracy of the data labels, so that the dependence on data labeling is heavy and the accuracy is unstable. The technical solution is as follows:

In one aspect, an artificial-intelligence-based vocabulary mining method is provided, the method including:

obtaining a first text sample, a second text sample and a theme dictionary, where the first text sample is a text sample that corresponds to a target topic and includes words to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes multiple words belonging to the target topic;

determining at least one first word according to the first text sample, the second text sample and a first text identification model, where a first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose freedom degree is lower than a target freedom degree;

inputting the first text sample and the second text sample into a second text identification model, and determining at least one second word according to the output of the second text identification model, where a second word is a word that is a keyword in the first text sample and a non-keyword in the second text sample; and

determining at least one new word based on the first text sample, the first word, the second word and the theme dictionary.

In another aspect, an artificial-intelligence-based vocabulary mining apparatus is provided, the apparatus including:

an obtaining module, configured to obtain a first text sample, a second text sample and a theme dictionary, where the first text sample is a text sample that corresponds to a target topic and includes words to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes multiple words belonging to the target topic;

a determining module, configured to determine at least one first word according to the first text sample, the second text sample and a first text identification model, where a first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose freedom degree is lower than a target freedom degree;

the determining module being further configured to input the first text sample and the second text sample into a second text identification model, and determine at least one second word according to the output of the second text identification model, where a second word is a word that is a keyword in the first text sample and a non-keyword in the second text sample; and

the determining module being further configured to determine at least one new word based on the first text sample, the first word, the second word and the theme dictionary.

In a possible implementation, the determining module is further configured to segment the first text sample to obtain at least one third word, and segment the second text sample to obtain at least one fourth word; use the at least one third word and the at least one fourth word as input data of the first text identification model; and determine the at least one first word according to the output of the first text identification model.

In another possible implementation, the determining module is further configured to input the first text sample and the second text sample into the second text identification model, and construct a graph network structure of the text based on the algorithm implemented by the second text identification model; obtain, according to the graph network structure, at least one first keyword from the first text sample and at least one second keyword from the second text sample; delete from the at least one first keyword any word that duplicates the at least one second keyword; and use the remaining at least one first keyword as the at least one second word.

In another possible implementation, the determining module is further configured to segment the first text sample using the seed words in the theme dictionary, the first word and the second word as a dictionary, to obtain multiple fifth words; perform clustering according to the word vectors of the multiple fifth words and the word vectors of the seed words; and determine at least one new word according to the clustering result.

In another possible implementation, the determining module is further configured to initialize a connectivity clustering model according to the word vectors of the seed words; and perform connectivity clustering on the word vectors of the multiple fifth words by means of similarity propagation.

In another possible implementation, the determining module is further configured to, for the word vectors of the multiple fifth words, connect any two word vectors whose distance is less than a target distance; when the word vector of any fifth word is directly connected to the word vector of a seed word, place the fifth word and the seed word in the same class; and when the word vector of any fifth word is indirectly connected to the word vector of a seed word through the word vectors of other fifth words, determine the indirect similarity between the fifth word and the seed word according to the shortest path, and if the indirect similarity is not less than a target similarity, place the fifth word and the seed word in the same class.

In alternatively possible implementation, the determining module is also used to select to meet from the cluster result At least one noun and verb of target topic, as the neologisms of the target topic, the target topic is the descriptor Theme belonging to library.

On the other hand, a kind of server is provided, the server includes processor and memory, and the memory is used for Store program code, said program code loaded by the processor and executed with realize in the embodiment of the present invention based on artificial Performed operation in the vocabulary mining method of intelligence.

On the other hand, a kind of storage medium is provided, program code, said program code are stored in the storage medium For executing the vocabulary mining method based on artificial intelligence in the embodiment of the present invention.

Technical solution provided in an embodiment of the present invention has the benefit that

In embodiments of the present invention, by being determined according to the first text identification model in the first samples of text medium-high frequency and tool There is the first vocabulary of high solidification degree and low degree-of-freedom, and is determined in the first samples of text according to the second text identification model as pass It is new to obtain at least one based on first samples of text, the first vocabulary, the second vocabulary and theme dictionary for second vocabulary of keyword Word.Due to excavating vocabulary from different angles such as word frequency, solidification degree, freedom degree and phrase cooccurrence relations, so that it is determined that At least one neologisms accuracy is high and stablizes, and carries out Training due to not needing a large amount of labeled data, so as to The manpower and material resources of saving.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a structural block diagram of an artificial-intelligence-based vocabulary mining system 100 provided by an embodiment of the present invention;

Fig. 2 is a flowchart of an artificial-intelligence-based vocabulary mining method provided by an embodiment of the present invention;

Fig. 3 is a flowchart of an artificial-intelligence-based vocabulary mining method provided by an embodiment of the present invention;

Fig. 4 is a block diagram of an artificial-intelligence-based vocabulary mining apparatus provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a server provided according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods that are consistent with some aspects of the present invention as detailed in the appended claims.

The embodiments of the present invention mainly concern vocabulary mining scenarios. For example, in a natural language processing task, for a given topic, vocabulary mining is performed on a topic corpus and a control corpus to obtain new words and thereby expand an existing theme dictionary. The topic corpus is a text library related to the topic under study; its sources may include transfer texts, question texts, search texts, dialogue texts and the like. The control corpus is a text library of other topics similar to the topic under study. For example, when the topic corpus consists of merchant transfer texts, the control corpus may consist of other transfer texts; when the topic corpus consists of search texts of a certain professional field, the control corpus may consist of search texts of a related technical field.

The main flow of vocabulary mining in the embodiments of the present invention is described below:

First, a first text sample related to a given topic and a second text sample serving as a control are obtained, along with an existing theme dictionary. The words in the existing theme dictionary are all topic-related words mined previously, and the new words mined by the embodiments of the present invention can be used to expand the theme dictionary. Second, first words and second words are obtained according to the first text sample, the second text sample and the theme dictionary. Third, clustering is performed based on the first words, the second words and the seed words in the theme dictionary. Finally, at least one new word is determined from the clustering result.

The technologies that may be used in the embodiments of the present invention are briefly introduced below:

Natural language processing. In the field of artificial intelligence, natural language processing (Natural Language Processing, NLP) is an important research direction that studies the theories and methods enabling effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Natural language processing technology generally includes text processing, semantic understanding, machine translation, question-answering robots, knowledge graphs and other technologies.

PMI (Pointwise Mutual Information). The PMI algorithm is one of the best algorithms for measuring the relatedness of words; its main focus is the binding strength between strings (that is, freedom degree and solidification degree).
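As a point of reference, the standard definition is PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The following minimal sketch computes it for adjacent token pairs. The function name and the choice of estimating probabilities from raw unigram and bigram counts are assumptions made for illustration; the input is assumed to contain at least two tokens.

```python
import math
from collections import Counter


def pmi(tokens, x, y):
    """PMI (base 2) of the adjacent pair (x, y) in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[(x, y)] / (len(tokens) - 1)
    if p_xy == 0:
        return float("-inf")  # the pair never occurs side by side
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))
```

A positive value means the two tokens occur together more often than independence would predict, which is the sense in which PMI reflects binding strength.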

TF-IDF (term frequency-inverse document frequency). TF-IDF is a weighting technique commonly used in information retrieval and text mining to assess how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to the frequency with which it appears across the corpus.
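A minimal sketch of this weighting follows, assuming the common variant tf = count / document length and idf = ln(N / document frequency). The function name is hypothetical, and other smoothing variants exist.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document receives weight 0, which is how the measure discounts words that are common across the whole corpus.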

Word2vec (word to vector). Word2vec is a group of related models used to generate word vectors, created by a research team led by Tomas Mikolov at Google. The present invention uses this algorithm to compute the semantic and structural similarity between strings.

Connectivity clustering. A typical representative is the hierarchical clustering algorithm, which builds clusters according to the connectivity between samples, so that all connected samples belong to the same cluster. During clustering, a cluster (class) is built outward from a seed node according to similarity: the current cluster is expanded by selecting the out-of-cluster word with the highest similarity, until the similarity between the words outside the cluster and the current cluster falls below a threshold.
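The seed-driven expansion described above can be sketched as a greedy loop. This is an illustration under stated assumptions: the similarity between a word and the current cluster is taken to be the highest cosine similarity to any cluster member (single linkage), and the function names are hypothetical. Vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def expand_cluster(vectors, seeds, threshold):
    """Grow a cluster from `seeds`; returns members in the order they were added."""
    cluster = list(seeds)
    outside = [w for w in vectors if w not in seeds]
    while outside:
        # ASSUMPTION: word-to-cluster similarity is the best match to any member.
        best_word, best_sim = None, -1.0
        for w in outside:
            sim = max(cosine(vectors[w], vectors[m]) for m in cluster)
            if sim > best_sim:
                best_word, best_sim = w, sim
        if best_sim < threshold:
            break  # every remaining word is too dissimilar to the cluster
        cluster.append(best_word)
        outside.remove(best_word)
    return cluster
```

Because each newly added member can pull in its own neighbours, similarity effectively propagates outward from the seeds.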

Chinese word segmentation. Segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification. Existing segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. Depending on whether segmentation is combined with part-of-speech tagging, they can also be divided into pure segmentation methods and integrated methods that combine segmentation with tagging.

TextRank algorithm. TextRank is a graph-based ranking algorithm for text. Its basic idea derives from Google's PageRank algorithm: the text is split into component units (words, sentences), a graph model is built, and a voting mechanism is used to rank the important components of the text, so that keyword extraction and summarization can be achieved using only the information in a single document. Unlike models such as LDA and HMM, TextRank does not require prior training on multiple documents, and it is widely used because it is simple and effective. The present invention uses this algorithm to discover new words from the angle of string co-occurrence.
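A minimal keyword-extraction sketch in the spirit of TextRank follows. The window size, damping factor, iteration count and function name are illustrative assumptions; the graph is unweighted and undirected, which is one common simplification.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by iterating PageRank-style votes over a co-occurrence graph."""
    # Link every pair of distinct words that occur within `window` positions.
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                neighbors[w].add(tokens[j])
                neighbors[tokens[j]].add(w)

    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        updated = {}
        for w in neighbors:
            # Each neighbour splits its current score evenly among its links.
            votes = sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            updated[w] = (1 - damping) + damping * votes
        scores = updated

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

No training corpus is involved: the ranking depends only on the co-occurrence structure of the single token sequence passed in.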

Fig. 1 is a structural block diagram of an artificial-intelligence-based vocabulary mining system 100 provided by an embodiment of the present invention. Referring to Fig. 1, the system includes multiple terminals 101 and a vocabulary mining platform 102.

The terminal 101 is connected to the vocabulary mining platform 102 through a wireless or wired network. The terminal 101 may be at least one of a smartphone, a desktop computer, a tablet computer and a laptop computer. Taking the application of the system 100 to the merchant transfer topic as an example, an application that supports transfers is installed and running on the terminal 101. The application may be a financial application, an instant messaging application, a social application or the like. The terminal 101 may be a terminal used by a merchant user, and the account of the merchant user is logged in to the application running on the terminal 101.

The vocabulary mining platform 102 includes at least one of a single server, multiple servers and a cloud computing platform. The vocabulary mining platform can collect transfer data when users make transfers through the terminal 101, where the transfer data obtained is data that the users have authorized.

Optionally, the vocabulary mining platform 102 includes an access server, a vocabulary mining server and a database. The access server provides access services for the terminal 101. The vocabulary mining server provides vocabulary mining services. The database stores text samples, theme dictionaries and the like. There may be one or more vocabulary mining servers. When there are multiple vocabulary mining servers, at least two of them provide different services, and/or at least two of them provide the same service, for example in a load-balancing manner; the embodiments of this application do not limit this. The first text identification model and the second text identification model may be deployed in the vocabulary mining server. In the embodiments of this application, the first text identification model is used to determine high-frequency solidified words, and the second text identification model is used to determine keywords.

Fig. 2 is a flowchart of an artificial-intelligence-based vocabulary mining method provided by an embodiment of the present invention. As shown in Fig. 2, the embodiment is described using application to a server as an example. The artificial-intelligence-based vocabulary mining method includes the following steps:

201. The server obtains a first text sample, a second text sample and a theme dictionary, where the first text sample is a text sample that corresponds to a target topic and includes words to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes multiple words belonging to the target topic.

In this step, the server may obtain, according to the given target topic, text samples that include words to be mined as the first text sample. For example, if the given target topic is merchants, the server may use the transfer texts of users labeled as merchants as the first text sample. Further, the first text sample may be text collected by the server over a target time period, such as the transfer texts of the most recent month or of the most recent three months. A transfer text is the text entered by a merchant user when making a transfer; it may indicate the purpose of the transfer or a message to the payee, such as "XX payment for goods" or "XX merchandise".

In this step, the server may determine, according to the given target topic, other topics similar to it, and obtain the second text sample according to those topics. For example, when the first text sample consists of the transfer texts of merchant users, the second text sample may consist of the transfer texts of non-merchant users. Using the second text sample of a similar topic as a control for the first text sample allows the mined new words to be distinguished more effectively from similar topics, so that the new words are highly relevant to the target topic.

In this step, the server may obtain the theme dictionary corresponding to the target topic from the database according to the correspondence between topics and theme dictionaries. For example, the theme dictionary corresponding to the merchant topic is a merchant dictionary, which includes multiple merchant words. The words in the theme dictionary may have been obtained by other means, or by the method provided by the embodiments of the present invention.

202a. The server determines at least one first word according to the first text sample, the second text sample and the first text identification model, where a first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose freedom degree is lower than a target freedom degree.

In this step, the server may preprocess the first text sample and the second text sample; the preprocessing may be segmentation of the first text sample and the second text sample. The first text identification model may be a PMI-TF-IDF model, which includes a first algorithm module implementing the PMI algorithm and a second algorithm module implementing the TF-IDF algorithm.

In an optional implementation, the server preprocesses the first text sample and the second text sample and determines at least one first word according to the first text identification model as follows: the server segments the first text sample to obtain at least one third word, and segments the second text sample to obtain at least one fourth word. The server then uses the at least one third word and the at least one fourth word as input data of the first text identification model, and determines at least one first word according to the output of the first text identification model.

In this step, the server may use the first algorithm module, which implements the PMI algorithm, to extract from the at least one third word at least one word whose solidification degree is higher than the target solidification degree and whose freedom degree is lower than the target freedom degree, and likewise to extract from the at least one fourth word at least one word whose solidification degree is higher than the target solidification degree and whose freedom degree is lower than the target freedom degree.

Here, solidification degree refers to how tightly the characters within a word are bound to each other: words such as "coloured glaze" and "durian" have a high solidification degree, whereas words such as "child" and "combination" have a relatively low one. Freedom degree refers to the degree to which the characters within a word can be used freely on their own. For example, the fragments of the word "chocolate" (rendered in translation as "chalk" and "gram force") have a high solidification degree and an almost zero degree of free use; that is, neither fragment can stand alone as a word. The target solidification degree and the target freedom degree can be set according to actual requirements, and the embodiments of the present invention do not specifically limit them.
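The text above does not give formulas for the two measures, so the following is only one possible reading, sketched under explicit assumptions. Solidification is taken as the smallest ratio p(word) / (p(left) p(right)) over all two-part splits of the word, and freedom is taken as the largest share of any fragment's occurrences that fall outside the word. Both function names and both formulas are hypothetical. The candidate word is assumed to have at least two characters and to occur at least once in the text.

```python
def count(text, s):
    """Number of (possibly overlapping) occurrences of s in text."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def solidification(text, word):
    """ASSUMED formula: min over splits of p(word) / (p(left) * p(right))."""
    n = len(text)
    p_word = count(text, word) / n
    return min(
        p_word / ((count(text, word[:k]) / n) * (count(text, word[k:]) / n))
        for k in range(1, len(word))
    )


def freedom(text, word):
    """ASSUMED formula: largest share of a fragment's uses outside `word`."""
    c_word = count(text, word)
    fragments = [word[:k] for k in range(1, len(word))]
    fragments += [word[k:] for k in range(1, len(word))]
    return max(1 - c_word / count(text, frag) for frag in fragments)
```

Under this reading, a word whose fragments almost never appear anywhere else scores close to 0 on freedom, matching the "chocolate" example, and a candidate would be kept when `solidification` exceeds the target solidification degree and `freedom` is below the target freedom degree.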

In this step, the server may use the second algorithm module, which implements the TF-IDF algorithm, to extract from the at least one third word the high-frequency words whose word frequency in the first text sample is higher than the first word frequency, and to extract from the at least one fourth word the low-frequency words whose word frequency in the second text sample is lower than the second word frequency. The server may then select at least one word that appears in both the high-frequency words and the low-frequency words. The first word frequency and the second word frequency can be set according to actual requirements, and the embodiments of the present invention do not specifically limit them.
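The high-frequency and low-frequency selection can be sketched as a set intersection. As a simplification, this example uses plain relative term frequency in place of a full TF-IDF weight; the function and parameter names are hypothetical.

```python
from collections import Counter


def topic_specific_words(third_words, fourth_words, first_freq, second_freq):
    """Words frequent in the first sample and infrequent in the second.

    third_words / fourth_words: token lists from segmenting each sample.
    first_freq / second_freq: relative-frequency thresholds (assumed scale).
    """
    tf1 = Counter(third_words)
    tf2 = Counter(fourth_words)
    high = {w for w, c in tf1.items() if c / len(third_words) > first_freq}
    low = {w for w, c in tf2.items() if c / len(fourth_words) < second_freq}
    # Keep only words that fall in both groups, as the step describes.
    return high & low
```

Following the wording of the step literally, a word that never occurs in the second sample is not among the fourth words and is therefore not returned; treating such words as having zero frequency would be a different design choice.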

It should be noted that the server may first use the first algorithm module to extract words with a high solidification degree and a low freedom degree, and then use the second algorithm module to extract the high-frequency and low-frequency words from the extracted words, thereby obtaining at least one first word. Alternatively, the server may first use the second algorithm module to extract the high-frequency and low-frequency words, and then use the first algorithm module to extract the words with a high solidification degree and a low freedom degree, thereby obtaining at least one first word. The server may also extract words with the first algorithm module and the second algorithm module separately and then take the intersection of the extracted words, thereby obtaining at least one first word. The server may also extract words with the first algorithm module and the second algorithm module simultaneously and obtain at least one first word directly.

202b. The server uses the first text sample and the second text sample as input data of the second text identification model, and determines at least one second word according to the output of the second text identification model, where a second word is a word that is a keyword in the first text sample and a non-keyword in the second text sample.

In this step, the second text identification model may be a TextRank model, which includes a third algorithm module implementing the TextRank algorithm. The server may determine at least one second word according to the third algorithm module.

In an optional implementation, the server determines at least one second word according to the third algorithm module as follows: the server inputs the first text sample and the second text sample into the second text identification model, and constructs a graph network structure of the text based on the third algorithm module included in the second text identification model, that is, the module implementing the TextRank algorithm. According to the graph network structure, the server obtains at least one first keyword from the first text sample and at least one second keyword from the second text sample. The server then deletes from the at least one first keyword any word that duplicates the at least one second keyword, and uses the remaining at least one first keyword as the at least one second word.
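The final filtering in this step amounts to an order-preserving set difference. A minimal sketch follows, with a hypothetical function name and keyword lists assumed to come from a keyword extractor run on each sample:

```python
def select_second_words(first_keywords, second_keywords):
    """Keep keywords of the first sample that are not keywords of the second."""
    excluded = set(second_keywords)
    return [w for w in first_keywords if w not in excluded]
```

Preserving the original order keeps any ranking that the keyword extractor produced for the first sample.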

In this step, the server may store the graph network structure in a Node2Node (point-to-point) structure. When the first text sample and/or the second text sample contain a large amount of content, the third algorithm module needs a large amount of memory to construct the graph network structure when running the TextRank algorithm, so the server may optimize the way the graph network structure is stored.

In an optional implementation, the server may optimize the storage of the graph network structure from the Node2Node structure to an EdgeList (effective edge) structure; that is, the graph network structure described above may be a graph network structure based on effective edges.
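The memory saving can be illustrated as follows, under the assumption that "Node2Node" denotes a dense node-by-node weight matrix and "EdgeList" denotes a list holding only the edges that actually exist. The text does not define either structure precisely, so this interpretation and the function name are hypothetical.

```python
def to_edge_list(matrix):
    """Convert a symmetric node-by-node weight matrix into (i, j, weight) triples.

    Only non-zero entries above the diagonal are kept, so storage grows with
    the number of edges instead of with the square of the number of nodes.
    """
    return [
        (i, j, weight)
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if j > i and weight != 0
    ]
```

For a co-occurrence graph over a large vocabulary, most word pairs never co-occur, so the edge list is typically far smaller than the full matrix.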

It should be noted that step 202a and step 202b may be performed simultaneously, held after step 202a can also be first carried out Row step 202b executes step 202a after can also first carrying out step 202b, and it is not limited in the embodiment of the present invention.

203, server carries out the first samples of text according to seed words, the first vocabulary and the second vocabulary in theme dictionary Participle, obtains multiple 5th vocabulary.

In this step, the server may obtain at least one seed word from the theme dictionary. A seed word is a word in the theme dictionary whose weight is not less than a target weight threshold.

In an optional implementation, the step in which the server obtains at least one seed word from the theme dictionary may be as follows: the server determines the weight of each word in the theme dictionary and, when the weight of any word is not less than the target weight threshold, takes that word as a seed word.
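The seed word selection can be sketched as a simple threshold filter. The representation of the theme dictionary as a mapping from word to weight, the function name `select_seed_words` and the example weights are assumptions made for illustration.

```python
def select_seed_words(theme_dictionary, target_weight_threshold):
    """Return the words whose weight is not less than the threshold."""
    return [
        word
        for word, weight in theme_dictionary.items()
        if weight >= target_weight_threshold
    ]


# Hypothetical weights; a word exactly at the threshold is kept.
weights = {"tea": 0.9, "shop": 0.4, "leaf": 0.6}
print(select_seed_words(weights, 0.6))  # ['tea', 'leaf']
```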

Of course, in this embodiment the server may also obtain a single seed word. The server may obtain the at least one seed word directly after obtaining the theme dictionary, or it may obtain the at least one seed word when performing this step; the embodiments of the present invention do not specifically limit this. When the theme dictionary is updated, the server may obtain the seed words again.

In this step, because the first words and the second words are words that the server obtained through the first text identification model and the second text identification model, the server may store them in a candidate dictionary. Using the words in the candidate dictionary together with the seed words in the theme dictionary as the segmentation dictionary, the server segments the first text sample and obtains a plurality of fifth words. Because the first words, the second words and the seed words serve as the dictionary when the first text sample is segmented, the resulting fifth words fit the theme better and are closer to natural language.
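The text does not name the segmentation algorithm. As one possible illustration, the sketch below uses forward maximum matching, a common dictionary-based technique chosen here as an assumption; the function name `segment` and the example dictionary are hypothetical.

```python
def segment(text, dictionary):
    """Segment text by forward maximum matching against a dictionary.

    At each position the longest dictionary entry is taken; a character
    that starts no entry is emitted on its own.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Candidate words and seed words are merged into one segmentation dictionary.
candidate_and_seed_words = {"茶叶", "上新"}
print(segment("茶叶上新了", candidate_and_seed_words))  # ['茶叶', '上新', '了']
```

Because the mined candidates are in the dictionary, they are kept whole during segmentation instead of being split into single characters.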

In this step, the server may vectorize the fifth words with a Word2vec model to obtain a word vector for each fifth word.

In an optional implementation, the server may adjust the parameters of the Word2vec model so that the model suits the text samples better. For example, the parameter window-size (window size) has a default value of 10, which suits long text; the window is the maximum distance within a sentence between the current word and a predicted word. When window-size is increased, the word vectors capture more topical similarity; when it is decreased, they capture more functional and syntactic similarity. As another example, the parameter vector-size (vector size) has a default value of 100. When vector-size is increased, the accuracy of the similarity between words decreases, while the dimensionality of the similarity increases.

For example, when the server processes a merchant theme, the merchants' transfer texts are mostly short and fragmented, so the default value of window-size may not suit them. The server may therefore increase window-size to 25 to process the merchants' transfer texts.

As another example, consider the parameter vector-size. When vector-size is small, "Teddy" (the dog breed) is more likely to be associated with "dog food" and "leash"; as vector-size gradually increases, "Teddy" begins to be associated with words such as "Persian cat" and "pet", and its similarity to words such as "dog food" and "leash" decreases. Therefore, when processing a merchant theme, the server may determine the value of vector-size from the total number of words in the text samples. The value may be sqrt(|V|)/2, where sqrt() denotes the square root and |V| denotes the total number of words.
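The sqrt(|V|)/2 rule can be computed as in the sketch below. Rounding to the nearest integer and the lower bound of 1 are assumptions added so that the result is usable as a dimension; the function name `suggested_vector_size` is hypothetical. In the gensim library, the comparable Word2Vec parameters are named `vector_size` and `window`, though the text does not state which Word2vec implementation is used.

```python
import math


def suggested_vector_size(vocabulary_size):
    """Return sqrt(|V|) / 2 rounded to an integer, and at least 1."""
    return max(1, round(math.sqrt(vocabulary_size) / 2))


# With 40,000 distinct words the rule gives sqrt(40000) / 2 = 100.
print(suggested_vector_size(40000))  # 100
```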

204. The server performs clustering according to the word vectors of the fifth words and the word vectors of the seed words.

In this step, the server may perform the clustering with a connectivity clustering model, and the word vectors of the seed words may be used to initialize that model. A word vector is the representation of a natural-language word in word space, and the distance between two word vectors represents the similarity between the corresponding words.

In an optional implementation, the step in which the server clusters according to the word vectors of the fifth words and of the seed words may be as follows: the server initializes the connectivity clustering model according to the seed words, and then performs connectivity clustering on the word vectors of the fifth words by means of similarity propagation.

Specifically, the step in which the server performs connectivity clustering on the word vectors of the fifth words by means of similarity propagation may be as follows. Among the word vectors of the fifth words, the server connects any two word vectors whose distance is less than a target distance. When the word vector of a fifth word is directly connected to the word vector of a seed word, the server treats that fifth word and the seed word as the same class and assigns them to the same cluster. When the word vector of a fifth word is indirectly connected to the word vector of a seed word through the word vectors of other fifth words, the server determines the indirect similarity between that fifth word and the seed word according to the shortest path; if the indirect similarity is not less than a target similarity, the server treats that fifth word and the seed word as the same class and assigns them to the same cluster. The target distance and the target similarity may be set according to actual needs, and the embodiments of the present invention do not specifically limit them. The indirect similarity is determined by the propagation path between the word and the seed word and by an attenuation coefficient.

For example, suppose the target similarity is 0.75 and the target distance is 2, so that the server connects any two word vectors whose distance is less than 2. The seed word Z is directly connected to fifth words A and B; A is connected to C and D; B is connected to E; and D is connected to F. That is, C and D are connected to Z through A, E is connected to Z through B, and F is connected to Z through D and A. The distance from D to Z via A is 3, and the indirect similarity between D and Z determined with the attenuation coefficient is 0.8. In the same way, the indirect similarity between C and Z is 0.78 and that between E and Z is 0.83. The shortest path from F to Z is 5, and the indirect similarity between F and Z determined with the attenuation coefficient is 0.6. Since 0.6 is below the target similarity of 0.75, F is excluded, and A, B, C, D, E and Z are assigned to the same cluster.
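The connectivity clustering described above can be sketched as follows. The text does not give the formula for the indirect similarity, so this sketch assumes a hypothetical one: each hop beyond the first multiplies the similarity by an attenuation coefficient `decay`. The Euclidean distance, the hop-count shortest path and the name `connectivity_cluster` are likewise assumptions, and the numbers below are not those of the example in the text.

```python
import math
from collections import deque


def connectivity_cluster(vectors, seed, target_distance, target_similarity, decay=0.8):
    """Return the set of words clustered with the seed word."""
    names = list(vectors)
    # Connect every pair of word vectors closer than the target distance.
    graph = {name: [] for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1 :]:
            if math.dist(vectors[a], vectors[b]) < target_distance:
                graph[a].append(b)
                graph[b].append(a)
    # Breadth-first search yields the shortest hop count from the seed.
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    # Direct neighbors always join; indirect ones join only if the assumed
    # attenuated similarity decay ** (hops - 1) reaches the target.
    return {
        word
        for word, h in hops.items()
        if h <= 1 or decay ** (h - 1) >= target_similarity
    }


# One-dimensional toy vectors: Z - A - C - F form a chain, X is isolated.
toy = {"Z": (0.0,), "A": (1.0,), "C": (2.0,), "F": (3.0,), "X": (10.0,)}
print(sorted(connectivity_cluster(toy, "Z", 1.5, 0.75)))  # ['A', 'C', 'Z']
```

Here C is two hops from Z and keeps a similarity of 0.8, which meets the target of 0.75, while F is three hops away with 0.64 and is excluded; X is never reached.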

205. The server determines at least one new word according to the clustering result.

In this step, the server may take the clustering result directly as the new words, thereby determining at least one new word; alternatively, the server may screen the clustering result to determine at least one new word.

In an optional implementation, the step in which the server screens the clustering result to determine at least one new word may be as follows: the server selects from the clustering result at least one noun and verb that fit the target topic as the new words of the target topic, the target topic being the topic to which the theme dictionary belongs.

For example, the server may remove generic words from the clustering result, such as personal names, addresses, greetings, dates and words with special meanings, so as to retain the nouns and verbs that fit the target topic, such as "tea leaves", "new style" and "new arrival".
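The screening step can be sketched as a filter over the clustering result. The text does not say how parts of speech or generic words are identified, so this sketch assumes that part-of-speech tags and a list of generic words are supplied from elsewhere; the function name `filter_new_words`, the tag labels and the example data are hypothetical.

```python
def filter_new_words(cluster, pos_tags, generic_words):
    """Keep nouns and verbs from the cluster that are not generic words."""
    return [
        word
        for word in cluster
        if pos_tags.get(word) in {"noun", "verb"} and word not in generic_words
    ]


# Hypothetical cluster members with externally supplied tags.
cluster = ["tea", "Zhang San", "launch", "2019-08-16", "quickly"]
tags = {
    "tea": "noun",
    "Zhang San": "noun",
    "launch": "verb",
    "2019-08-16": "noun",
    "quickly": "adverb",
}
generic = {"Zhang San", "2019-08-16"}  # a personal name and a date
print(filter_new_words(cluster, tags, generic))  # ['tea', 'launch']
```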

It should be noted that steps 203 to 204 above are one optional implementation in which the server performs clustering based on the first text sample, the first words, the second words and the theme dictionary. The server may also perform clustering with a graph-network learning method that uses the entire neighborhood information; the present invention does not specifically limit this.

It should also be noted that, to make the process of performing steps 201 to 205 clearer, reference may be made to Fig. 3, which is a flowchart of an artificial-intelligence-based vocabulary mining method provided in an embodiment of the present invention. The flow is divided into four parts: the first part corresponds to step 201, the second part to steps 202a and 202b, the third part to steps 203 and 204, and the fourth part to step 205.

Fig. 4 is a block diagram of an artificial-intelligence-based vocabulary mining apparatus provided according to an exemplary embodiment. The apparatus is configured to perform the steps of the artificial-intelligence-based vocabulary mining method described above. Referring to Fig. 4, the apparatus includes an obtaining module 401 and a determining module 402.

The obtaining module 401 is configured to obtain a first text sample, a second text sample and a theme dictionary, where the first text sample is a text sample that corresponds to a target topic and includes the words to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes a plurality of words belonging to the target topic.

The determining module 402 is configured to determine at least one first word according to the first text sample, the second text sample and a first text identification model, where a first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose freedom degree is lower than a target freedom degree.

The determining module 402 is further configured to input the first text sample and the second text sample into a second text identification model and to determine at least one second word according to the output of the second text identification model, where a second word is a word that is a keyword in the first text sample and a non-keyword in the second text sample.

The determining module 402 is further configured to determine at least one new word based on the first text sample, the first words, the second words and the theme dictionary.

In one possible implementation, the determining module 402 is further configured to: segment the first text sample to obtain at least one third word, and segment the second text sample to obtain at least one fourth word; use the at least one third word and the at least one fourth word as input data of the first text identification model; and determine the at least one first word according to the output of the first text identification model.

In another possible implementation, the determining module 402 is further configured to: input the first text sample and the second text sample into the second text identification model, and construct a graph network structure of the text based on the algorithm implemented by the second text identification model; according to the graph network structure, obtain at least one first keyword from the first text sample and at least one second keyword from the second text sample; delete from the at least one first keyword any word that duplicates the at least one second keyword; and take the remaining first keywords as the at least one second word.

In another possible implementation, the determining module 402 is further configured to: segment the first text sample using the seed words in the theme dictionary, the first words and the second words as the dictionary, to obtain a plurality of fifth words; perform clustering according to the word vectors of the fifth words and the word vectors of the seed words; and determine at least one new word according to the clustering result.

In another possible implementation, the determining module 402 is further configured to: initialize a connectivity clustering model according to the word vectors of the seed words; and perform connectivity clustering on the word vectors of the fifth words by means of similarity propagation.

In another possible implementation, the determining module 402 is further configured to: among the word vectors of the fifth words, connect any two word vectors whose distance is less than a target distance; when the word vector of a fifth word is directly connected to the word vector of a seed word, treat that fifth word and the seed word as the same class; and when the word vector of a fifth word is indirectly connected to the word vector of a seed word through the word vectors of other fifth words, determine the indirect similarity between that fifth word and the seed word according to the shortest path and, if the indirect similarity is not less than a target similarity, treat that fifth word and the seed word as the same class.

In another possible implementation, the determining module 402 is further configured to select from the clustering result at least one noun and verb that fit the target topic as the new words of the target topic, the target topic being the topic to which the theme dictionary belongs.

It should be understood that, when the artificial-intelligence-based vocabulary mining apparatus provided in the above embodiment runs an application program, the division into the functional modules described above is only an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the vocabulary mining apparatus provided in the above embodiment and the vocabulary mining method embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.

Fig. 5 is a schematic structural diagram of a server provided in an embodiment of the present invention. The server may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPU) 501 and one or more memories 502, where at least one instruction is stored in the memory 502 and is loaded and executed by the processor 501 to implement the methods provided by the method embodiments above. Of course, the server may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may include other components for implementing device functions, which are not described here.

An embodiment of the present invention further provides a computer-readable storage medium applied to a server. Program code is stored in the computer-readable storage medium, and the program code is loaded and executed by a processor to implement the operations performed by the server in the artificial-intelligence-based vocabulary mining method of the above embodiments.

Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. An artificial-intelligence-based vocabulary mining method, characterized in that the method comprises:

obtaining a first text sample, a second text sample and a theme dictionary, wherein the first text sample is a text sample that corresponds to a target topic and includes words to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes a plurality of words belonging to the target topic;

determining at least one first word according to the first text sample, the second text sample and a first text identification model, wherein the first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose freedom degree is lower than a target freedom degree;

inputting the first text sample and the second text sample into a second text identification model, and determining at least one second word according to the output of the second text identification model, wherein the second word is a word that is a keyword in the first text sample and a non-keyword in the second text sample;

determining at least one new word based on the first text sample, the first word, the second word and the theme dictionary.

2. The method according to claim 1, characterized in that the determining at least one first word according to the first text sample, the second text sample and the first text identification model comprises:

segmenting the first text sample to obtain at least one third word, and segmenting the second text sample to obtain at least one fourth word;

using the at least one third word and the at least one fourth word as input data of the first text identification model;

determining the at least one first word according to the output of the first text identification model.

3. The method according to claim 1, characterized in that the inputting the first text sample and the second text sample into the second text identification model and determining at least one second word according to the output of the second text identification model comprises:

inputting the first text sample and the second text sample into the second text identification model, and constructing a graph network structure of the text based on the algorithm implemented by the second text identification model;

according to the graph network structure, obtaining at least one first keyword from the first text sample and at least one second keyword from the second text sample;

deleting from the at least one first keyword any word that duplicates the at least one second keyword;

taking the remaining at least one first keyword as the at least one second word.

4. The method according to claim 1, characterized in that the determining at least one new word based on the first text sample, the first word, the second word and the theme dictionary comprises:

segmenting the first text sample using seed words in the theme dictionary, the first word and the second word as a dictionary, to obtain a plurality of fifth words;

performing clustering according to the word vectors of the plurality of fifth words and the word vectors of the seed words;

determining at least one new word according to the clustering result.

5. The method according to claim 4, characterized in that the performing clustering according to the word vectors of the plurality of fifth words and the word vectors of the seed words comprises:

initializing a connectivity clustering model according to the word vectors of the seed words;

performing connectivity clustering on the word vectors of the plurality of fifth words by means of similarity propagation.

6. The method according to claim 5, characterized in that the performing connectivity clustering on the word vectors of the plurality of fifth words by means of similarity propagation comprises:

among the word vectors of the plurality of fifth words, connecting any two word vectors whose distance is less than a target distance;

when the word vector of any fifth word is directly connected to the word vector of a seed word, treating the fifth word and the seed word as the same class;

when the word vector of any fifth word is indirectly connected to the word vector of a seed word through the word vectors of other fifth words, determining the indirect similarity between the fifth word and the seed word according to the shortest path and, if the indirect similarity is not less than a target similarity, treating the fifth word and the seed word as the same class.

7. The method according to claim 4, characterized in that the determining at least one new word according to the clustering result comprises:

selecting from the clustering result at least one noun and verb that fit the target topic as the new words of the target topic, the target topic being the topic to which the theme dictionary belongs.

8. An artificial-intelligence-based vocabulary mining apparatus, characterized in that the apparatus comprises:

an obtaining module, configured to obtain a first text sample, a second text sample and a theme dictionary, wherein the first text sample is a text sample that corresponds to a target topic and includes words to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes a plurality of words belonging to the target topic;

a determining module, configured to determine at least one first word according to the first text sample, the second text sample and a first text identification model, wherein the first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose freedom degree is lower than a target freedom degree;

the determining module being further configured to input the first text sample and the second text sample into a second text identification model and to determine at least one second word according to the output of the second text identification model, wherein the second word is a word that is a keyword in the first text sample and a non-keyword in the second text sample;

the determining module being further configured to determine at least one new word based on the first text sample, the first word, the second word and the theme dictionary.

9. A server, characterized in that the server comprises a processor and a memory, the memory being configured to store program code, and the program code being loaded by the processor to perform the artificial-intelligence-based vocabulary mining method according to any one of claims 1 to 7.

10. A storage medium, characterized in that the storage medium is configured to store program code, the program code being used to perform the artificial-intelligence-based vocabulary mining method according to any one of claims 1 to 7.