CN110457708A - Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence
- Publication number
- CN110457708A (application CN201910760785.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- vocabulary
- samples
- identification model
- term vector
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention provides a vocabulary mining method, apparatus, server and storage medium based on artificial intelligence, and belongs to the field of artificial intelligence. The method includes: obtaining a first text sample, a second text sample and a theme dictionary; determining at least one first word according to the first text sample, the second text sample and a first text recognition model; inputting the first text sample and the second text sample into a second text recognition model, and determining at least one second word according to the output of the second text recognition model; and determining at least one new word based on the first text sample, the first word, the second word and the theme dictionary. Because vocabulary is mined from several angles, including word frequency, solidification degree, freedom degree and co-occurrence relations between phrases, the at least one new word is determined with high and stable accuracy; and because supervised training on a large amount of labeled data is not required, manpower and material resources are saved.
Description
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a vocabulary mining method, apparatus, server and storage medium based on artificial intelligence.
Background technique
In the field of artificial intelligence, natural language processing is an important research direction that studies theories and methods enabling effective communication between people and computers in natural language. The key difficulty of natural language processing is how to achieve semantic understanding and text analysis, and the basis of both is the construction of a theme dictionary. Constructing a theme dictionary is, in turn, a process of discovering new words through vocabulary mining.
At present, with the continuous development of machine learning, the natural language processing field has a variety of deep-learning-based techniques that mine vocabulary to obtain new words. Deep-learning-based vocabulary mining usually obtains a large amount of sample data and analyzes it comprehensively under the guidance of a supervised algorithm in order to determine the rules by which new words arise and change, for example by detecting indicators such as spelling variation, the consistency of co-occurring word distributions, and sentiment tendency to characterize how new Internet words change.
Because deep-learning-based vocabulary mining requires a large amount of labeled data for supervised training, because data labeling consumes considerable manpower and material resources, and because the accuracy of the mining result is positively correlated with the accuracy of the labels, these techniques depend heavily on data labeling and their accuracy is unstable.
Summary of the invention
Embodiments of the present invention provide a vocabulary mining method, apparatus, server and storage medium based on artificial intelligence, to solve the problem that current deep-learning-based vocabulary mining requires a large amount of labeled data for supervised training, that data labeling consumes considerable manpower and material resources, and that the accuracy of the mining result is positively correlated with the accuracy of the labels, which results in heavy dependence on data labeling and unstable accuracy. The technical solutions are as follows:
In one aspect, a vocabulary mining method based on artificial intelligence is provided, the method including:
obtaining a first text sample, a second text sample and a theme dictionary, where the first text sample is a text sample that corresponds to a target topic and contains the vocabulary to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes multiple words belonging to the target topic;
determining at least one first word according to the first text sample, the second text sample and a first text recognition model, where a first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose freedom degree is lower than a target freedom degree;
inputting the first text sample and the second text sample into a second text recognition model, and determining at least one second word according to the output of the second text recognition model, where a second word is a word that is a keyword in the first text sample and is not a keyword in the second text sample;
determining at least one new word based on the first text sample, the first word, the second word and the theme dictionary.
In another aspect, a vocabulary mining apparatus based on artificial intelligence is provided, the apparatus including:
an obtaining module, configured to obtain a first text sample, a second text sample and a theme dictionary, where the first text sample is a text sample that corresponds to a target topic and contains the vocabulary to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes multiple words belonging to the target topic;
a determining module, configured to determine at least one first word according to the first text sample, the second text sample and a first text recognition model, where a first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose freedom degree is lower than a target freedom degree;
the determining module being further configured to input the first text sample and the second text sample into a second text recognition model, and determine at least one second word according to the output of the second text recognition model, where a second word is a word that is a keyword in the first text sample and is not a keyword in the second text sample;
the determining module being further configured to determine at least one new word based on the first text sample, the first word, the second word and the theme dictionary.
In a possible implementation, the determining module is further configured to segment the first text sample to obtain at least one third word and segment the second text sample to obtain at least one fourth word; use the at least one third word and the at least one fourth word as input data of the first text recognition model; and determine at least one first word according to the output of the first text recognition model.
In another possible implementation, the determining module is further configured to input the first text sample and the second text sample into the second text recognition model and build a graph network structure of the text based on the algorithm implemented by that model; obtain, according to the graph network structure, at least one first keyword from the first text sample and at least one second keyword from the second text sample; delete from the at least one first keyword any word that duplicates the at least one second keyword; and use the remaining first keywords as the at least one second word.
In another possible implementation, the determining module is further configured to segment the first text sample using the seed words in the theme dictionary, the first word and the second word as the segmentation dictionary, to obtain multiple fifth words; cluster according to the word vectors of the multiple fifth words and the word vectors of the seed words; and determine at least one new word according to the clustering result.
In another possible implementation, the determining module is further configured to initialize a connectivity clustering model according to the word vectors of the seed words, and to perform connectivity clustering on the word vectors of the multiple fifth words by way of similarity propagation.
In another possible implementation, the determining module is further configured to: for the word vectors of the multiple fifth words, connect any two word vectors whose distance is less than a target distance; when the word vector of a fifth word is directly connected to the word vector of a seed word, place the fifth word and the seed word in the same class; and when the word vector of a fifth word is indirectly connected to the word vector of a seed word through the word vectors of other fifth words, determine the indirect similarity between the fifth word and the seed word according to the shortest path, and place the fifth word and the seed word in the same class if the indirect similarity is not less than a target similarity.
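The connectivity clustering described above can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the function names, the use of cosine distance, the default thresholds, and in particular the definition of indirect similarity as the product of edge similarities along a fewest-hop path are all assumptions, since the text does not specify them. Vectors are assumed to be non-zero.

```python
import math
from collections import deque

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_with_seed(seed, seed_vec, word_vecs, target_distance=0.3, target_similarity=0.6):
    """Return the fifth words placed in the same class as the seed word."""
    vecs = dict(word_vecs)
    vecs[seed] = seed_vec
    names = list(vecs)
    # Connect two word vectors whose cosine distance is below the target distance.
    neighbours = {n: [] for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if 1.0 - cosine(vecs[a], vecs[b]) < target_distance:
                neighbours[a].append(b)
                neighbours[b].append(a)
    # Breadth-first search reaches every connected word along a fewest-hop path.
    # ASSUMPTION: indirect similarity = product of edge similarities on that path.
    similarity = {seed: 1.0}
    queue = deque([seed])
    members = []
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt in similarity:
                continue
            similarity[nxt] = similarity[current] * cosine(vecs[current], vecs[nxt])
            queue.append(nxt)
            # Directly connected words always join; indirect ones must pass the threshold.
            if current == seed or similarity[nxt] >= target_similarity:
                members.append(nxt)
    return members
```

Because the product of similarities can only shrink along a path, a word reached through a chain of weakly related neighbours is rejected even though it is connected to the seed.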
In another possible implementation, the determining module is further configured to select from the clustering result at least one noun and verb that fit the target topic as new words of the target topic, where the target topic is the topic to which the theme dictionary belongs.
In another aspect, a server is provided. The server includes a processor and a memory; the memory stores program code, and the program code is loaded and executed by the processor to perform the operations of the vocabulary mining method based on artificial intelligence in the embodiments of the present invention.
In another aspect, a storage medium is provided. The storage medium stores program code, and the program code is used to perform the vocabulary mining method based on artificial intelligence in the embodiments of the present invention.
The technical solutions provided in the embodiments of the present invention have the following beneficial effects:
In the embodiments of the present invention, a first word that is frequent in the first text sample and has a high solidification degree and a low freedom degree is determined according to the first text recognition model, a second word that is a keyword in the first text sample is determined according to the second text recognition model, and at least one new word is obtained based on the first text sample, the first word, the second word and the theme dictionary. Because vocabulary is mined from several angles, including word frequency, solidification degree, freedom degree and co-occurrence relations between phrases, the at least one new word is determined with high and stable accuracy; and because supervised training on a large amount of labeled data is not required, manpower and material resources are saved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a structural block diagram of a vocabulary mining system 100 based on artificial intelligence according to an embodiment of the present invention;
Fig. 2 is a flowchart of a vocabulary mining method based on artificial intelligence according to an embodiment of the present invention;
Fig. 3 is a flowchart of a vocabulary mining method based on artificial intelligence according to an embodiment of the present invention;
Fig. 4 is a block diagram of a vocabulary mining apparatus based on artificial intelligence according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods that are consistent with some aspects of the present invention as detailed in the appended claims.
The embodiments of the present invention mainly concern vocabulary mining scenarios. For example, in a natural language processing task, for a given topic, vocabulary is mined from a topic corpus and a control corpus to obtain new words with which to expand an existing theme dictionary. The topic corpus is a text collection related to the topic under study; its sources may include transfer text, question text, search text and dialogue text. The control corpus is a text collection for other topics similar to the topic under study. For example, when the topic corpus is merchant transfer text, the control corpus may be other transfer text; when the topic corpus is search text from a particular professional field, the control corpus may be search text from related fields.
The main flow of vocabulary mining in the embodiments of the present invention is as follows:
First, a first text sample related to the given topic is obtained, together with a second text sample that serves as a control, and an existing theme dictionary is obtained. The words in the existing theme dictionary are all topic-related words mined previously, and the new words mined by the embodiments of the present invention can be used to expand it. Second, the first word and the second word are obtained from the first text sample, the second text sample and the theme dictionary. Third, clustering is performed based on the first word, the second word and the seed words in the theme dictionary. Finally, at least one new word is determined from the clustering result.
The technologies that the embodiments of the present invention may use are briefly introduced below:
Natural language processing. In the field of artificial intelligence, natural language processing (NLP) is an important research direction that studies theories and methods enabling effective communication between people and computers in natural language. It is a science that combines linguistics, computer science and mathematics. Natural language processing technologies typically include text processing, semantic understanding, machine translation, question-answering robots and knowledge graphs.
PMI (pointwise mutual information). PMI is one of the best measures of the association between words; it focuses mainly on the binding strength between character strings (that is, freedom degree and solidification degree).
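For reference, the standard definition of pointwise mutual information is given below. The text does not spell out a formula, so the symbols are assumptions: x and y are two adjacent strings, p(x, y) is the probability that they occur together, and p(x) and p(y) are the probabilities that each occurs on its own.

```latex
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```

A value well above zero means that x and y appear together far more often than chance would predict, which is the sense in which PMI measures binding strength.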
TF-IDF (term frequency-inverse document frequency). TF-IDF is a weighting technique commonly used in information retrieval and text mining to assess how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
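A minimal sketch of this weighting is shown below. It uses one common formulation of TF-IDF; the function name, the token-list representation of documents and the add-one smoothing are assumptions made for illustration and are not taken from the patent.

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """TF-IDF of `term` in `document` (a token list) relative to `corpus` (a list of token lists)."""
    # Term frequency: share of the document's tokens that are `term`.
    tf = Counter(document)[term] / len(document)
    # Document frequency: number of documents that contain `term`.
    containing = sum(1 for doc in corpus if term in doc)
    # ASSUMPTION: add-one smoothing in the denominator avoids division by zero.
    idf = math.log(len(corpus) / (1 + containing))
    return tf * idf
```

A word that appears in nearly every document receives a weight close to zero, however often it is repeated in a single document.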
Word2vec (word to vector). Word2vec is a group of related models for generating word vectors, created by a research team at Google led by Tomas Mikolov. The present invention uses it to compute the semantic and structural similarity between strings.
Connectivity clustering. A typical representative is hierarchical clustering, which builds clusters according to the connectivity between samples, so that all connected samples belong to the same cluster. During clustering, a cluster (class) is grown from a seed node according to similarity: the out-of-cluster word with the highest similarity is repeatedly added to the current cluster, until the similarity between the remaining out-of-cluster words and the current cluster falls below a threshold.
Chinese word segmentation. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification. Existing segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. Depending on whether segmentation is combined with part-of-speech tagging, they can also be divided into pure segmentation methods and integrated methods that combine segmentation with tagging.
TextRank algorithm. TextRank is a graph-based ranking algorithm for text. Its basic idea derives from Google's PageRank algorithm: the text is split into component units (words or sentences), a graph model is built over them, and a voting mechanism ranks the important components of the text, so that keyword extraction and summarization can be performed using only the information in a single document. Unlike models such as LDA and HMM, TextRank does not require prior training on multiple documents, and it is widely used because it is simple and effective. The present invention uses this algorithm to discover new words from the perspective of co-occurrence relations between strings.
Fig. 1 is a structural block diagram of a vocabulary mining system 100 based on artificial intelligence according to an embodiment of the present invention. Referring to Fig. 1, the system includes multiple terminals 101 and a vocabulary mining platform 102.
The terminal 101 is connected to the vocabulary mining platform 102 through a wireless or wired network. The terminal 101 may be at least one of a smartphone, a desktop computer, a tablet computer and a laptop computer. Taking the application of the system 100 to a merchant transfer topic as an example, an application that supports transfers is installed and runs on the terminal 101. The application may be a financial application, an instant messaging application, a social application, or the like. The terminal 101 may be a terminal used by a merchant user, and the merchant user's account is logged in to the application running on the terminal 101.
The vocabulary mining platform 102 includes at least one of a single server, multiple servers and a cloud computing platform. The platform can collect transfer data generated when users make transfers through the terminal 101, where the collected transfer data is data that the users have authorized.
Optionally, the vocabulary mining platform 102 includes an access server, a vocabulary mining server and a database. The access server provides access services for the terminal 101. The vocabulary mining server provides the vocabulary mining service. The database stores text samples, theme dictionaries and the like. There may be one or more vocabulary mining servers. When there are multiple vocabulary mining servers, at least two of them provide different services, and/or at least two of them provide the same service, for example in a load-balanced manner; the embodiments of this application place no limitation on this. The first text recognition model and the second text recognition model may be deployed on the vocabulary mining server. In the embodiments of this application, the first text recognition model is used to determine high-frequency solidified words, and the second text recognition model is used to determine keywords.
Fig. 2 is a flowchart of a vocabulary mining method based on artificial intelligence according to an embodiment of the present invention. As shown in Fig. 2, the embodiment is described using application of the method on a server as an example. The method includes the following steps:
201. The server obtains a first text sample, a second text sample and a theme dictionary, where the first text sample is a text sample that corresponds to a target topic and contains the vocabulary to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes multiple words belonging to the target topic.
In this step, the server may obtain, according to a given target topic, a text sample containing the vocabulary to be mined as the first text sample. For example, if the given target topic is merchants, the server may use the transfer text of users marked as merchants as the first text sample. Further, the first text sample may be a text sample from a target time period, such as transfer text from the most recent month or the most recent three months. Transfer text is the text that a merchant user enters when making a transfer; it may indicate the purpose of the transfer or a message to the payee, such as "XX payment for goods" or "XX goods".
In this step, the server may determine, according to the given target topic, other topics similar to it and obtain the second text sample according to those topics. For example, when the first text sample is the transfer text of merchant users, the second text sample may be the transfer text of non-merchant users. Using a second text sample from a similar topic as a control for the first text sample allows the mined new words to be distinguished more effectively from similar topics, so that the new words are highly relevant to the target topic.
In this step, the server may obtain the theme dictionary corresponding to the target topic from a database according to the correspondence between topics and theme dictionaries. For example, the theme dictionary corresponding to the merchant topic is a merchant dictionary, which includes multiple merchant-related words. The words in the theme dictionary may have been obtained by other means, or by the method provided in the embodiments of the present invention.
202a. The server determines at least one first word according to the first text sample, the second text sample and the first text recognition model, where a first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose freedom degree is lower than a target freedom degree.
In this step, the server may preprocess the first text sample and the second text sample; the preprocessing may be word segmentation of the two samples. The first text recognition model may be a PMI-TF-IDF model, which includes a first algorithm module implementing the PMI algorithm and a second algorithm module implementing the TF-IDF algorithm.
In an optional implementation, the server preprocesses the two text samples and determines at least one first word according to the first text recognition model as follows: the server segments the first text sample to obtain at least one third word and segments the second text sample to obtain at least one fourth word. The server then uses the at least one third word and the at least one fourth word as input data of the first text recognition model and determines at least one first word according to the output of the model.
In this step, the server may use the first algorithm module, which implements the PMI algorithm, to extract from the at least one third word at least one word whose solidification degree is higher than the target solidification degree and whose freedom degree is lower than the target freedom degree, and to extract from the at least one fourth word at least one word that meets the same two conditions.
Here, the solidification degree is how tightly the characters within a word are bound to one another: words such as "coloured glaze" and "durian" have a very high solidification degree, whereas words such as "child" and "combination" have a relatively low one. The freedom degree is the degree to which the characters in a word can be used freely on their own. For example, two-character fragments of the three-character Chinese word for "chocolate" (巧克力), such as 巧克 and 克力, have a high solidification degree and an almost zero degree of free use; that is, neither fragment can stand alone as a word. The target solidification degree and the target freedom degree can be set according to actual requirements, and the embodiments of the present invention place no specific limitation on them.
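One common way to turn the solidification degree into a number is to take the lowest PMI over every way of splitting a candidate into a left part and a right part. The sketch below follows that convention; it is an assumption for illustration, because the patent names PMI but gives no formula, and the helper names are hypothetical. The freedom degree is left out because the text does not say how it is computed.

```python
import math
from collections import Counter

def ngram_counts(text, max_len):
    """Count every substring of `text` up to `max_len` characters long."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def solidification(word, counts, total_chars):
    """Minimum PMI over every split of `word` into a left and a right part."""
    if counts[word] == 0:
        return float("-inf")  # the candidate never occurs in the text
    p_word = counts[word] / total_chars
    scores = []
    for cut in range(1, len(word)):
        p_left = counts[word[:cut]] / total_chars
        p_right = counts[word[cut:]] / total_chars
        scores.append(math.log(p_word / (p_left * p_right)))
    return min(scores)
```

Taking the minimum means that a candidate is only as solid as its weakest internal boundary, so a string that splits cleanly into two independent words scores low.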
In this step, the server may use the second algorithm module, which implements the TF-IDF algorithm, to extract from the at least one third word the high-frequency words whose word frequency in the first text sample is higher than the first word frequency, and to extract from the at least one fourth word the low-frequency words whose word frequency in the second text sample is lower than the second word frequency. The server may then select at least one word that appears among both the high-frequency words and the low-frequency words. The first word frequency and the second word frequency can be set according to actual requirements, and the embodiments of the present invention place no specific limitation on them.
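The frequency contrast in this step can be sketched as follows. The function name and the use of relative frequencies are assumptions. The sketch also treats a word that never appears in the second text sample as having frequency zero there, which is a slight departure from the wording above, where low-frequency words are drawn only from the fourth words.

```python
from collections import Counter

def contrast_by_frequency(third_words, fourth_words, first_freq, second_freq):
    """Words frequent in the first text sample and rare in the second."""
    topic_counts = Counter(third_words)
    control_counts = Counter(fourth_words)
    # ASSUMPTION: relative frequency, so samples of different sizes are comparable.
    high = {w for w, c in topic_counts.items() if c / len(third_words) > first_freq}
    # Counter returns 0 for words that are absent from the control sample.
    low = {w for w in topic_counts if control_counts[w] / len(fourth_words) < second_freq}
    return high & low
```

A word that is common in both samples, such as a generic greeting, passes the first condition but fails the second, so it is not kept as topic vocabulary.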
It should be noted that the two modules can be combined in several ways. The server may first use the first algorithm module to extract words with a high solidification degree and a low freedom degree, and then use the second algorithm module to extract high-frequency and low-frequency words from the extracted words, to obtain at least one first word. The server may instead first use the second algorithm module to extract high-frequency and low-frequency words, and then use the first algorithm module to extract words with a high solidification degree and a low freedom degree, to obtain at least one first word. The server may also extract words with the first algorithm module and the second algorithm module separately and then take the intersection of the extracted words, to obtain at least one first word. Finally, the server may extract words with the first algorithm module and the second algorithm module at the same time, directly obtaining at least one first word.
202b. The server uses the first text sample and the second text sample as input data of the second text recognition model and determines at least one second word according to the output of the second text recognition model, where a second word is a word that is a keyword in the first text sample and is not a keyword in the second text sample.
In this step, the second text recognition model may be a TextRank model, which includes a third algorithm module implementing the TextRank algorithm. The server may determine at least one second word according to the third algorithm module.
In an optional implementation manner, server determines the step of at least one the second vocabulary according to third algorithm module
It suddenly can be with are as follows: the first samples of text and the second samples of text can be inputted the second text identification model by server, and server can
With the third algorithm module for including based on the second text identification model, namely realize that the third algorithm module of TextRank algorithm is come
Construct the figure network structure of text.Server can obtain at least one from the first samples of text according to the figure network structure
First key vocabularies obtain at least one second key vocabularies from the second samples of text.Server can from least one
In one keyword delete at least one duplicate vocabulary of the second key vocabularies, using at least one remaining first keyword as
At least one second vocabulary.
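A rough sketch of this keyword-difference step is shown below. It uses a simplified, unweighted TextRank over a co-occurrence graph. The co-occurrence window, the damping factor of 0.85, the iteration count, and the function names are all assumptions made for illustration and are not taken from the description.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank tokens with a simplified, unweighted TextRank and return the top_k."""
    # Undirected co-occurrence graph: words inside the same window are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def second_words(first_tokens, second_tokens, top_k=3):
    """Keywords of the first sample that are not keywords of the second sample."""
    first_keywords = textrank_keywords(first_tokens, top_k)
    second_keywords = set(textrank_keywords(second_tokens, top_k))
    return [w for w in first_keywords if w not in second_keywords]


first = ["tea", "green", "tea", "black", "tea", "cup", "tea", "pot"]
second = ["transfer", "money", "transfer", "bank", "transfer", "fee"]
print(second_words(first, second, top_k=1))  # ['tea']
```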
In this step, the server can store the graph network structure in a Node2Node (point-to-point) structure. When the first text sample and/or the second text sample contain a large amount of content, the third algorithm module needs a large amount of memory to construct the graph network structure when implementing the TextRank algorithm, so the server can optimize the storage mode of the graph network structure.
In an optional implementation, the server can optimize the storage mode of the graph network structure from the Node2Node structure to an EdgeList (effective edge) structure; that is, the above graph network structure can be a graph network structure based on effective edges.
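The memory difference between the two storage modes can be illustrated with a small sketch. It assumes that Node2Node means a dense node-by-node matrix and that EdgeList means a list holding only the edges that actually exist; the description does not define either layout in detail, and the helper names are hypothetical.

```python
def to_node_matrix(nodes, edges):
    """Dense node-by-node storage: one cell for every pair of nodes."""
    index = {n: i for i, n in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for (u, v), weight in edges.items():
        matrix[index[u]][index[v]] = weight
        matrix[index[v]][index[u]] = weight
    return matrix


def to_edge_list(nodes, matrix):
    """Keep only effective (non-zero) edges as (u, v, weight) triples."""
    return [
        (nodes[i], nodes[j], matrix[i][j])
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
        if matrix[i][j] != 0.0
    ]


nodes = ["tea", "cup", "pot", "bank"]
edges = {("tea", "cup"): 2.0, ("tea", "pot"): 1.0}
matrix = to_node_matrix(nodes, edges)
edge_list = to_edge_list(nodes, matrix)
print(sum(len(row) for row in matrix), len(edge_list))  # 16 2
```

For a sparse co-occurrence graph the dense layout grows with the square of the number of words, while the edge list grows only with the number of edges.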
It should be noted that step 202a and step 202b can be executed simultaneously; step 202a can be executed first and then step 202b; or step 202b can be executed first and then step 202a. This is not limited in the embodiments of the present invention.
203. The server segments the first text sample according to the seed words in the theme dictionary, the first word, and the second word, to obtain a plurality of fifth words.
In this step, the server can obtain at least one seed word from the theme dictionary. A seed word is a word in the theme dictionary whose weight is not less than a target weight threshold.
In an optional implementation, the server obtains at least one seed word from the theme dictionary as follows: the server determines the weight of each word in the theme dictionary, and when the weight of any word is not less than the target weight threshold, uses that word as a seed word.
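The threshold test can be sketched in a few lines. The mapping from words to weights and the name `select_seed_words` are hypothetical; how the weights are computed is not described here.

```python
def select_seed_words(theme_dictionary, target_weight):
    """Return the words whose weight is not less than the target weight threshold."""
    return [word for word, weight in theme_dictionary.items() if weight >= target_weight]


weights = {"tea": 0.9, "cup": 0.4, "teapot": 0.7}  # illustrative weights only
print(select_seed_words(weights, target_weight=0.7))  # ['tea', 'teapot']
```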
Of course, in this embodiment, the server can obtain the at least one seed word directly after obtaining the theme dictionary, or can obtain the at least one seed word when executing this step, which is not specifically limited in the embodiments of the present invention. When the theme dictionary is updated, the server can obtain the seed words again.
In this step, since the first word and the second word are obtained by the server through the first text recognition model and the second text recognition model, the server can store the first word and the second word in a candidate dictionary, and segment the first text sample using the words in the candidate dictionary and the seed words in the theme dictionary as the segmentation dictionary, to obtain a plurality of fifth words. Because the first word, the second word, and the seed words serve as the segmentation dictionary when the first text sample is segmented, the resulting fifth words fit the topic better and conform more closely to natural language.
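To show how a custom dictionary changes segmentation, here is a sketch using forward maximum matching. The description does not name a segmentation algorithm, so this choice is an assumption, as are the function name `segment` and the Chinese sample strings, which are assumed back-translations of the example words "tea", "new style", and "new arrival".

```python
def segment(text, dictionary, max_len=None):
    """Forward maximum matching: at each position take the longest dictionary word."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


text = "新款茶叶上新"
print(segment(text, {"茶叶"}))                  # ['新', '款', '茶叶', '上', '新']
print(segment(text, {"茶叶", "新款", "上新"}))  # ['新款', '茶叶', '上新']
```

With only a general word in the dictionary, the mined words are split into single characters; adding the candidate words and seed words keeps them intact as fifth words.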
In this step, the server can vectorize the obtained fifth words according to a Word2vec model, to obtain the word vectors of the fifth words.
In an optional implementation, the server can adjust the parameters of the Word2vec model so that the Word2vec model is better suited to the text sample. For example, the default value of the parameter window-size (window size) is 10, which is suitable for processing long text, where the window indicates the maximum distance in a sentence between the current word and the predicted word. When window-size is increased, the word vectors become more similar in terms of topic; when window-size is decreased, the word vectors capture more functional and syntactic similarity. As another example, the default value of the parameter vector-size (vector size) is 100; when vector-size is increased, the accuracy of the similarity between words decreases, and the dimension of the similarity increases.
For example, when the server processes a merchant topic, the transfer texts of merchants are mostly short and fragmented, so the default value of window-size is not suitable for them. The server can therefore increase window-size to 25 to process the transfer texts of merchants.
As another example, for the parameter vector-size: when vector-size is small, "Teddy" is more likely to be associated with "dog food" and "leash"; as vector-size gradually increases, "Teddy" starts to be associated with words such as "Persian cat" and "pet", and its similarity to words such as "dog food" and "leash" decreases. Therefore, when processing a merchant topic, the server can determine the value of vector-size according to the total number of words in the text sample. The value of vector-size can be sqrt(|V|)/2, where sqrt() denotes the square root and |V| denotes the total number of words.
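The sqrt(|V|)/2 rule can be computed as in the sketch below. It assumes |V| is the number of distinct words and that the result is rounded to an integer of at least 1; neither detail is stated in the description, and `suggested_vector_size` is a hypothetical name.

```python
import math


def suggested_vector_size(tokens):
    """Apply vector-size = sqrt(|V|) / 2, rounded to a positive integer."""
    vocabulary_size = len(set(tokens))  # assumed meaning of |V|: distinct words
    return max(1, round(math.sqrt(vocabulary_size) / 2))


tokens = [f"w{i}" for i in range(10000)]  # 10,000 distinct placeholder words
print(suggested_vector_size(tokens))  # 50
```

The resulting value would then be passed to whichever Word2vec implementation is used; in gensim, for instance, the corresponding constructor parameters are `vector_size` and `window`.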
204. The server performs clustering according to the word vectors of the fifth words and the word vectors of the seed words.
In this step, the server can perform clustering through a connectivity clustering model, and the word vectors of the seed words can be used to initialize the connectivity clustering model. A word vector is the representation of a natural-language word in word space, and the distance between words represents the similarity between them.
In an optional implementation, the server clusters according to the word vectors of the fifth words and the seed words as follows: the server initializes the connectivity clustering model according to the seed words, and then performs connectivity clustering on the word vectors of the fifth words based on similarity transfer.
Specifically, the server performs connectivity clustering on the word vectors of the fifth words based on similarity transfer as follows: for the word vectors of the fifth words, the server connects any two word vectors whose distance is less than a target distance. When the word vector of any fifth word is directly connected to the word vector of a seed word, the server treats the fifth word and the seed word as the same class and assigns them to the same cluster. When the word vector of any fifth word is indirectly connected to the word vector of a seed word through the word vectors of other fifth words, the server determines the indirect similarity between the fifth word and the seed word according to the shortest path; if the indirect similarity is not less than a target similarity, the server treats the fifth word and the seed word as the same class and assigns them to the same cluster. The target distance and the target similarity can be set according to actual needs, which is not specifically limited in the embodiments of the present invention. The indirect similarity is determined by the transfer path between the word and the seed word and by an attenuation coefficient.
For example, the target similarity is 0.75 and the target distance is 2, so the server connects any two word vectors whose distance is less than 2. Seed word Z is directly connected to fifth words A and B; A is connected to C and D, B is connected to E, and D is connected to F. That is, C and D are connected to Z through A, E is connected to Z through B, and F is connected to Z through D and A. The distance from D to Z via A is 3, and the indirect similarity between D and Z determined by the attenuation coefficient is 0.8. In the same way, the indirect similarity between C and Z is 0.78 and the indirect similarity between E and Z is 0.83. The shortest path from F to Z is 5, and the indirect similarity between F and Z determined by the attenuation coefficient is 0.6, which is less than the target similarity of 0.75. Therefore A, B, C, D, E, and Z are assigned to the same cluster, and F is excluded.
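The similarity-transfer procedure can be sketched as below, using the same topology as the example. The attenuation formula is not given in the description, so this sketch assumes the indirect similarity is `decay ** shortest_path_length`; the pairwise distances, the decay value of 0.92, and the function name are hypothetical, and the computed similarities do not reproduce the figures quoted in the example.

```python
import heapq


def cluster_around_seed(seed, distances, target_distance, target_similarity, decay):
    """Return the words assigned to the same cluster as the seed word."""
    # Connect only the pairs whose distance is below the target distance.
    graph = {}
    for (u, v), d in distances.items():
        if d < target_distance:
            graph.setdefault(u, {})[v] = d
            graph.setdefault(v, {})[u] = d
    # Dijkstra: shortest path length from the seed to every reachable word.
    shortest = {seed: 0.0}
    queue = [(0.0, seed)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > shortest.get(node, float("inf")):
            continue
        for nxt, d in graph.get(node, {}).items():
            candidate = dist + d
            if candidate < shortest.get(nxt, float("inf")):
                shortest[nxt] = candidate
                heapq.heappush(queue, (candidate, nxt))
    cluster = set()
    for word, dist in shortest.items():
        if word == seed:
            continue
        directly_connected = word in graph.get(seed, {})
        # Assumed attenuation: similarity decays geometrically with path length.
        if directly_connected or decay ** dist >= target_similarity:
            cluster.add(word)
    return cluster


distances = {
    ("Z", "A"): 1.5, ("Z", "B"): 1.0, ("A", "C"): 1.2, ("A", "D"): 1.5,
    ("B", "E"): 1.0, ("D", "F"): 1.9, ("C", "F"): 2.5,  # C-F is too far to connect
}
print(sorted(cluster_around_seed("Z", distances, 2.0, 0.75, 0.92)))
# ['A', 'B', 'C', 'D', 'E']
```

F is reachable only through D and A with a path length of 4.9, so its assumed similarity falls below 0.75 and it stays outside the cluster.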
205. The server determines at least one new word according to the clustering result.
In this step, the server can use the clustering result directly as new words, thereby determining at least one new word; the server can also filter the clustering result to determine at least one new word.
In an optional implementation, the server filters the clustering result and determines at least one new word as follows: the server selects from the clustering result at least one noun and verb that fit the target topic as new words of the target topic, the target topic being the topic to which the theme dictionary belongs.
For example, the server can remove some generic words from the clustering result, such as personal names, addresses, blessings, dates, and words with special meanings, so as to retain the nouns and verbs that fit the target topic, such as "tea", "new style", and "new arrival".
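The filtering step can be sketched as follows, assuming a part-of-speech tag and a list of generic words are already available for each clustered word. The tag names, the generic-word list, and the sample words (including the personal name) are hypothetical; the description does not say how part of speech or genericity is determined.

```python
def filter_new_words(cluster_words, pos_tags, generic_words):
    """Keep nouns and verbs from the cluster that are not generic words."""
    keep_tags = {"noun", "verb"}
    return [
        w for w in cluster_words
        if w not in generic_words and pos_tags.get(w) in keep_tags
    ]


cluster = ["tea", "new arrival", "Zhang Wei", "happy new year", "2019-08-16"]
pos_tags = {"tea": "noun", "new arrival": "verb", "Zhang Wei": "noun",
            "happy new year": "phrase", "2019-08-16": "date"}
generic = {"Zhang Wei", "happy new year", "2019-08-16"}
print(filter_new_words(cluster, pos_tags, generic))  # ['tea', 'new arrival']
```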
It should be noted that steps 203 to 204 above are one optional implementation in which the server performs clustering based on the first text sample, the first word, the second word, and the theme dictionary. The server can also perform clustering with a graph network learning method that makes use of the entire neighborhood information, which is not specifically limited in the present invention.
It should also be noted that, to make the process of executing steps 201 to 205 clearer, reference may be made to Fig. 3, which is a flowchart of a vocabulary mining method based on artificial intelligence provided by an embodiment of the present invention. The flowchart is divided into four parts: the first part corresponds to step 201, the second part corresponds to steps 202a and 202b, the third part corresponds to steps 203 and 204, and the fourth part corresponds to step 205.
In the embodiments of the present invention, a first word that is high-frequency in the first text sample and has a high solidification degree and a low degree of freedom is determined according to the first text recognition model, a second word that is a keyword in the first text sample is determined according to the second text recognition model, and at least one new word is obtained based on the first text sample, the first word, the second word, and the theme dictionary. Because words are mined from different angles, such as word frequency, solidification degree, degree of freedom, and word co-occurrence relationships, the at least one new word that is determined is accurate and stable; and because supervised training on a large amount of labeled data is not needed, manpower and material resources are saved.
Fig. 4 is a block diagram of a vocabulary mining apparatus based on artificial intelligence provided according to an exemplary embodiment. The apparatus is used to perform the steps of the above vocabulary mining method based on artificial intelligence. Referring to Fig. 4, the apparatus includes an obtaining module 401 and a determining module 402.
The obtaining module 401 is configured to obtain a first text sample, a second text sample, and a theme dictionary, where the first text sample is a text sample that corresponds to a target topic and includes words to be mined, the second text sample is a text sample corresponding to a topic similar to the target topic, and the theme dictionary includes a plurality of words belonging to the target topic;
The determining module 402 is configured to determine at least one first word according to the first text sample, the second text sample, and a first text recognition model, where the first word is a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose degree of freedom is lower than a target degree of freedom;
The determining module 402 is further configured to input the first text sample and the second text sample into a second text recognition model, and determine at least one second word according to the output result of the second text recognition model, where the second word is a word that is a keyword in the first text sample and a non-keyword in the second text sample;
The determining module 402 is further configured to determine at least one new word based on the first text sample, the first word, the second word, and the theme dictionary.
In one possible implementation, the determining module 402 is further configured to segment the first text sample to obtain at least one third word, and segment the second text sample to obtain at least one fourth word; use the at least one third word and the at least one fourth word as input data of the first text recognition model; and determine at least one first word according to the output result of the first text recognition model.
In another possible implementation, the determining module 402 is further configured to input the first text sample and the second text sample into the second text recognition model, and construct a graph network structure of the text based on the algorithm implemented by the second text recognition model; obtain, according to the graph network structure, at least one first keyword from the first text sample and at least one second keyword from the second text sample; delete from the at least one first keyword any word that duplicates the at least one second keyword; and use the remaining at least one first keyword as the at least one second word.
In another possible implementation, the determining module 402 is further configured to segment the first text sample using the seed words in the theme dictionary, the first word, and the second word as the segmentation dictionary, to obtain a plurality of fifth words; perform clustering according to the word vectors of the fifth words and the word vectors of the seed words; and determine at least one new word according to the clustering result.
In another possible implementation, the determining module 402 is further configured to initialize a connectivity clustering model according to the word vectors of the seed words, and perform connectivity clustering on the word vectors of the fifth words based on similarity transfer.
In another possible implementation, the determining module 402 is further configured to, for the word vectors of the fifth words, connect any two word vectors whose distance is less than a target distance; when the word vector of any fifth word is directly connected to the word vector of a seed word, treat the fifth word and the seed word as the same class; and when the word vector of any fifth word is indirectly connected to the word vector of a seed word through the word vectors of other fifth words, determine the indirect similarity between the fifth word and the seed word according to the shortest path, and if the indirect similarity is not less than a target similarity, treat the fifth word and the seed word as the same class.
In another possible implementation, the determining module 402 is further configured to select from the clustering result at least one noun and verb that fit the target topic as new words of the target topic, the target topic being the topic to which the theme dictionary belongs.
In the embodiments of the present invention, a first word that is high-frequency in the first text sample and has a high solidification degree and a low degree of freedom is determined according to the first text recognition model, a second word that is a keyword in the first text sample is determined according to the second text recognition model, and at least one new word is obtained based on the first text sample, the first word, the second word, and the theme dictionary. Because words are mined from different angles, such as word frequency, solidification degree, degree of freedom, and word co-occurrence relationships, the at least one new word that is determined is accurate and stable; and because supervised training on a large amount of labeled data is not needed, manpower and material resources are saved.
It should be understood that, when the vocabulary mining apparatus based on artificial intelligence provided by the above embodiment runs an application program, the division into the above functional modules is only an example. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In addition, the vocabulary mining apparatus based on artificial intelligence provided by the above embodiment and the embodiment of the vocabulary mining method based on artificial intelligence belong to the same concept; for the specific implementation process, see the method embodiment, which is not repeated here.
Fig. 5 is a schematic structural diagram of a server provided by an embodiment of the present invention. The server can vary considerably depending on configuration or performance, and can include one or more processors (central processing units, CPU) 501 and one or more memories 502, where at least one instruction is stored in the memory 502 and is loaded and executed by the processor 501 to implement the methods provided by the above method embodiments. Of course, the server can also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and can also include other components for implementing device functions, which are not described here.
An embodiment of the present invention also provides a computer-readable storage medium applied to a server. The computer-readable storage medium stores program code, which is loaded and executed by a processor to implement the operations performed by the server in the vocabulary mining method based on artificial intelligence of the above embodiments.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A vocabulary mining method based on artificial intelligence, characterized in that the method comprises:
obtaining a first text sample, a second text sample, and a theme dictionary, the first text sample being a text sample that corresponds to a target topic and includes words to be mined, the second text sample being a text sample corresponding to a topic similar to the target topic, and the theme dictionary including a plurality of words belonging to the target topic;
determining at least one first word according to the first text sample, the second text sample, and a first text recognition model, the first word being a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose degree of freedom is lower than a target degree of freedom;
inputting the first text sample and the second text sample into a second text recognition model, and determining at least one second word according to an output result of the second text recognition model, the second word being a word that is a keyword in the first text sample and a non-keyword in the second text sample;
determining at least one new word based on the first text sample, the first word, the second word, and the theme dictionary.
2. The method according to claim 1, characterized in that the determining at least one first word according to the first text sample, the second text sample, and the first text recognition model comprises:
segmenting the first text sample to obtain at least one third word, and segmenting the second text sample to obtain at least one fourth word;
using the at least one third word and the at least one fourth word as input data of the first text recognition model;
determining at least one first word according to an output result of the first text recognition model.
3. The method according to claim 1, characterized in that the inputting the first text sample and the second text sample into the second text recognition model, and determining at least one second word according to the output result of the second text recognition model comprises:
inputting the first text sample and the second text sample into the second text recognition model, and constructing a graph network structure of the text based on an algorithm implemented by the second text recognition model;
obtaining, according to the graph network structure, at least one first keyword from the first text sample and at least one second keyword from the second text sample;
deleting from the at least one first keyword any word that duplicates the at least one second keyword;
using the remaining at least one first keyword as the at least one second word.
4. The method according to claim 1, characterized in that the determining at least one new word based on the first text sample, the first word, the second word, and the theme dictionary comprises:
segmenting the first text sample using seed words in the theme dictionary, the first word, and the second word as a segmentation dictionary, to obtain a plurality of fifth words;
performing clustering according to the word vectors of the plurality of fifth words and the word vectors of the seed words;
determining at least one new word according to a clustering result.
5. The method according to claim 4, characterized in that the performing clustering according to the word vectors of the plurality of fifth words and the word vectors of the seed words comprises:
initializing a connectivity clustering model according to the word vectors of the seed words;
performing connectivity clustering on the word vectors of the plurality of fifth words based on similarity transfer.
6. The method according to claim 5, characterized in that the performing connectivity clustering on the word vectors of the plurality of fifth words based on similarity transfer comprises:
for the word vectors of the plurality of fifth words, connecting any two word vectors whose distance is less than a target distance;
when the word vector of any fifth word is directly connected to the word vector of a seed word, treating the fifth word and the seed word as the same class;
when the word vector of any fifth word is indirectly connected to the word vector of a seed word through the word vectors of other fifth words, determining an indirect similarity between the fifth word and the seed word according to the shortest path, and if the indirect similarity is not less than a target similarity, treating the fifth word and the seed word as the same class.
7. The method according to claim 4, characterized in that the determining at least one new word according to the clustering result comprises:
selecting from the clustering result at least one noun and verb that fit the target topic as new words of the target topic, the target topic being the topic to which the theme dictionary belongs.
8. A vocabulary mining apparatus based on artificial intelligence, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a first text sample, a second text sample, and a theme dictionary, the first text sample being a text sample that corresponds to a target topic and includes words to be mined, the second text sample being a text sample corresponding to a topic similar to the target topic, and the theme dictionary including a plurality of words belonging to the target topic;
a determining module, configured to determine at least one first word according to the first text sample, the second text sample, and a first text recognition model, the first word being a word whose word frequency in the first text sample is higher than a first word frequency, whose word frequency in the second text sample is lower than a second word frequency, whose solidification degree is higher than a target solidification degree, and whose degree of freedom is lower than a target degree of freedom;
the determining module being further configured to input the first text sample and the second text sample into a second text recognition model, and determine at least one second word according to an output result of the second text recognition model, the second word being a word that is a keyword in the first text sample and a non-keyword in the second text sample;
the determining module being further configured to determine at least one new word based on the first text sample, the first word, the second word, and the theme dictionary.
9. A server, characterized in that the server comprises a processor and a memory, the memory being used to store program code, and the program code being loaded by the processor to execute the vocabulary mining method based on artificial intelligence according to any one of claims 1 to 7.
10. A storage medium, characterized in that the storage medium is used to store program code, and the program code is used to execute the vocabulary mining method based on artificial intelligence according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910760785.5A CN110457708B (en) | 2019-08-16 | 2019-08-16 | Vocabulary mining method and device based on artificial intelligence, server and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110457708A true CN110457708A (en) | 2019-11-15 |
CN110457708B CN110457708B (en) | 2023-05-16 |
Family
ID=68487239
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910760785.5A Active CN110457708B (en) | 2019-08-16 | 2019-08-16 | Vocabulary mining method and device based on artificial intelligence, server and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110457708B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111626054A (en) * | 2020-05-21 | 2020-09-04 | 北京明亿科技有限公司 | New illegal behavior descriptor identification method and device, electronic equipment and storage medium |
CN111931501A (en) * | 2020-09-22 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Text mining method based on artificial intelligence, related device and equipment |
CN112948570A (en) * | 2019-12-11 | 2021-06-11 | 复旦大学 | Unsupervised automatic domain knowledge map construction system |
CN113011875A (en) * | 2021-01-12 | 2021-06-22 | 腾讯科技(深圳)有限公司 | Text processing method and device, computer equipment and storage medium |
CN113590797A (en) * | 2021-08-05 | 2021-11-02 | 云上贵州大数据产业发展有限公司 | Intelligent operation and maintenance customer service system and implementation method |
CN113609844A (en) * | 2021-07-30 | 2021-11-05 | 国网山西省电力公司晋城供电公司 | Electric power professional word bank construction method based on hybrid model and clustering algorithm |
CN114444514A (en) * | 2022-02-08 | 2022-05-06 | 北京百度网讯科技有限公司 | Semantic matching model training method, semantic matching method and related device |
CN115879515A (en) * | 2023-02-20 | 2023-03-31 | 江西财经大学 | Document network theme modeling method, variation neighborhood encoder, terminal and medium |
CN116304016A (en) * | 2022-12-29 | 2023-06-23 | 太和康美(北京)中医研究院有限公司 | Method and device for analyzing commonality of documents |
Non-Patent Citations (4)
Title |
---|
LIPING DU: "Chinese term extraction from web pages based on expected point-wise mutual information", 2016 12TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (ICNC-FSKD) * |
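代六玲, 黄河燕, 陈肇雄: "A comparative study of feature extraction methods in Chinese text classification", 中文信息学报 (Journal of Chinese Information Processing) * |
李筱瑜: "Research on word segmentation of ancient-book texts based on new word discovery and dictionary information", 软件导刊 (Software Guide) * |
陈炯, 张永奎: "A text feature description method based on word clustering", 计算机系统应用 (Computer Systems & Applications) * |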
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112948570A (en) * | 2019-12-11 | 2021-06-11 | 复旦大学 | Unsupervised automatic domain knowledge map construction system |
CN111626054B (en) * | 2020-05-21 | 2023-12-19 | 北京明亿科技有限公司 | Novel illegal action descriptor recognition method and device, electronic equipment and storage medium |
CN111626054A (en) * | 2020-05-21 | 2020-09-04 | 北京明亿科技有限公司 | New illegal behavior descriptor identification method and device, electronic equipment and storage medium |
CN111931501A (en) * | 2020-09-22 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Text mining method based on artificial intelligence, related device and equipment |
CN111931501B (en) * | 2020-09-22 | 2021-01-08 | 腾讯科技(深圳)有限公司 | Text mining method based on artificial intelligence, related device and equipment |
CN113011875A (en) * | 2021-01-12 | 2021-06-22 | 腾讯科技(深圳)有限公司 | Text processing method and device, computer equipment and storage medium |
CN113011875B (en) * | 2021-01-12 | 2024-03-29 | 腾讯科技(深圳)有限公司 | Text processing method, text processing device, computer equipment and storage medium |
CN113609844A (en) * | 2021-07-30 | 2021-11-05 | 国网山西省电力公司晋城供电公司 | Electric power professional word bank construction method based on hybrid model and clustering algorithm |
CN113609844B (en) * | 2021-07-30 | 2024-03-08 | 国网山西省电力公司晋城供电公司 | Electric power professional word stock construction method based on hybrid model and clustering algorithm |
CN113590797A (en) * | 2021-08-05 | 2021-11-02 | 云上贵州大数据产业发展有限公司 | Intelligent operation and maintenance customer service system and implementation method |
CN114444514A (en) * | 2022-02-08 | 2022-05-06 | 北京百度网讯科技有限公司 | Semantic matching model training method, semantic matching method and related device |
CN116304016A (en) * | 2022-12-29 | 2023-06-23 | 太和康美(北京)中医研究院有限公司 | Method and device for analyzing commonality of documents |
CN116304016B (en) * | 2022-12-29 | 2023-10-10 | 太和康美(北京)中医研究院有限公司 | Method and device for analyzing commonality of documents |
CN115879515A (en) * | 2023-02-20 | 2023-03-31 | 江西财经大学 | Document network theme modeling method, variation neighborhood encoder, terminal and medium |
Also Published As
Publication number | Publication date |
---|---|
CN110457708B (en) | 2023-05-16 |
Similar Documents
Publication | Title |
---|---|
CN110457708A (en) | Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence |
CN110337645B (en) | Adaptable processing assembly |
TW202009749A (en) | Human-machine dialog method, device, electronic apparatus and computer readable medium |
CN111143576A (en) | Event-oriented dynamic knowledge graph construction method and device |
US20190361977A1 (en) | Training data expansion for natural language classification |
CN111931500B (en) | Search information processing method and device |
US11423070B2 (en) | System, computer program product and method for generating embeddings of textual and quantitative data |
CN109165386A (en) | A kind of Chinese empty anaphora resolution method and system |
CN111858935A (en) | Fine-grained emotion classification system for flight comment |
CN113095080B (en) | Theme-based semantic recognition method and device, electronic equipment and storage medium |
CN111194401B (en) | Abstraction and portability of intent recognition |
CN110162771A (en) | The recognition methods of event trigger word, device, electronic equipment |
ALBayari et al. | Cyberbullying classification methods for Arabic: A systematic review |
CN113392179A (en) | Text labeling method and device, electronic equipment and storage medium |
Chandola et al. | Online resume parsing system using text analytics |
US11361031B2 (en) | Dynamic linguistic assessment and measurement |
Nazarizadeh et al. | Sentiment analysis of Persian language: review of algorithms, approaches and datasets |
Corredera Arbide et al. | Affective computing for smart operations: a survey and comparative analysis of the available tools, libraries and web services |
Zhang | Sentiment analysis of Chinese commodity reviews based on deep learning |
Iorliam et al. | A Comparative Analysis of Generative Artificial Intelligence Tools for Natural Language Processing |
CN114398482A (en) | Dictionary construction method and device, electronic equipment and storage medium |
Buyukbas et al. | Explainability in Irony detection |
Walsh | Natural Language Processing |
Khandare et al. | Study of Python libraries for NLP |
Kubis et al. | EUDAMU at SemEval-2017 task 11: Action ranking and type matching for end-user development |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||