CN110175331A - Method, apparatus, electronic device and computer-readable storage medium for recognizing technical terms - Google Patents
- Publication number
- CN110175331A (application CN201910457246.4A)
- Authority
- CN
- China
- Prior art keywords
- corpus
- technical term
- vocabulary
- frequency value
- reverse document
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
This application provides a method, an apparatus, an electronic device, and a computer-readable storage medium for recognizing technical terms, and relates to the field of natural language processing. The method comprises: obtaining a first corpus of the professional domain to which the technical terms belong and a second corpus of a non-professional domain; then, based on the first corpus and the second corpus, obtaining from the first corpus the words whose inverse document frequency (IDF) value is greater than a preset IDF value, and determining those words to be technical terms. By contrasting the two corpora, the application identifies technical terms more completely, which raises the recognition rate of technical terms and in turn the quality of natural language processing. Further, based on the positional information of the technical terms, new technical terms are obtained from the first corpus, which further raises the recognition rate and the quality of natural language processing.
Description
Technical field
This application relates to the technical field of natural language processing, and in particular to a method, an apparatus, an electronic device, and a computer-readable storage medium for recognizing technical terms.
Background
Natural language processing is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a discipline that combines linguistics, computer science, and mathematics. Research in this field therefore deals with natural language, that is, the language people use every day, so it is closely connected to linguistic research while differing from it in important ways.
In natural language processing, the most important step is keyword extraction, and in existing natural language processing technology TF-IDF (term frequency-inverse document frequency) is the most common keyword extraction method. However, the technical terms of some professional domains are complex, and common word segmentation techniques have difficulty identifying them in text. As a result, the recognition rate of technical terms is low, which in turn degrades the quality of natural language processing.
Summary of the invention
This application provides a method, an apparatus, an electronic device, and a computer-readable storage medium for recognizing technical terms, which can solve the problem that, in natural language processing, the prior art has a low recognition rate for technical terms and therefore poor processing quality. The technical solution is as follows:
In a first aspect, a method for recognizing technical terms is provided, the method comprising:
Obtaining a first corpus of the professional domain to which the technical terms belong, and obtaining a second corpus of a non-professional domain;
Based on the first corpus and the second corpus, obtaining from the first corpus the words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and determining those words to be technical terms;
Based on the positional information of the technical terms, determining new technical terms from the first corpus.
Preferably, obtaining the first corpus of the professional domain to which the technical terms belong comprises:
Obtaining first text information from webpages associated with the professional domain, and using the first text information as the first corpus.
Preferably, the step of obtaining the second corpus of the non-professional domain comprises:
Obtaining second text information from webpages not associated with the professional domain, and using the second text information as the second corpus.
Preferably, the step of obtaining from the first corpus, based on the first corpus and the second corpus, the words whose IDF value is greater than or equal to the preset IDF value, and determining those words to be technical terms, comprises:
Calculating the inverse document frequency of each word in the first corpus to obtain a plurality of first IDF values;
Calculating the inverse document frequency, in the second corpus, of each word of the first corpus to obtain a plurality of second IDF values;
When the difference between the first IDF value and the second IDF value of any word is greater than or equal to the preset IDF value, determining that the word is a technical term.
Preferably, the step of determining new technical terms from the first corpus based on the positional information of the technical terms comprises:
Obtaining the positional information of the technical terms in the first corpus, the positional information comprising mutual information and left and right entropy information;
Obtaining new technical terms in the first corpus according to the mutual information and the left and right entropy information.
Preferably, the step of determining new technical terms from the first corpus based on the positional information of the technical terms further comprises:
Inputting the IDF values of the technical terms and the positional information into a preset conditional random field model to obtain new technical terms in the first corpus.
In a second aspect, an apparatus for recognizing technical terms is provided, the apparatus comprising:
An obtaining module, configured to obtain a first corpus of the professional domain to which the technical terms belong, and to obtain a second corpus of a non-professional domain;
A first determining module, configured to obtain from the first corpus, based on the first corpus and the second corpus, the words whose IDF value is greater than or equal to a preset IDF value, and to determine those words to be technical terms;
A second determining module, configured to determine new technical terms from the first corpus based on the positional information of the technical terms.
Preferably, the obtaining module is specifically configured to:
Obtain first text information from webpages associated with the professional domain, and use the first text information as the first corpus.
Preferably, the obtaining module is specifically configured to:
Obtain second text information from webpages not associated with the professional domain, and use the second text information as the second corpus.
Preferably, the first determining module comprises:
A first calculation submodule, configured to calculate the inverse document frequency of each word in the first corpus to obtain a plurality of first IDF values;
A second calculation submodule, configured to calculate the inverse document frequency, in the second corpus, of each word of the first corpus to obtain a plurality of second IDF values;
A comparison submodule, configured to determine that a word is a technical term when the difference between its first IDF value and its second IDF value is greater than or equal to the preset IDF value.
Preferably, the second determining module comprises:
A positional information obtaining submodule, configured to obtain the positional information of the technical terms in the first corpus, the positional information comprising mutual information and left and right entropy information;
A first calculation submodule, configured to obtain new technical terms in the first corpus according to the mutual information and the left and right entropy information.
Preferably, the second determining module further comprises:
A second calculation submodule, configured to input the IDF values of the technical terms and the positional information into a preset conditional random field model to obtain new technical terms in the first corpus.
In a third aspect, an electronic device is provided, the electronic device comprising:
A processor, a memory, and a bus;
The bus, configured to connect the processor and the memory;
The memory, configured to store operation instructions;
The processor, configured to call the operation instructions so as to perform the operations corresponding to the method for recognizing technical terms according to the first aspect of this application.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the method for recognizing technical terms according to the first aspect of this application.
The technical solution provided by this application has the following beneficial effects:
A first corpus of the professional domain to which the technical terms belong and a second corpus of a non-professional domain are obtained; then, based on the first corpus and the second corpus, the words whose inverse document frequency (IDF) value is greater than a preset IDF value are obtained from the first corpus and determined to be technical terms. Contrasting the professional-domain corpus with the non-professional-domain corpus in this way picks out the words that occur frequently in the professional-domain corpus. Compared with the prior art, which identifies technical terms in text using only common word segmentation techniques, this application identifies technical terms more completely, which raises the recognition rate of technical terms and in turn the quality of natural language processing.
Further, new technical terms are obtained from the first corpus based on the IDF values and positional information of the technical terms. In this way, technical terms that occur infrequently but appear near frequently occurring technical terms can also be determined, which further raises the recognition rate of technical terms and the quality of natural language processing.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings used in describing the embodiments are briefly introduced below.
Fig. 1 is a flow diagram of a method for recognizing technical terms provided by one embodiment of this application;
Fig. 2 is a structural diagram of an apparatus for recognizing technical terms provided by another embodiment of this application;
Fig. 3 is a structural diagram of an electronic device for recognizing technical terms provided by yet another embodiment of this application.
Detailed description
Embodiments of this application are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain this application, and are not to be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in the specification of this application indicates the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes all or any of the units, and all combinations, of one or more of the associated listed items.
To make the purposes, technical solutions, and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the drawings.
The method, apparatus, electronic device, and computer-readable storage medium for recognizing technical terms provided by this application are intended to solve the above technical problems of the prior art.
The technical solution of this application, and how it solves the above technical problems, are described in detail below through specific embodiments. The specific embodiments below may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of this application are described below with reference to the drawings.
One embodiment provides a method for recognizing technical terms. As shown in Fig. 1, the method comprises:
Step S101: obtain a first corpus of the professional domain to which the technical terms belong, and obtain a second corpus of a non-professional domain;
In practice, every professional domain has its own technical terms, and the complexity of those terms differs from domain to domain. In natural language processing, simpler terms, such as "base station" in the communications field, can be identified by common word segmentation techniques, but more complex ones may not be. For example, some technical terms in medicine, chemistry, and biology are highly complex, and common word segmentation techniques recognize them at a very low rate.
To solve the above problem, embodiments of the present invention can obtain a corpus containing technical terms for any professional domain.
In a preferred embodiment of the present invention, obtaining the first corpus of the professional domain to which the technical terms belong comprises:
Obtaining first text information from webpages associated with the professional domain, and using the first text information as the first corpus.
Specifically, every professional domain has corresponding authoritative BBS forums or other kinds of webpages that carry a large number of professional articles, question-and-answer threads on professional knowledge, help requests, and similar content. Such articles and help requests usually contain many technical terms, and those terms occur relatively often, so embodiments of the present invention can take the text of these articles and help requests as the first corpus.
Further, the webpages associated with the professional domain can be set in advance by an administrator and may be a single webpage or several webpages. When the first corpus is obtained, text information is taken directly from the associated webpage or webpages and used as the first corpus.
The step of obtaining the second corpus of the non-professional domain comprises:
Obtaining second text information from webpages not associated with the professional domain, and using the second text information as the second corpus.
Specifically, any webpage other than those associated with the professional domain can serve as a source for the second corpus. For example, in the medical field, one of the most authoritative websites is a particular medical forum made up of many webpages. The first corpus can then be obtained from the webpages of that forum, while any webpage outside the forum, such as the pages of a news website, can serve as a source for the second corpus. In practice, an administrator can also designate in advance other webpages that are not associated with the professional domain, and text information is taken from those designated webpages as the second corpus.
It should be noted that both the first corpus and the second corpus contain a large amount of text. Enlarging the pool of words in this way raises the frequency with which words occur, which in turn improves the recognition rate of technical vocabulary.
Step S102: based on the first corpus and the second corpus, obtain from the first corpus the words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and determine those words to be technical terms;
In general, a technical term occurs somewhat more often in the first corpus than in the second corpus. For example, if an article on a website associated with the chemistry field introduces "methyldichlorosilane", then "methyldichlorosilane" occurs many times in the first corpus. If the second corpus consists entirely of daily news, "methyldichlorosilane" will clearly occur there rarely, or not at all.
In a preferred embodiment of the present invention, the step of obtaining from the first corpus, based on the first corpus and the second corpus, the words whose IDF value is greater than or equal to the preset IDF value, and determining those words to be technical terms, comprises:
Calculating the inverse document frequency of each word in the first corpus to obtain a plurality of first IDF values;
Calculating the inverse document frequency, in the second corpus, of each word of the first corpus to obtain a plurality of second IDF values;
When the difference between the first IDF value and the second IDF value of any word is greater than or equal to the preset IDF value, determining that the word is a technical term.
Specifically, after the first corpus is obtained, its text is first segmented into words. The IDF (inverse document frequency) of each word is then calculated, and the IDF in the second corpus of each of those words (the words produced by segmenting the first corpus) is calculated as well. This yields, for each word, an IDF in the first corpus (the first IDF) and an IDF in the second corpus (the second IDF). When the difference between the first IDF and the second IDF of any word is greater than or equal to the preset IDF, that word is determined to be a technical term, and in this way the technical terms in the first corpus are determined.
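The comparison in step S102 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `idf_table` and `candidate_terms`, the smoothed formula log((1 + N) / (1 + df)), the toy documents, and the threshold of 1.0 are all hypothetical. The text does not spell out the direction of the difference, so the sketch assumes second-corpus IDF minus first-corpus IDF, because under the usual definition a word that is common in the professional corpus has the smaller IDF there.

```python
import math
from collections import Counter

def idf_table(docs, vocab):
    """Smoothed IDF of every word in `vocab` over `docs` (a list of token lists)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc) & vocab)  # count each word once per document
    # Assumed smoothing so that words absent from a corpus still get a finite IDF.
    return {w: math.log((1 + n_docs) / (1 + df[w])) for w in vocab}

def candidate_terms(domain_docs, general_docs, threshold):
    """Words of the domain corpus whose IDF gap between corpora reaches `threshold`."""
    vocab = {w for doc in domain_docs for w in doc}
    idf_domain = idf_table(domain_docs, vocab)    # "first IDF" values
    idf_general = idf_table(general_docs, vocab)  # "second IDF" values
    # Assumed direction: general-corpus IDF minus domain-corpus IDF.
    return {w for w in vocab if idf_general[w] - idf_domain[w] >= threshold}

# Toy, already segmented documents (hypothetical data).
domain_docs = [["silane", "reacts", "with", "water"],
               ["silane", "is", "a", "gas"]]
general_docs = [["water", "is", "wet"],
                ["the", "news", "is", "out"],
                ["with", "a", "smile"]]

terms = candidate_terms(domain_docs, general_docs, threshold=1.0)
print(terms)
```

On this toy data only "silane" clears the threshold, since it appears in every professional document and in no general one.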
Step S103: based on the positional information of the technical terms, determine new technical terms from the first corpus.
Step S102 can identify relatively simple technical terms, but more complex ones may go unrecognized. For example, in step S102, segmenting "methyldichlorosilane" yields "methyl", "di", "chloro", and "silane". Of these, "methyl", "chloro", and "silane" are themselves technical terms, so the terms finally determined do not include "methyldichlorosilane". However, the four words "methyl", "di", "chloro", and "silane" have essentially the same IDF and almost always occur together, so the technical term "methyldichlorosilane" needs to be identified in a further step.
In a preferred embodiment of the present invention, the step of determining new technical terms from the first corpus based on the positional information of the technical terms comprises:
Obtaining the positional information of the technical terms in the first corpus, the positional information comprising mutual information and left and right entropy information;
Obtaining new technical terms in the first corpus according to the mutual information and the left and right entropy information.
Here, mutual information reflects the degree of interdependence between two variables. Binary mutual information measures the correlation between two events and is calculated as follows:
$$MI(X, Y) = \log_2 \frac{P(X, Y)}{P(X)\,P(Y)}$$
The higher the mutual information value, the more strongly X and Y are correlated and the more likely X and Y are to form a phrase; conversely, the lower the value, the weaker the correlation between X and Y and the more likely a phrase boundary lies between them. In the formula, X and Y are two adjacent words and P denotes probability of occurrence.
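The mutual information of adjacent words can be illustrated with a short Python sketch. It assumes the common pointwise form log2(P(X, Y) / (P(X) P(Y))) with probabilities estimated from raw counts; the function name `pmi_adjacent` and the toy token sequence are hypothetical and not taken from the patent.

```python
import math
from collections import Counter

def pmi_adjacent(tokens):
    """Pointwise mutual information (log base 2) of every adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # probability of the pair occurring together
        p_x = unigrams[x] / n_uni    # probability of the left word
        p_y = unigrams[y] / n_uni    # probability of the right word
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical segmented text: "methyl silane" always occurs together,
# while "the" is followed by a different word each time.
tokens = ["methyl", "silane", "burns", "methyl", "silane", "reacts",
          "the", "gas", "the", "liquid", "the", "end"]
scores = pmi_adjacent(tokens)
print(scores[("methyl", "silane")], scores[("the", "gas")])
```

The pair that always co-occurs scores higher than the pair whose left word combines freely, which matches the reading that a high value suggests a phrase and a low value suggests a boundary.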
Entropy is a measure of the uncertainty of a random variable. Specifically, if X is a random variable that takes finitely many values (in other words, X is the probability field of a finite set of discrete events) and the probability that X takes the value x is P(x), then the entropy of X is defined as:
$$H(X) = -\sum_{x \in X} P(x)\log_2 P(x)$$
Left and right entropy refers to the entropy of the left boundary and the entropy of the right boundary of a multi-word expression. The formulas are as follows:
$$E_L(W) = -\sum_{a \in A} P(aW \mid W)\log_2 P(aW \mid W)$$
$$E_R(W) = -\sum_{b \in B} P(Wb \mid W)\log_2 P(Wb \mid W)$$
where W is the word string, A is the set of words that occur immediately to its left, and B is the set of words that occur immediately to its right.
To calculate the left entropy, for example, take all the words that can occur to the left of a string together with their frequencies, compute the information entropy term for each, and sum them. If the entropy is 0, the string has only one possible neighbor on that side. The two statistics the algorithm mainly uses, mutual information and entropy, measure respectively how tightly the inside of a word string is bound and how free its outer boundary is, and phrases are extracted on that basis.
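The left and right entropy calculation described above can be sketched in Python as follows. This is an illustrative reading rather than the patent's own code: the function names `boundary_entropy` and `left_right_entropy` and the toy sentence are assumptions, and neighbor probabilities are estimated from raw counts.

```python
import math
from collections import Counter

def boundary_entropy(neighbors):
    """Entropy (base 2) of a Counter of neighboring words."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())

def left_right_entropy(tokens, phrase):
    """Left and right boundary entropy of `phrase` (a list of words) in `tokens`."""
    k = len(phrase)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] == phrase:
            if i > 0:
                left[tokens[i - 1]] += 1      # word immediately to the left
            if i + k < len(tokens):
                right[tokens[i + k]] += 1     # word immediately to the right
    return boundary_entropy(left), boundary_entropy(right)

# Hypothetical segmented text.
tokens = "the methyl silane burns and methyl silane reacts".split()

whole = left_right_entropy(tokens, ["methyl", "silane"])
part = left_right_entropy(tokens, ["methyl"])
print(whole, part)
```

Here "methyl" alone has a right entropy of 0 because it is always followed by "silane", which signals that the string should be extended to the right, whereas "methyl silane" has varied neighbors on both sides.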
For example, applying the mutual information and left and right entropy algorithm to the four words "methyl", "di", "chloro", and "silane" determines the new technical term "methyldichlorosilane".
In a preferred embodiment of the present invention, the step of determining new technical terms from the first corpus based on the positional information of the technical terms further comprises:
Inputting the IDF values of the technical terms and the positional information into a preset conditional random field model to obtain new technical terms in the first corpus.
Further, besides determining new technical terms from the positional information of the technical terms alone, the IDF and the positional information of the technical terms can also be input into a preset CRF (conditional random field) model to obtain new technical terms.
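The text does not describe the features or the training of the conditional random field. As one hypothetical reading, the IDF values and positional statistics could be packed into one feature dictionary per word before being handed to a sequence labeller. In the sketch below the function name `token_features`, the feature keys, and the lookup tables are all assumptions, and no particular CRF library is implied.

```python
def token_features(tokens, idf, pmi, entropy):
    """Build one feature dict per token for a sequence labeller such as a CRF.

    idf:     word -> IDF value
    pmi:     (word, next_word) -> mutual information of the adjacent pair
    entropy: word -> (left_entropy, right_entropy)
    Missing entries fall back to 0.0 (an assumed default).
    """
    feats = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        left_h, right_h = entropy.get(tok, (0.0, 0.0))
        feats.append({
            "token": tok,
            "idf": idf.get(tok, 0.0),
            "pmi_next": pmi.get((tok, nxt), 0.0),
            "left_entropy": left_h,
            "right_entropy": right_h,
        })
    return feats

# Hypothetical precomputed statistics for a two-word sequence.
feats = token_features(
    ["methyl", "silane"],
    idf={"methyl": 1.2},
    pmi={("methyl", "silane"): 2.5},
    entropy={"methyl": (1.0, 0.0)},
)
print(feats[0])
```

A labeller trained on such features could then tag each word as beginning, continuing, or lying outside a technical term; that tagging scheme is likewise an assumption.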
It should be noted that, besides determining new technical terms of the kind shown in the example above, steps S102 to S103 can also determine new technical terms to the left or right of a technical term; details are not repeated here.
In embodiments of the present invention, a first corpus of the professional domain to which the technical terms belong and a second corpus of a non-professional domain are first obtained; then, based on the first corpus and the second corpus, the words whose IDF value is greater than a preset IDF value are obtained from the first corpus and determined to be technical terms. Contrasting the professional-domain corpus with the non-professional-domain corpus in this way picks out the words that occur frequently in the professional-domain corpus. Compared with the prior art, which identifies technical terms in text using only common word segmentation techniques, this application identifies technical terms more completely, which raises the recognition rate of technical terms and in turn the quality of natural language processing.
Further, new technical terms are obtained from the first corpus based on the IDF values and positional information of the technical terms. In this way, technical terms that occur infrequently but appear near frequently occurring technical terms can also be determined, which further raises the recognition rate of technical terms and the quality of natural language processing.
Fig. 2 is a structural diagram of an apparatus for recognizing technical terms provided by another embodiment of this application. As shown in Fig. 2, the apparatus of this embodiment may comprise:
An obtaining module 201, configured to obtain a first corpus of the professional domain to which the technical terms belong, and to obtain a second corpus of a non-professional domain;
A first determining module 202, configured to obtain from the first corpus, based on the first corpus and the second corpus, the words whose IDF value is greater than or equal to a preset IDF value, and to determine those words to be technical terms;
A second determining module 203, configured to determine new technical terms from the first corpus based on the positional information of the technical terms.
In a preferred embodiment of the present invention, the obtaining module is specifically configured to:
Obtain first text information from webpages associated with the professional domain, and use the first text information as the first corpus.
In a preferred embodiment of the present invention, the obtaining module is specifically configured to:
Obtain second text information from webpages not associated with the professional domain, and use the second text information as the second corpus.
In a preferred embodiment of the present invention, the first determining module comprises:
A first calculation submodule, configured to calculate the inverse document frequency of each word in the first corpus to obtain a plurality of first IDF values;
A second calculation submodule, configured to calculate the inverse document frequency, in the second corpus, of each word of the first corpus to obtain a plurality of second IDF values;
A comparison submodule, configured to determine that a word is a technical term when the difference between its first IDF value and its second IDF value is greater than or equal to the preset IDF value.
In a preferred embodiment of the present invention, the second determining module comprises:
A positional information obtaining submodule, configured to obtain the positional information of the technical terms in the first corpus, the positional information comprising mutual information and left and right entropy information;
A first calculation submodule, configured to obtain new technical terms in the first corpus according to the mutual information and the left and right entropy information.
In a preferred embodiment of the present invention, the second determining module further comprises:
A second calculation submodule, configured to input the IDF values of the technical terms and the positional information into a preset conditional random field model to obtain new technical terms in the first corpus.
The apparatus for recognizing technical terms of this embodiment can perform the method for recognizing technical terms shown in the first embodiment of this application. Its implementation principle is similar and is not repeated here.
Yet another embodiment of this application provides an electronic device comprising a memory and a processor, with at least one program stored in the memory. When executed by the processor, the program achieves the following compared with the prior art: a first corpus of the professional domain to which the technical terms belong and a second corpus of a non-professional domain are obtained; then, based on the first corpus and the second corpus, the words whose IDF value is greater than a preset IDF value are obtained from the first corpus and determined to be technical terms. Contrasting the professional-domain corpus with the non-professional-domain corpus in this way picks out the words that occur frequently in the professional-domain corpus. Compared with the prior art, which identifies technical terms in text using only common word segmentation techniques, this application identifies technical terms more completely, which raises the recognition rate of technical terms and in turn the quality of natural language processing.
Further, new technical terms are obtained from the first corpus based on the IDF values and positional information of the technical terms. In this way, technical terms that occur infrequently but appear near frequently occurring technical terms can also be determined, which further raises the recognition rate of technical terms and the quality of natural language processing.
An alternative embodiment provides an electronic device. As shown in Fig. 3, the electronic device 3000 comprises a processor 3001 and a memory 3003. The processor 3001 is connected to the memory 3003, for example through a bus 3002. Optionally, the electronic device 3000 may also comprise a transceiver 3004. It should be noted that, in practice, the transceiver 3004 is not limited to one, and the structure of the electronic device 3000 does not limit the embodiments of this application.
The processor 3001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 3001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 3002 may include a path for transferring information between the above components. The bus 3002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 3, but this does not mean that there is only one bus or one type of bus.
The memory 3003 may be a ROM or another type of static storage device capable of storing static information and instructions, a RAM or another type of dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 3003 stores the application program code for carrying out the solution of the present application, and its execution is controlled by the processor 3001. The processor 3001 executes the application program code stored in the memory 3003 to implement what is shown in any of the foregoing method embodiments.
The electronic device includes, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable media players), and vehicle-mounted terminals (for example, vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
Another embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the program runs on a computer, the computer can execute the corresponding content of the foregoing method embodiments. Compared with the prior art, a first corpus of the professional domain corresponding to the technical terms and a second corpus of a non-professional domain are obtained; then, based on the first corpus and the second corpus, words whose inverse document frequency (IDF) value is greater than a preset IDF value are obtained from the first corpus and determined to be technical terms. Comparing the professional-domain corpus with the non-professional-domain corpus in this way identifies the words that occur frequently in the professional-domain corpus. Whereas the prior art identifies technical terms in text using only ordinary word segmentation, the present application identifies technical terms more completely through this comparison, which improves the recognition rate of technical terms and, in turn, the quality of natural language processing.
Further, new technical terms are then obtained from the first corpus based on the inverse document frequency (IDF) value and the location information of the technical terms. In this way, technical terms that occur infrequently near frequently occurring technical terms can also be identified, which further improves the recognition rate of technical terms and the quality of natural language processing.
The embodiments of the present invention further include the following:
A1. A method for recognizing technical terms, comprising:
obtaining a first corpus of the professional domain corresponding to the technical terms, and obtaining a second corpus of a non-professional domain;
based on the first corpus and the second corpus, obtaining from the first corpus the words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and determining those words to be technical terms;
based on the location information of the technical terms, determining new technical terms from the first corpus.
A2. The method for recognizing technical terms according to A1, wherein obtaining the first corpus of the professional domain corresponding to the technical terms comprises:
obtaining first text information from web pages associated with the professional domain, and using the first text information as the first corpus.
A3. The method for recognizing technical terms according to A1, wherein the step of obtaining the second corpus of the non-professional domain comprises:
obtaining second text information from web pages not associated with the professional domain, and using the second text information as the second corpus.
A4. The method for recognizing technical terms according to A1, wherein the step of obtaining from the first corpus, based on the first corpus and the second corpus, the words whose inverse document frequency (IDF) value is greater than or equal to the preset IDF value and determining those words to be technical terms comprises:
calculating the IDF of each word in the first corpus to obtain a plurality of first IDF values;
calculating the IDF, in the second corpus, of each word in the first corpus to obtain a plurality of second IDF values;
when the difference between the first IDF value and the second IDF value of any word is greater than or equal to the preset IDF value, determining that the word is a technical term.
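The comparison in A4 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `idf_table` and `select_terms`, the representation of a corpus as a list of token lists, and the smoothed formula log(N / (1 + df)) are all assumptions, since this passage does not state how the IDF value is computed. The comparison itself is written exactly as A4 states it (first value minus second value, compared with a preset value).

```python
import math
from typing import Dict, Iterable, List, Set


def idf_table(docs: List[List[str]], vocab: Iterable[str]) -> Dict[str, float]:
    """Return an IDF value for every word in vocab over docs.

    Assumption: smoothed IDF, log(N / (1 + df)); the source passage
    does not give the exact formula.
    """
    n_docs = len(docs)
    doc_sets = [set(doc) for doc in docs]
    table = {}
    for word in vocab:
        df = sum(1 for s in doc_sets if word in s)  # documents containing word
        table[word] = math.log(n_docs / (1 + df))
    return table


def select_terms(first_idf: Dict[str, float],
                 second_idf: Dict[str, float],
                 preset: float) -> Set[str]:
    """Keep words whose (first IDF - second IDF) is >= the preset value.

    second_idf must hold a value for every word of the first corpus,
    as in A4.
    """
    return {w for w, v in first_idf.items() if v - second_idf[w] >= preset}


# Hypothetical precomputed values, for illustration only.
demo_first = {"gradient": 3.0, "the": 0.2}
demo_second = {"gradient": 0.5, "the": 0.1}
demo_terms = select_terms(demo_first, demo_second, 2.0)
```

Which of the two values is larger for a domain-specific word depends on the IDF definition in use, so the preset value and the direction of the subtraction would need to match whatever definition the full description adopts.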
A5. The method for recognizing technical terms according to A1, wherein the step of determining new technical terms from the first corpus based on the location information of the technical terms comprises:
obtaining the location information of the technical terms in the first corpus, the location information including mutual information and left and right entropy information;
obtaining new technical terms in the first corpus according to the mutual information and the left and right entropy information.
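For readers unfamiliar with the two quantities named in A5, the sketch below computes them using their standard textbook definitions: pointwise mutual information for an adjacent word pair, and the entropy of the words seen immediately to the left and right of a phrase. The function names and the flat token-list input are assumptions made for illustration; how the applicant thresholds or combines these values is not stated in this passage.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def pmi(tokens: Sequence[str], x: str, y: str) -> float:
    """Pointwise mutual information of the adjacent pair (x, y)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(x, y)] == 0:
        return float("-inf")  # the pair never occurs
    p_xy = bigrams[(x, y)] / (n - 1)
    return math.log(p_xy / ((unigrams[x] / n) * (unigrams[y] / n)))


def _entropy(counts: Counter) -> float:
    """Shannon entropy (natural log) of a count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def left_right_entropy(tokens: Sequence[str],
                       phrase: Tuple[str, ...]) -> Tuple[float, float]:
    """Entropy of the neighbours directly before and after each phrase hit."""
    k = len(phrase)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - k + 1):
        if tuple(tokens[i:i + k]) == phrase:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + k < len(tokens):
                right[tokens[i + k]] += 1
    return _entropy(left), _entropy(right)
```

A high mutual information value suggests the words of a candidate belong together, and high left and right entropy suggest the candidate appears in varied contexts, which is why the two are commonly used together when looking for new multi-word terms.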
A6. The method for recognizing technical terms according to A1 or A5, wherein the step of determining new technical terms from the first corpus based on the location information of the technical terms further comprises:
inputting the inverse document frequency (IDF) value of the technical terms and the location information into a preset conditional random field model to obtain new technical terms in the first corpus.
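As a rough illustration of what "inputting the IDF value and the location information" into a conditional random field model could look like, the sketch below only prepares per-token feature dictionaries; it does not train or call any CRF library. The function name `crf_features`, the feature keys, and the use of neighbouring tokens' IDF values are all hypothetical choices, not details taken from the application.

```python
from typing import Dict, List, Mapping, Sequence


def crf_features(tokens: Sequence[str],
                 idf: Mapping[str, float],
                 location: Mapping[str, Mapping[str, float]]
                 ) -> List[Dict[str, float]]:
    """Build one feature dict per token from IDF and location information.

    location maps a token to its precomputed values under the hypothetical
    keys "mutual_info", "left_entropy" and "right_entropy".
    Missing values default to 0.0.
    """
    rows = []
    for i, tok in enumerate(tokens):
        loc = location.get(tok, {})
        feats = {
            "bias": 1.0,
            "idf": idf.get(tok, 0.0),
            "mutual_info": loc.get("mutual_info", 0.0),
            "left_entropy": loc.get("left_entropy", 0.0),
            "right_entropy": loc.get("right_entropy", 0.0),
        }
        # Context features: a CRF labels a token using its neighbours too.
        if i > 0:
            feats["prev_idf"] = idf.get(tokens[i - 1], 0.0)
        if i < len(tokens) - 1:
            feats["next_idf"] = idf.get(tokens[i + 1], 0.0)
        rows.append(feats)
    return rows
```

The resulting list has one entry per token, in sentence order, which is the general shape sequence-labelling models expect; the labelling scheme and the trained model itself are outside this sketch.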
B7. A device for identifying technical terms, comprising:
an acquisition module, configured to obtain a first corpus of the professional domain corresponding to the technical terms, and to obtain a second corpus of a non-professional domain;
a first determining module, configured to obtain from the first corpus, based on the first corpus and the second corpus, the words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and to determine those words to be technical terms;
a second determining module, configured to determine new technical terms from the first corpus based on the location information of the technical terms.
B8. The device for identifying technical terms according to B7, wherein the acquisition module is specifically configured to:
obtain first text information from web pages associated with the professional domain, and use the first text information as the first corpus.
B9. The device for identifying technical terms according to B7, wherein the acquisition module is specifically configured to:
obtain second text information from web pages not associated with the professional domain, and use the second text information as the second corpus.
B10. The device for identifying technical terms according to B7, wherein the first determining module comprises:
a first computational submodule, configured to calculate the inverse document frequency (IDF) of each word in the first corpus to obtain a plurality of first IDF values;
a second computational submodule, configured to calculate the IDF, in the second corpus, of each word in the first corpus to obtain a plurality of second IDF values;
a comparison submodule, configured to determine that a word is a technical term when the difference between the first IDF value and the second IDF value of that word is greater than or equal to the preset IDF value.
B11. The device for identifying technical terms according to B7, wherein the second determining module comprises:
a location information acquisition submodule, configured to obtain the location information of the technical terms in the first corpus, the location information including mutual information and left and right entropy information;
a first computational submodule, configured to obtain new technical terms in the first corpus according to the mutual information and the left and right entropy information.
B12. The device for identifying technical terms according to B7 or B11, wherein the second determining module further comprises:
a second computational submodule, configured to input the inverse document frequency (IDF) value of the technical terms and the location information into a preset conditional random field model to obtain new technical terms in the first corpus.
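The module structure of B7 can be pictured as three pluggable parts run in sequence. The sketch below is only an assumed arrangement: the class name `TermRecognitionDevice`, the callable signatures, and the choice to return the union of the initial and new terms are illustrative and not taken from the application.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

Corpus = List[List[str]]  # assumed shape: a list of tokenised documents


@dataclass
class TermRecognitionDevice:
    # Acquisition module: returns (first corpus, second corpus).
    acquire: Callable[[], Tuple[Corpus, Corpus]]
    # First determining module: picks technical terms by comparing corpora.
    first_determine: Callable[[Corpus, Corpus], Set[str]]
    # Second determining module: finds new terms from the known ones.
    second_determine: Callable[[Corpus, Set[str]], Set[str]]

    def run(self) -> Set[str]:
        first, second = self.acquire()
        terms = self.first_determine(first, second)
        new_terms = self.second_determine(first, terms)
        return terms | new_terms  # assumption: report both sets together
```

Keeping each module behind a plain callable makes it possible to swap, for example, the second determining module between a mutual-information rule (B11) and a conditional random field model (B12) without touching the rest.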
C13. An electronic device, comprising:
a processor, a memory, and a bus;
the bus being configured to connect the processor and the memory;
the memory being configured to store operation instructions;
the processor being configured to execute, by calling the operation instructions, the method for recognizing technical terms according to any one of A1 to A6.
D14. A computer-readable storage medium for storing computer instructions which, when run on a computer, enable the computer to execute the method for recognizing technical terms according to any one of A1 to A6.
It should be understood that although the steps in the flowcharts of the drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A method for recognizing technical terms, characterized by comprising:
obtaining a first corpus of the professional domain corresponding to the technical terms, and obtaining a second corpus of a non-professional domain;
based on the first corpus and the second corpus, obtaining from the first corpus the words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and determining those words to be technical terms;
based on the location information of the technical terms, determining new technical terms from the first corpus.
2. The method for recognizing technical terms according to claim 1, characterized in that obtaining the first corpus of the professional domain corresponding to the technical terms comprises:
obtaining first text information from web pages associated with the professional domain, and using the first text information as the first corpus.
3. The method for recognizing technical terms according to claim 1, characterized in that the step of obtaining the second corpus of the non-professional domain comprises:
obtaining second text information from web pages not associated with the professional domain, and using the second text information as the second corpus.
4. The method for recognizing technical terms according to claim 1, characterized in that the step of obtaining from the first corpus, based on the first corpus and the second corpus, the words whose inverse document frequency (IDF) value is greater than or equal to the preset IDF value and determining those words to be technical terms comprises:
calculating the IDF of each word in the first corpus to obtain a plurality of first IDF values;
calculating the IDF, in the second corpus, of each word in the first corpus to obtain a plurality of second IDF values;
when the difference between the first IDF value and the second IDF value of any word is greater than or equal to the preset IDF value, determining that the word is a technical term.
5. The method for recognizing technical terms according to claim 1, characterized in that the step of determining new technical terms from the first corpus based on the location information of the technical terms comprises:
obtaining the location information of the technical terms in the first corpus, the location information including mutual information and left and right entropy information;
obtaining new technical terms in the first corpus according to the mutual information and the left and right entropy information.
6. The method for recognizing technical terms according to claim 1 or 5, characterized in that the step of determining new technical terms from the first corpus based on the location information of the technical terms further comprises:
inputting the inverse document frequency (IDF) value of the technical terms and the location information into a preset conditional random field model to obtain new technical terms in the first corpus.
7. A device for identifying technical terms, characterized by comprising:
an acquisition module, configured to obtain a first corpus of the professional domain corresponding to the technical terms, and to obtain a second corpus of a non-professional domain;
a first determining module, configured to obtain from the first corpus, based on the first corpus and the second corpus, the words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and to determine those words to be technical terms;
a second determining module, configured to determine new technical terms from the first corpus based on the location information of the technical terms.
8. The device for identifying technical terms according to claim 7, characterized in that the acquisition module is specifically configured to:
obtain first text information from web pages associated with the professional domain, and use the first text information as the first corpus.
9. An electronic device, characterized by comprising:
a processor, a memory, and a bus;
the bus being configured to connect the processor and the memory;
the memory being configured to store operation instructions;
the processor being configured to execute, by calling the operation instructions, the method for recognizing technical terms according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that the computer storage medium is configured to store computer instructions which, when run on a computer, enable the computer to execute the method for recognizing technical terms according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910457246.4A CN110175331B (en) | 2019-05-29 | 2019-05-29 | Method and device for identifying professional terms, electronic equipment and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910457246.4A CN110175331B (en) | 2019-05-29 | 2019-05-29 | Method and device for identifying professional terms, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date
---|---
CN110175331A | 2019-08-27
CN110175331B | 2021-05-11
Family
ID=67695908
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910457246.4A Active CN110175331B (en) | 2019-05-29 | 2019-05-29 | Method and device for identifying professional terms, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175331B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111950274A (en) * | 2020-07-31 | 2020-11-17 | 中国工商银行股份有限公司 | Chinese word segmentation method and device for linguistic data in professional field |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049501A (en) * | 2012-12-11 | 2013-04-17 | 上海大学 | Chinese domain term recognition method based on mutual information and conditional random field model |
CN103995876A (en) * | 2014-05-26 | 2014-08-20 | 上海大学 | Text classification method based on chi square statistics and SMO algorithm |
CN104572622A (en) * | 2015-01-05 | 2015-04-29 | 语联网(武汉)信息技术有限公司 | Term filtering method |
US20160239482A1 (en) * | 2015-02-13 | 2016-08-18 | International Business Machines Corporation | Identifying word-senses based on linguistic variations |
CN106445907A (en) * | 2015-08-06 | 2017-02-22 | 北京国双科技有限公司 | Domain lexicon generation method and apparatus |
CN106776558A (en) * | 2016-12-14 | 2017-05-31 | 北京工业大学 | Merge the domain term recognition method of language ambience information |
Also Published As
Publication number | Publication date |
---|---|
CN110175331B (en) | 2021-05-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20200806. Address after: 518057 Nanshan District science and technology zone, Guangdong, Zhejiang Province, science and technology in the Tencent Building on the 1st floor of the 35 layer. Applicant after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. Address before: 100029, Beijing, Chaoyang District new East Street, building No. 2, -3 to 25, 101, 8, 804 rooms. Applicant before: Tricorn (Beijing) Technology Co.,Ltd. |
GR01 | Patent grant | |