CN110175331A - Technical term recognition method, apparatus, electronic device, and readable storage medium - Google Patents

Technical term recognition method, apparatus, electronic device, and readable storage medium

Info

Publication number
CN110175331A
Authority
CN
China
Prior art keywords
corpus
technical term
vocabulary
frequency value
reverse document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910457246.4A
Other languages
Chinese (zh)
Other versions
CN110175331B (en
Inventor
王卓然
亓超
马宇驰
陈华荣
秦海龙
郭伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Triangle Animal (beijing) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Triangle Animal (beijing) Technology Co Ltd
Priority to CN201910457246.4A
Publication of CN110175331A
Application granted
Publication of CN110175331B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

This application provides a technical term recognition method, apparatus, electronic device, and computer-readable storage medium, and relates to the field of natural language processing. The method includes: obtaining a first corpus from the professional domain to which the technical terms belong and a second corpus from a non-professional domain; then, based on the first corpus and the second corpus, obtaining from the first corpus the words whose inverse document frequency (IDF) value is greater than a preset IDF value, and determining those words to be technical terms. By contrasting the two corpora, the application can identify technical terms more completely, which raises the recognition rate of technical terms and in turn improves the quality of natural language processing. Further, new technical terms are obtained from the first corpus based on the position information of the identified technical terms, which further raises the recognition rate and improves the quality of natural language processing.

Description

Technical term recognition method, apparatus, electronic device, and readable storage medium
Technical field
This application relates to the technical field of natural language processing, and in particular to a technical term recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
Natural language processing is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field therefore involves natural language, that is, the language people use every day, so it is closely related to linguistics while differing from it in important ways.
In natural language processing, the most important step is keyword extraction. In existing natural language processing technology, TF-IDF (term frequency-inverse document frequency) is the most common keyword extraction method. However, because of the particularities of professional domains, the technical terms in some domains are complex and are hard to identify in text with ordinary word segmentation techniques. As a result, the recognition rate of technical terms is low, which in turn lowers the quality of natural language processing.
Summary of the invention
This application provides a technical term recognition method, apparatus, electronic device, and computer-readable storage medium, which can solve the problem in the natural language processing field that the prior art has a low recognition rate for technical terms and therefore poor natural language processing quality. The technical solution is as follows:
In a first aspect, a technical term recognition method is provided, the method comprising:
Obtaining a first corpus from the professional domain to which the technical terms belong, and obtaining a second corpus from a non-professional domain;
Based on the first corpus and the second corpus, obtaining from the first corpus the words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and determining those words to be technical terms;
Determining new technical terms from the first corpus based on the position information of the technical terms.
Preferably, obtaining the first corpus from the professional domain to which the technical terms belong comprises:
Obtaining first text information from web pages associated with the professional domain, and using the first text information as the first corpus.
Preferably, the step of obtaining the second corpus from the non-professional domain comprises:
Obtaining second text information from web pages not associated with the professional domain, and using the second text information as the second corpus.
Preferably, the step of obtaining from the first corpus, based on the first corpus and the second corpus, the words whose IDF value is greater than or equal to the preset IDF value, and determining those words to be technical terms, comprises:
Calculating the IDF of each word in the first corpus to obtain a plurality of first IDF values;
Calculating the IDF, in the second corpus, of each word in the first corpus to obtain a plurality of second IDF values;
When the difference between the first IDF value and the second IDF value of any word is greater than or equal to the preset IDF value, determining that word to be a technical term.
Preferably, the step of determining new technical terms from the first corpus based on the position information of the technical terms comprises:
Obtaining the position information of the technical terms in the first corpus, the position information including mutual information and left and right entropy information;
Obtaining new technical terms in the first corpus according to the mutual information and the left and right entropy information.
Preferably, the step of determining new technical terms from the first corpus based on the position information of the technical terms further comprises:
Inputting the IDF values of the technical terms and the position information into a preset conditional random field model to obtain new technical terms in the first corpus.
In a second aspect, a technical term recognition apparatus is provided, the apparatus comprising:
An obtaining module, configured to obtain a first corpus from the professional domain to which the technical terms belong, and to obtain a second corpus from a non-professional domain;
A first determining module, configured to obtain from the first corpus, based on the first corpus and the second corpus, the words whose IDF value is greater than or equal to a preset IDF value, and to determine those words to be technical terms;
A second determining module, configured to determine new technical terms from the first corpus based on the position information of the technical terms.
Preferably, the obtaining module is specifically configured to:
Obtain first text information from web pages associated with the professional domain, and use the first text information as the first corpus.
Preferably, the obtaining module is specifically configured to:
Obtain second text information from web pages not associated with the professional domain, and use the second text information as the second corpus.
Preferably, the first determining module comprises:
A first calculation submodule, configured to calculate the IDF of each word in the first corpus to obtain a plurality of first IDF values;
A second calculation submodule, configured to calculate the IDF, in the second corpus, of each word in the first corpus to obtain a plurality of second IDF values;
A comparison submodule, configured to determine a word to be a technical term when the difference between its first IDF value and its second IDF value is greater than or equal to the preset IDF value.
Preferably, the second determining module comprises:
A position information obtaining submodule, configured to obtain the position information of the technical terms in the first corpus, the position information including mutual information and left and right entropy information;
A first calculation submodule, configured to obtain new technical terms in the first corpus according to the mutual information and the left and right entropy information.
Preferably, the second determining module further comprises:
A second calculation submodule, configured to input the IDF values of the technical terms and the position information into a preset conditional random field model to obtain new technical terms in the first corpus.
In a third aspect, an electronic device is provided, the electronic device comprising:
A processor, a memory, and a bus;
The bus is configured to connect the processor and the memory;
The memory is configured to store operation instructions;
The processor is configured to call the operation instructions, which cause the processor to perform the operations corresponding to the technical term recognition method of the first aspect of this application.
In a fourth aspect, a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium, and when the program is executed by a processor it implements the technical term recognition method of the first aspect of this application.
The technical solution provided by this application has the following beneficial effects:
A first corpus from the professional domain to which the technical terms belong and a second corpus from a non-professional domain are obtained. Then, based on the first corpus and the second corpus, the words whose IDF value is greater than a preset IDF value are obtained from the first corpus and determined to be technical terms. Comparing the professional-domain corpus with the non-professional corpus in this way identifies the words that appear frequently in the professional-domain corpus. Compared with the prior art, which identifies technical terms in text using ordinary word segmentation alone, this contrastive approach can identify technical terms more completely, which raises the recognition rate of technical terms and in turn improves the quality of natural language processing.
Further, new technical terms are obtained from the first corpus based on the IDF values and position information of the identified technical terms. This makes it possible to find technical terms that appear less often but occur near frequently occurring technical terms, which further raises the recognition rate of technical terms and improves the quality of natural language processing.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below.
Fig. 1 is a schematic flowchart of a technical term recognition method provided by one embodiment of this application;
Fig. 2 is a schematic structural diagram of a technical term recognition apparatus provided by another embodiment of this application;
Fig. 3 is a schematic structural diagram of an electronic device for technical term recognition provided by another embodiment of this application.
Detailed description of the embodiments
Embodiments of this application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain this application, and are not to be construed as limiting this application.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the description of this application means that the stated features, integers, steps, operations, elements, and/or components are present, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes all or any unit of, and all combinations of, one or more of the associated listed items.
To make the purposes, technical solutions, and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the drawings.
The technical term recognition method, apparatus, electronic device, and computer-readable storage medium provided by this application are intended to solve the above technical problems of the prior art.
The technical solution of this application, and how it solves the above technical problems, are described in detail below through specific embodiments. The specific embodiments below may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of this application are described below with reference to the drawings.
One embodiment provides a technical term recognition method. As shown in Fig. 1, the method comprises:
Step S101: obtain a first corpus from the professional domain to which the technical terms belong, and obtain a second corpus from a non-professional domain;
In practice, each professional domain has its own technical terms, and the complexity of those terms differs from domain to domain. In natural language processing, simpler terms, such as "base station" in the communications field, can be identified by ordinary word segmentation, but more complex ones may not be identified at all. For example, some technical terms in the medical, chemical, and biological fields are very complex, and ordinary word segmentation has a very low recognition rate for them.
To solve this problem, the embodiment of the present invention can obtain a corpus containing technical terms from any professional domain.
In a preferred embodiment of the present invention, obtaining the first corpus from the professional domain to which the technical terms belong comprises:
Obtaining first text information from web pages associated with the professional domain, and using the first text information as the first corpus.
Specifically, each professional domain has corresponding authoritative BBS forums or other types of web pages that hold a large number of professional articles, question-and-answer threads on professional knowledge, help-request posts, and similar content. These articles and posts usually contain many technical terms, which also appear with relatively high frequency, so the embodiment of the present invention can obtain the text of these professional articles and help-request posts as the first corpus.
Further, the web pages associated with the professional domain can be set in advance by an administrator and can be one web page or several. When the first corpus is obtained, text information is taken directly from the associated web page or pages and used as the first corpus.
The step of obtaining the second corpus from the non-professional domain comprises:
Obtaining second text information from web pages not associated with the professional domain, and using the second text information as the second corpus.
Specifically, any web page other than those associated with the professional domain can serve as a source for the second corpus. For example, in the medical field, one of the most authoritative websites is "the identification medicine forum of technical term", a forum that includes multiple web pages. The first corpus can then be obtained from the web pages of that forum, and when the second corpus is obtained, any web pages outside that forum, such as the pages of a news website, can serve as its source. In practice, an administrator can also designate in advance other web pages that are not associated with the professional domain, and text information is then obtained from those designated pages as the second corpus.
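The split between associated and non-associated web pages described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the pages have already been downloaded, and the `pages` mapping, the `domain_urls` set, and the function names are all hypothetical. Only the Python standard library is used.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def page_text(html):
    """Return the visible text of one HTML page as a single string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_corpora(pages, domain_urls):
    """Split already-downloaded pages into (first_corpus, second_corpus).

    pages: dict mapping URL -> HTML string (hypothetical input format).
    domain_urls: set of URLs an administrator associated with the domain.
    """
    first, second = [], []
    for url, html in pages.items():
        # Associated pages feed the first corpus, all others the second.
        (first if url in domain_urls else second).append(page_text(html))
    return first, second
```

Keeping the association list as plain data mirrors the text's point that an administrator configures it in advance.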
It should be noted that both the first corpus and the second corpus contain a large amount of text. Enlarging the pool of words in this way raises the number of times each word occurs, which in turn improves the recognition rate for technical vocabulary.
Step S102: based on the first corpus and the second corpus, obtain from the first corpus the words whose IDF value is greater than or equal to a preset IDF value, and determine those words to be technical terms;
In general, a technical term appears more often in the first corpus than in the second corpus. For example, if an article on a website associated with the chemical field introduces "methyldichlorosilane", the term will appear many times in the first corpus; if the second corpus consists entirely of everyday news content, the term will clearly appear there rarely, if at all.
In a preferred embodiment of the present invention, the step of obtaining from the first corpus, based on the first corpus and the second corpus, the words whose IDF value is greater than or equal to the preset IDF value, and determining those words to be technical terms, comprises:
Calculating the IDF of each word in the first corpus to obtain a plurality of first IDF values;
Calculating the IDF, in the second corpus, of each word in the first corpus to obtain a plurality of second IDF values;
When the difference between the first IDF value and the second IDF value of any word is greater than or equal to the preset IDF value, determining that word to be a technical term.
Specifically, after the first corpus is obtained, its text is first segmented into a plurality of words, and the IDF (inverse document frequency) of each word is calculated. The IDF in the second corpus of each of those words (the words obtained by segmenting the first corpus) is then calculated. Each word thus has an IDF in the first corpus (the first IDF) and an IDF in the second corpus (the second IDF). When the difference between the first IDF and the second IDF of any word is greater than or equal to the preset IDF, that word is determined to be a technical term, and in this way the technical terms in the first corpus are determined.
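The IDF comparison in step S102 might look like the sketch below. Several points are assumptions rather than details from the text: the add-one smoothing in `idf`, the function names, and in particular the sign of the difference. The text does not say which IDF is subtracted from which, so the sketch takes the gap as second IDF minus first IDF, which is positive for a word that is common in the professional corpus and rare in the general one.

```python
import math


def idf(word, documents):
    """Smoothed inverse document frequency of `word` over a list of token sets."""
    containing = sum(1 for doc in documents if word in doc)
    # Add-one smoothing (an assumption) keeps the value finite when the
    # word never appears in the corpus.
    return math.log((1 + len(documents)) / (1 + containing))


def find_terms(first_corpus, second_corpus, preset_idf):
    """Return {word: gap} for words whose IDF gap reaches preset_idf.

    Each corpus is a list of documents; each document is a list of tokens
    produced by word segmentation.
    ASSUMPTION: the gap is (second IDF - first IDF); the source text only
    speaks of "the difference" between the two values.
    """
    first_docs = [set(doc) for doc in first_corpus]
    second_docs = [set(doc) for doc in second_corpus]
    vocabulary = set().union(*first_docs) if first_docs else set()
    terms = {}
    for word in vocabulary:
        gap = idf(word, second_docs) - idf(word, first_docs)
        if gap >= preset_idf:
            terms[word] = gap
    return terms
```

Only words from the first corpus are scored, matching the description that the second IDF is calculated for the words obtained by segmenting the first corpus.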
Step S103: determine new technical terms from the first corpus based on the position information of the technical terms.
Step S102 can identify relatively simple technical terms, but more complex ones may be missed. For example, in step S102 the word segmentation of "methyldichlorosilane" yields "methyl", "di", "chloro", and "silane". Because "methyl", "chloro", and "silane" are themselves technical terms, the technical term finally determined is not "methyldichlorosilane". However, the IDF values of the four words "methyl", "di", "chloro", and "silane" are essentially the same, and the four words almost always appear together, so a further step is needed to identify the technical term "methyldichlorosilane".
In a preferred embodiment of the present invention, the step of determining new technical terms from the first corpus based on the position information of the technical terms comprises:
Obtaining the position information of the technical terms in the first corpus, the position information including mutual information and left and right entropy information;
Obtaining new technical terms in the first corpus according to the mutual information and the left and right entropy information.
Mutual information reflects the degree of interdependence between two variables. Binary mutual information measures the correlation between two events, and is calculated as follows:
$$MI(X, Y) = \log_2 \frac{P(X, Y)}{P(X)\,P(Y)}$$
The higher the mutual information value, the more strongly X and Y are correlated, and the more likely X and Y are to form a phrase. Conversely, the lower the mutual information value, the more weakly X and Y are correlated, and the more likely it is that a phrase boundary lies between X and Y. X and Y in the formula refer to two adjacent words, and P denotes a probability of occurrence.
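As a rough illustration of how the mutual information of adjacent words could be estimated from a segmented corpus, the sketch below counts unigrams and adjacent bigrams and applies the usual pointwise form of the formula. The maximum-likelihood probability estimates and the function name `pmi_table` are assumptions, not details given in the text.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Mutual information (base 2) for every adjacent word pair.

    sentences: list of token lists. Probabilities are maximum-likelihood
    estimates from unigram and adjacent-bigram counts (an assumption).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())
    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / total_pairs
        p_x = unigrams[x] / total_words
        p_y = unigrams[y] / total_words
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table
```

Pairs that always occur together, such as the fragments of one chemical name, score higher than pairs formed with a common function word.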
The term entropy denotes a measure of the uncertainty of a random variable. Specifically, if X is a random variable that takes a finite number of values (in other words, X is the probability field of a finite set of discrete events), and the probability that X takes the value x is P(x), then the entropy of X is defined as:
$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$$
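Left and right entropy refers to the entropy of the left boundary and the entropy of the right boundary of a multi-word expression. For a word string W, with A the set of words that appear immediately to its left and B the set of words that appear immediately to its right, the left and right entropy are:
$$E_L(W) = -\sum_{a \in A} P(aW \mid W) \log_2 P(aW \mid W)$$
$$E_R(W) = -\sum_{b \in B} P(Wb \mid W) \log_2 P(Wb \mid W)$$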
The calculation works as follows, taking the left entropy as an example: collect every word that appears immediately to the left of the string together with its frequency, compute the information-entropy contribution of each, and sum them. An entropy of 0 means the string has only one possible neighbor on that side. The algorithm relies on these two statistics, mutual information and entropy, to extract phrases from two angles: mutual information measures how tightly the parts inside a string are bound together, and left and right entropy measure how free the boundary outside the string is.
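The left and right entropy calculation described above can be sketched in a few lines. The function names and the representation of a candidate as a tuple of tokens are illustrative assumptions; neighbor probabilities are estimated from raw counts.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (base 2) of a Counter of neighbor frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def left_right_entropy(candidate, sentences):
    """Return (left_entropy, right_entropy) of `candidate` in tokenized text.

    candidate: tuple of tokens forming the string under test.
    sentences: list of token lists.
    """
    n = len(candidate)
    left, right = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) != candidate:
                continue
            if i > 0:
                left[tokens[i - 1]] += 1       # word just before the string
            if i + n < len(tokens):
                right[tokens[i + n]] += 1      # word just after the string
    return _entropy(left), _entropy(right)
```

A fragment that is always followed by the same word has a right entropy of 0, which signals that the boundary probably lies further to the right.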
For example, applying the mutual information and left and right entropy algorithm to the four words "methyl", "di", "chloro", and "silane" determines the new technical term "methyldichlorosilane".
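One way to picture how adjacent fragments could be joined into a single term is the greedy merge below. It is only a sketch: the text does not specify a merging procedure, and the `pair_score` dictionary (standing in for precomputed mutual-information values), the threshold, and the function name are hypothetical.

```python
def merge_runs(tokens, pair_score, threshold):
    """Join maximal runs of adjacent tokens whose pair score >= threshold.

    tokens: list of tokens from word segmentation.
    pair_score: dict mapping (left, right) token pairs to a cohesion score,
        for example a mutual-information value (hypothetical input).
    Returns a list of token tuples; a tuple longer than 1 is a merged candidate.
    """
    if not tokens:
        return []
    runs = [[tokens[0]]]
    for prev, cur in zip(tokens, tokens[1:]):
        if pair_score.get((prev, cur), float("-inf")) >= threshold:
            runs[-1].append(cur)   # strongly bound: extend the current run
        else:
            runs.append([cur])     # weak link: start a new run
    return [tuple(run) for run in runs]
```

In a fuller pipeline, each merged candidate would then be checked against its left and right entropy before being accepted as a new term.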
In a preferred embodiment of the present invention, the step of determining new technical terms from the first corpus based on the position information of the technical terms further comprises:
Inputting the IDF values and the position information of the technical terms into a preset conditional random field model to obtain new technical terms in the first corpus.
Further, in addition to determining new technical terms from the position information alone, the IDF and the position information of the technical terms can also be input into a preset CRF (conditional random field) model to obtain new technical terms.
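The text does not describe how the IDF values and position information are encoded for the conditional random field model. The sketch below shows one plausible encoding, with one feature dictionary per token plus B/I/O labels for training data. The feature names, the label scheme, and both function names are assumptions, and no particular CRF library is implied.

```python
def crf_features(tokens, idf_values, pair_score):
    """Build one feature dict per token for a sequence labeller such as a CRF.

    idf_values: dict token -> IDF value (hypothetical precomputed input).
    pair_score: dict (left, right) -> mutual-information value (hypothetical).
    """
    features = []
    for i, token in enumerate(tokens):
        item = {"token": token, "idf": idf_values.get(token, 0.0)}
        if i > 0:
            item["mi_prev"] = pair_score.get((tokens[i - 1], token), 0.0)
        else:
            item["BOS"] = True   # beginning of sequence
        if i < len(tokens) - 1:
            item["mi_next"] = pair_score.get((token, tokens[i + 1]), 0.0)
        else:
            item["EOS"] = True   # end of sequence
        features.append(item)
    return features


def bio_labels(tokens, term):
    """Label tokens B/I inside the first occurrence of `term`, O elsewhere."""
    labels = ["O"] * len(tokens)
    n = len(term)
    if n == 0:
        return labels
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == list(term):
            labels[i:i + n] = ["B"] + ["I"] * (n - 1)
            break
    return labels
```

A trained model would then predict B/I/O labels for unseen sentences, and each predicted B-I run would be read off as a candidate new term.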
It should be noted that, besides the new technical terms shown in the example above, steps S102 to S103 can also determine new technical terms to the left or right of an identified technical term; details are not repeated here.
In the embodiment of the present invention, a first corpus from the professional domain to which the technical terms belong and a second corpus from a non-professional domain are obtained first. Then, based on the first corpus and the second corpus, the words whose IDF value is greater than a preset IDF value are obtained from the first corpus and determined to be technical terms. Comparing the professional-domain corpus with the non-professional corpus in this way identifies the words that appear frequently in the professional-domain corpus. Compared with the prior art, which identifies technical terms in text using ordinary word segmentation alone, this contrastive approach can identify technical terms more completely, which raises the recognition rate of technical terms and in turn improves the quality of natural language processing.
Further, new technical terms are obtained from the first corpus based on the IDF values and position information of the identified technical terms. This makes it possible to find technical terms that appear less often but occur near frequently occurring technical terms, which further raises the recognition rate of technical terms and improves the quality of natural language processing.
Fig. 2 is a schematic structural diagram of a technical term recognition apparatus provided by another embodiment of this application. As shown in Fig. 2, the apparatus of this embodiment may include:
An obtaining module 201, configured to obtain a first corpus from the professional domain to which the technical terms belong, and to obtain a second corpus from a non-professional domain;
A first determining module 202, configured to obtain from the first corpus, based on the first corpus and the second corpus, the words whose IDF value is greater than or equal to a preset IDF value, and to determine those words to be technical terms;
A second determining module 203, configured to determine new technical terms from the first corpus based on the position information of the technical terms.
In a preferred embodiment of the present invention, the obtaining module is specifically configured to:
Obtain first text information from web pages associated with the professional domain, and use the first text information as the first corpus.
In a preferred embodiment of the present invention, the obtaining module is specifically configured to:
Obtain second text information from web pages not associated with the professional domain, and use the second text information as the second corpus.
In a preferred embodiment of the present invention, the first determining module comprises:
A first calculation submodule, configured to calculate the IDF of each word in the first corpus to obtain a plurality of first IDF values;
A second calculation submodule, configured to calculate the IDF, in the second corpus, of each word in the first corpus to obtain a plurality of second IDF values;
A comparison submodule, configured to determine a word to be a technical term when the difference between its first IDF value and its second IDF value is greater than or equal to the preset IDF value.
In a preferred embodiment of the present invention, the second determining module comprises:
A position information obtaining submodule, configured to obtain the position information of the technical terms in the first corpus, the position information including mutual information and left and right entropy information;
A first calculation submodule, configured to obtain new technical terms in the first corpus according to the mutual information and the left and right entropy information.
In a preferred embodiment of the present invention, the second determining module further comprises:
A second calculation submodule, configured to input the IDF values of the technical terms and the position information into a preset conditional random field model to obtain new technical terms in the first corpus.
The technical term recognition apparatus of this embodiment can perform the technical term recognition method shown in one embodiment of this application. The implementation principle is similar and is not repeated here.
Another embodiment of the present application provides an electronic device that includes a memory and a processor. At least one program is stored in the memory and, when executed by the processor, achieves the following relative to the prior art: a first corpus of the professional field corresponding to the technical terms and a second corpus of a non-professional field are obtained; then, based on the first corpus and the second corpus, words whose inverse document frequency (IDF) value is greater than a preset IDF value are obtained from the first corpus and determined to be technical terms. By comparing the corpus of the professional field with the corpus of the non-professional field, the words that occur frequently in the professional corpus are identified. Whereas the prior art identifies technical terms in text only through ordinary word segmentation, the present application identifies technical terms more completely through this comparison, which improves the recognition rate of technical terms and, in turn, the quality of natural language processing.
Further, new technical terms are obtained from the first corpus based on the IDF value and the location information of the technical terms. In this way, technical terms that occur infrequently but appear near frequently occurring technical terms can also be identified, which further improves the recognition rate of technical terms and the quality of natural language processing.
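The corpus comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the smoothed IDF formula, the direction of the difference (second-corpus IDF minus first-corpus IDF, so that words common in the professional corpus but rare elsewhere score high), and the names `idf` and `select_terms` are all hypothetical, since this passage does not specify them.

```python
import math


def idf(word, docs):
    """Smoothed inverse document frequency of `word` over tokenized `docs`."""
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + len(docs)) / (1 + df))


def select_terms(domain_docs, general_docs, threshold):
    """Return {word: score} for domain words whose IDF gap meets the threshold.

    Assumption: score = IDF in the general corpus - IDF in the domain corpus.
    """
    vocab = set().union(*(set(doc) for doc in domain_docs))
    terms = {}
    for word in vocab:
        score = idf(word, general_docs) - idf(word, domain_docs)
        if score >= threshold:
            terms[word] = score
    return terms
```

With this sign convention, a word that appears in every document of both corpora scores zero and is never selected for a positive threshold, which matches the intent of filtering out everyday vocabulary.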
An optional embodiment provides an electronic device. As shown in FIG. 3, the electronic device 3000 includes a processor 3001 and a memory 3003. The processor 3001 is connected to the memory 3003, for example through a bus 3002. Optionally, the electronic device 3000 may also include a transceiver 3004. It should be noted that, in practical applications, the transceiver 3004 is not limited to one, and the structure of the electronic device 3000 does not limit the embodiments of the present application.
The processor 3001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logic blocks, modules, and circuits described in connection with the present disclosure. The processor 3001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 3002 may include a path that transfers information between the above components. The bus 3002 may be a PCI bus, an EISA bus, or the like. The bus 3002 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 3, but this does not mean that there is only one bus or one type of bus.
The memory 3003 may be a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, or an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 3003 stores the application program code for executing the solution of the present application, and execution is controlled by the processor 3001. The processor 3001 executes the application program code stored in the memory 3003 to implement the content shown in any of the foregoing method embodiments.
The electronic device includes, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable media players), and in-vehicle terminals (for example, in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
Another embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the program runs on a computer, the computer can execute the corresponding content of the foregoing method embodiments. Compared with the prior art, a first corpus of the professional field corresponding to the technical terms and a second corpus of a non-professional field are obtained; then, based on the first corpus and the second corpus, words whose inverse document frequency (IDF) value is greater than a preset IDF value are obtained from the first corpus and determined to be technical terms. By comparing the corpus of the professional field with the corpus of the non-professional field, the words that occur frequently in the professional corpus are identified. Whereas the prior art identifies technical terms in text only through ordinary word segmentation, the present application identifies technical terms more completely through this comparison, which improves the recognition rate of technical terms and, in turn, the quality of natural language processing.
Further, new technical terms are obtained from the first corpus based on the IDF value and the location information of the technical terms. In this way, technical terms that occur infrequently but appear near frequently occurring technical terms can also be identified, which further improves the recognition rate of technical terms and the quality of natural language processing.
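An earlier embodiment feeds the IDF value and the location information into a preset conditional random field model. The source does not describe the feature layout or the label scheme, so the following Python sketch only shows one plausible way to assemble per-token feature dictionaries, a shape commonly accepted by CRF toolkits. The function name `token_features` and every feature key are hypothetical.

```python
def token_features(tokens, idf_values, mi_values, entropy_values):
    """Build one feature dict per token for a sequence labeler.

    idf_values:     {token: idf}
    mi_values:      {(left_token, right_token): mutual information}
    entropy_values: {token: (left_entropy, right_entropy)}
    Missing entries default to 0.0 (an assumption of this sketch).
    """
    features = []
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        left_h, right_h = entropy_values.get(tok, (0.0, 0.0))
        features.append({
            "token": tok,
            "idf": idf_values.get(tok, 0.0),
            "left_entropy": left_h,
            "right_entropy": right_h,
            # Mutual information with the neighboring tokens, if any.
            "mi_prev": mi_values.get((tokens[i - 1], tok), 0.0) if i > 0 else 0.0,
            "mi_next": mi_values.get((tok, tokens[i + 1]), 0.0) if i < last else 0.0,
        })
    return features
```

A trained model would then label each token, for example as inside or outside a technical term, and contiguous labeled tokens would be read off as new terms. That labeling step is not shown because the source does not specify it.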
The embodiments of the present invention further include:
A1. A method for recognizing technical terms, comprising:
Obtaining a first corpus of the professional field corresponding to the technical terms, and obtaining a second corpus of a non-professional field;
Based on the first corpus and the second corpus, obtaining from the first corpus words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and determining the words to be technical terms;
Based on location information of the technical terms, determining new technical terms from the first corpus.
A2. The method for recognizing technical terms according to A1, wherein obtaining the first corpus of the professional field corresponding to the technical terms comprises:
Obtaining first text information from webpages associated with the professional field, and using the first text information as the first corpus.
A3. The method for recognizing technical terms according to A1, wherein the step of obtaining the second corpus of the non-professional field comprises:
Obtaining second text information from webpages not associated with the professional field, and using the second text information as the second corpus.
A4. The method for recognizing technical terms according to A1, wherein the step of obtaining, based on the first corpus and the second corpus, words from the first corpus whose IDF value is greater than or equal to the preset IDF value and determining the words to be technical terms comprises:
Calculating the IDF of each word in the first corpus to obtain multiple first IDF values;
Calculating the IDF, in the second corpus, of each word in the first corpus to obtain multiple second IDF values;
When the difference between the first IDF value and the second IDF value of any word is greater than or equal to the preset IDF value, determining that the word is a technical term.
A5. The method for recognizing technical terms according to A1, wherein the step of determining new technical terms from the first corpus based on the location information of the technical terms comprises:
Obtaining location information of the technical terms in the first corpus, the location information including mutual information and left-right entropy information;
Obtaining new technical terms in the first corpus according to the mutual information and the left-right entropy information.
A6. The method for recognizing technical terms according to A1 or A5, wherein the step of determining new technical terms from the first corpus based on the location information of the technical terms further comprises:
Inputting the IDF value of the technical terms and the location information into a preset conditional random field model to obtain new technical terms in the first corpus.
B7. A device for recognizing technical terms, comprising:
An obtaining module, configured to obtain a first corpus of the professional field corresponding to the technical terms, and to obtain a second corpus of a non-professional field;
A first determining module, configured to obtain from the first corpus, based on the first corpus and the second corpus, words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and to determine the words to be technical terms;
A second determining module, configured to determine new technical terms from the first corpus based on location information of the technical terms.
B8. The device for recognizing technical terms according to B7, wherein the obtaining module is specifically configured to:
Obtain first text information from webpages associated with the professional field, and use the first text information as the first corpus.
B9. The device for recognizing technical terms according to B7, wherein the obtaining module is specifically configured to:
Obtain second text information from webpages not associated with the professional field, and use the second text information as the second corpus.
B10. The device for recognizing technical terms according to B7, wherein the first determining module comprises:
A first computation submodule, configured to calculate the IDF of each word in the first corpus to obtain multiple first IDF values;
A second computation submodule, configured to calculate the IDF, in the second corpus, of each word in the first corpus to obtain multiple second IDF values;
A comparison submodule, configured to determine that a word is a technical term when the difference between the first IDF value and the second IDF value of the word is greater than or equal to the preset IDF value.
B11. The device for recognizing technical terms according to B7, wherein the second determining module comprises:
A location information acquisition submodule, configured to obtain location information of the technical terms in the first corpus, the location information including mutual information and left-right entropy information;
A first computation submodule, configured to obtain new technical terms in the first corpus according to the mutual information and the left-right entropy information.
B12. The device for recognizing technical terms according to B7 or B11, wherein the second determining module further comprises:
A second computation submodule, configured to input the IDF value of the technical terms and the location information into a preset conditional random field model to obtain new technical terms in the first corpus.
C13. An electronic device, comprising:
A processor, a memory, and a bus;
The bus, configured to connect the processor and the memory;
The memory, configured to store operation instructions;
The processor, configured to execute, by calling the operation instructions, the method for recognizing technical terms according to any one of A1 to A6.
D14. A computer-readable storage medium, the computer storage medium being configured to store computer instructions which, when run on a computer, enable the computer to execute the method for recognizing technical terms according to any one of A1 to A6.
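Items A2, A3, B8, and B9 above build each corpus from the text of webpages. As a hedged illustration only, the following Python sketch extracts visible text from HTML strings that have already been downloaded, using the standard library `html.parser` module. How pages are selected as associated or not associated with the professional field is not specified in the source, and the names `page_text` and `build_corpus` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def page_text(html):
    """Return the visible text of one HTML page as a single string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_corpus(pages):
    """Turn a list of HTML strings into a list of plain-text documents."""
    return [page_text(page) for page in pages]
```

Calling `build_corpus` once on pages from the professional field and once on unrelated pages would yield the two text collections, which would still need word segmentation before any frequency statistics are computed.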
It should be understood that although the steps in the flowcharts of the drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that sequence. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for recognizing technical terms, characterized by comprising:
Obtaining a first corpus of the professional field corresponding to the technical terms, and obtaining a second corpus of a non-professional field;
Based on the first corpus and the second corpus, obtaining from the first corpus words whose inverse document frequency (IDF) value is greater than or equal to a preset IDF value, and determining the words to be technical terms;
Based on location information of the technical terms, determining new technical terms from the first corpus.
2. The method for recognizing technical terms according to claim 1, characterized in that obtaining the first corpus of the professional field corresponding to the technical terms comprises:
Obtaining first text information from webpages associated with the professional field, and using the first text information as the first corpus.
3. The method for recognizing technical terms according to claim 1, characterized in that the step of obtaining the second corpus of the non-professional field comprises:
Obtaining second text information from webpages not associated with the professional field, and using the second text information as the second corpus.
4. The method for recognizing technical terms according to claim 1, characterized in that the step of obtaining, based on the first corpus and the second corpus, words from the first corpus whose IDF value is greater than or equal to the preset IDF value and determining the words to be technical terms comprises:
Calculating the IDF of each word in the first corpus to obtain multiple first IDF values;
Calculating the IDF, in the second corpus, of each word in the first corpus to obtain multiple second IDF values;
When the difference between the first IDF value and the second IDF value of any word is greater than or equal to the preset IDF value, determining that the word is a technical term.
5. The method for recognizing technical terms according to claim 1, characterized in that the step of determining new technical terms from the first corpus based on the location information of the technical terms comprises:
Obtaining location information of the technical terms in the first corpus, the location information including mutual information and left-right entropy information;
Obtaining new technical terms in the first corpus according to the mutual information and the left-right entropy information.
6. The method for recognizing technical terms according to claim 1 or 5, characterized in that the step of determining new technical terms from the first corpus based on the location information of the technical terms further comprises:
Inputting the IDF value of the technical terms and the location information into a preset conditional random field model to obtain new technical terms in the first corpus.
7. A device for recognizing technical terms, characterized by comprising:
An obtaining module, configured to obtain a first corpus of the professional field corresponding to the technical terms, and to obtain a second corpus of a non-professional field;
A first determining module, configured to obtain from the first corpus, based on the first corpus and the second corpus, words whose IDF value is greater than or equal to a preset IDF value, and to determine the words to be technical terms;
A second determining module, configured to determine new technical terms from the first corpus based on location information of the technical terms.
8. The device for recognizing technical terms according to claim 7, characterized in that the obtaining module is specifically configured to:
Obtain first text information from webpages associated with the professional field, and use the first text information as the first corpus.
9. An electronic device, characterized by comprising:
A processor, a memory, and a bus;
The bus, configured to connect the processor and the memory;
The memory, configured to store operation instructions;
The processor, configured to execute, by calling the operation instructions, the method for recognizing technical terms according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that the computer storage medium is configured to store computer instructions which, when run on a computer, enable the computer to execute the method for recognizing technical terms according to any one of claims 1 to 6.
CN201910457246.4A 2019-05-29 2019-05-29 Method and device for identifying professional terms, electronic equipment and readable storage medium Active CN110175331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910457246.4A CN110175331B (en) 2019-05-29 2019-05-29 Method and device for identifying professional terms, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910457246.4A CN110175331B (en) 2019-05-29 2019-05-29 Method and device for identifying professional terms, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN110175331A true CN110175331A (en) 2019-08-27
CN110175331B CN110175331B (en) 2021-05-11

Family

ID=67695908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910457246.4A Active CN110175331B (en) 2019-05-29 2019-05-29 Method and device for identifying professional terms, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110175331B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950274A (en) * 2020-07-31 2020-11-17 中国工商银行股份有限公司 Chinese word segmentation method and device for linguistic data in professional field

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103995876A (en) * 2014-05-26 2014-08-20 上海大学 Text classification method based on chi square statistics and SMO algorithm
CN104572622A (en) * 2015-01-05 2015-04-29 语联网(武汉)信息技术有限公司 Term filtering method
US20160239482A1 (en) * 2015-02-13 2016-08-18 International Business Machines Corporation Identifying word-senses based on linguistic variations
CN106445907A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Domain lexicon generation method and apparatus
CN106776558A (en) * 2016-12-14 2017-05-31 北京工业大学 Merge the domain term recognition method of language ambience information

Also Published As

Publication number Publication date
CN110175331B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
US11281860B2 (en) Method, apparatus and device for recognizing text type
US20200250379A1 (en) Method and apparatus for textual semantic encoding
US11042542B2 (en) Method and apparatus for providing aggregate result of question-and-answer information
US7725466B2 (en) High accuracy document information-element vector encoding server
JP2017010514A (en) Search engine and method for implementing the same
Wan et al. Composite feature extraction and selection for text classification
US9514113B1 (en) Methods for automatic footnote generation
Han et al. State complexity of basic operations on suffix-free regular languages
CN108171576B (en) Order processing method and device, electronic equipment and computer readable storage medium
CN104866478A (en) Detection recognition method and device of malicious text
CN110674635B (en) Method and device for dividing text paragraphs
CN110175331A (en) Recognition methods, device, electronic equipment and the readable storage medium storing program for executing of technical term
CN110390011B (en) Data classification method and device
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
Kim et al. Toward privacy-preserving text embedding similarity with homomorphic encryption
US20090164430A1 (en) System and method for acquiring contact information
Lee et al. GLEN: Generative retrieval via lexical index learning
CN113761565B (en) Data desensitization method and device
US20160335053A1 (en) Generating compact representations of high-dimensional data
CN110738042B (en) Error correction dictionary creation method, device, terminal and computer storage medium
CN112417874A (en) Named entity recognition method and device, storage medium and electronic device
CN109740130B (en) Method and device for generating file
EP3195146A1 (en) Three-dimensional latent semantic analysis
CN107832341B (en) AGNSS user duplicate removal statistical method
CN114742058A (en) Named entity extraction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200806

Address after: 35th Floor, Tencent Building, Kejizhongyi Road, Hi-tech Park, Nanshan District, Shenzhen, Guangdong 518057

Applicant after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Address before: 100029, Beijing, Chaoyang District new East Street, building No. 2, -3 to 25, 101, 8, 804 rooms

Applicant before: Tricorn (Beijing) Technology Co.,Ltd.

GR01 Patent grant