CN111274361A - Industry new word discovery method and device, storage medium and electronic equipment - Google Patents

Publication number
CN111274361A
Authority
CN
China
Prior art keywords
words
word
industry
word segmentation
text
Prior art date
Legal status
Pending
Application number
CN202010068920.2A
Other languages
Chinese (zh)
Inventor
李亮
蔺文萃
罗利利
李文
Current Assignee
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202010068920.2A
Publication of CN111274361A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application provides an industry new word discovery method and device, a storage medium, and electronic equipment. An industry word bank and a stop-word bank are first loaded into a word segmentation model. A text to be retrieved is then fed to the word segmentation model to obtain a word segmentation result, and the word segmentation result is analyzed with a left-right mutual information entropy algorithm to obtain a result set. Because the word segmentation model screens words against the industry word bank and the stop-word bank, the target words in the result set are vocabulary that is related to the industry and has business significance. This avoids producing meaningless or unrelated new words or phrases, removes interference, ensures that the target words are accurate and valid for the industry, and makes further analysis of the target words by staff easier.

Description

Industry new word discovery method and device, storage medium and electronic equipment
Technical Field
The application relates to the field of natural language processing, in particular to a method and a device for discovering new words in industry, a storage medium and electronic equipment.
Background
With the rapid development of the national economy, criminal means and methods have changed markedly, and new types of fraud cases, typified by telecommunication fraud, are increasing year by year. With the development of internet technology and network culture, case characteristics have also become more diverse and trend-driven. In particular, new internet expressions and popular vocabulary often appear in police incident reports and case descriptions. Mining and discovering this new vocabulary helps in understanding how recent cases occur and what characterizes them, which in turn helps with case prevention and detection. It is therefore critical to find, in recent police incident texts or brief case texts, new words or phrases that represent typical features of the cases. Accurately mining such new words or phrases makes it possible to capture the semantic features and patterns of recent police incidents and cases, so that preventive measures for related cases can be taken in time and more clues are available for case detection. The prior art, however, often yields new words or phrases that are meaningless or irrelevant, which do not help with a case and may even interfere with its assessment.
Disclosure of Invention
The present application aims to provide an industry new word discovery method, an industry new word discovery device, a storage medium and an electronic device, so as to solve the above problems.
In order to achieve the above purpose, the embodiments of the present application employ the following technical solutions:
In a first aspect, an embodiment of the present application provides an industry new word discovery method, where the method includes:
loading an industry word bank and a stop-word bank into a word segmentation model, wherein the industry word bank comprises basic words and industry words, and the stop-word bank comprises stop words, idiomatic words and words without business meaning;
taking a text to be retrieved as an input of the word segmentation model to obtain a word segmentation result, wherein the word segmentation result comprises words and/or phrases in the text to be retrieved;
analyzing the word segmentation result according to a left-right mutual information entropy algorithm to obtain a result set, wherein the result set comprises target words in the text to be retrieved, and the information entropy of the target words is larger than or equal to an information entropy threshold.
In a second aspect, an embodiment of the present application provides an industry new word discovery apparatus, including:
an information loading unit, configured to load an industry word bank and a stop-word bank into a word segmentation model, wherein the industry word bank comprises basic words and industry words, and the stop-word bank comprises stop words, idiomatic words and words without business meaning;
a processing unit, configured to take a text to be retrieved as the input of the word segmentation model to obtain a word segmentation result, wherein the word segmentation result comprises words and/or phrases in the text to be retrieved; and further configured to analyze the word segmentation result according to a left-right mutual information entropy algorithm to obtain a result set, wherein the result set comprises target words in the text to be retrieved, and the information entropy of the target words is greater than or equal to an information entropy threshold.
In a third aspect, the present application provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method described above.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor and memory for storing one or more programs; the one or more programs, when executed by the processor, implement the methods described above.
Compared with the prior art, the industry new word discovery method and device, storage medium, and electronic equipment provided by the embodiments of the present application have the following beneficial effects. An industry word bank and a stop-word bank are first loaded into a word segmentation model. A text to be retrieved is then fed to the word segmentation model to obtain a word segmentation result, and the word segmentation result is analyzed with a left-right mutual information entropy algorithm to obtain a result set. Because the word segmentation model screens words against the industry word bank and the stop-word bank, the target words in the result set are vocabulary that is related to the industry and has business significance. This avoids producing meaningless or unrelated new words or phrases, removes interference, ensures that the target words are accurate and valid for the industry, and makes further analysis of the target words by staff easier.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and it will be apparent to those skilled in the art that other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
Fig. 2 is a schematic flowchart of an industry new word discovery method according to an embodiment of the present application;
Fig. 3 is another schematic flowchart of an industry new word discovery method according to an embodiment of the present application;
Fig. 4 is another schematic flowchart of an industry new word discovery method according to an embodiment of the present application;
Fig. 5 is another schematic flowchart of an industry new word discovery method according to an embodiment of the present application;
Fig. 6 is another schematic flowchart of an industry new word discovery method according to an embodiment of the present application;
Fig. 7 is a schematic unit diagram of an industry new word discovery apparatus according to an embodiment of the present application.
In the figure: 10-a processor; 11-a memory; 12-a bus; 13-a communication interface; 201-an information loading unit; 202-processing unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the description of the present application, it should be noted that the terms "upper", "lower", "inner", "outer", and the like indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings or orientations or positional relationships conventionally found in use of products of the application, and are used only for convenience in describing the present application and for simplification of description, but do not indicate or imply that the referred devices or elements must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present application.
In the description of the present application, it is also to be noted that, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" are to be interpreted broadly, e.g., as being either fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
An embodiment of the present application provides an electronic device, which may be a mobile phone, a computer or another intelligent terminal. Fig. 1 is a schematic structural diagram of the electronic device. The electronic device comprises a processor 10, a memory 11 and a bus 12. The processor 10 and the memory 11 are connected by the bus 12, and the processor 10 is configured to execute an executable module, such as a computer program, stored in the memory 11.
The processor 10 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the industry new word discovery method may be completed by integrated logic circuits of hardware in the processor 10 or by instructions in the form of software. The processor 10 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The Memory 11 may comprise a high-speed Random Access Memory (RAM) and may further comprise a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The bus 12 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Only one bi-directional arrow is shown in Fig. 1, but this does not indicate only one bus 12 or one type of bus 12.
The memory 11 is used for storing programs, such as programs corresponding to the industry new word discovery device. The industry new word discovery apparatus includes at least one software function module which may be stored in the memory 11 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device. The processor 10 executes the program to implement the industry new word discovery method after receiving the execution instruction.
Optionally, the electronic device provided by the embodiment of the present application further includes a communication interface 13. The communication interface 13 is connected to the processor 10 via the bus 12. The electronic device may receive text corpora sent by other terminals through the communication interface 13.
It should be understood that the structure shown in fig. 1 is merely a structural schematic diagram of a portion of an electronic device, which may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
The industry new word discovery method provided by the embodiments of the present application can be applied to, but is not limited to, the electronic device shown in Fig. 1. Referring to Fig. 2, the method includes:
S103, loading the industry word bank and the stop-word bank into the word segmentation model.
The industry word bank comprises basic words and industry words, and the stop-word bank comprises stop words, idiomatic words and words without business meaning.
Specifically, the basic vocabulary may not include industry vocabulary. For example, if word A does not belong to the basic vocabulary, word A is a new word relative to the basic vocabulary; however, if word A belongs to the vocabulary of the public security industry, it is not a new word for the public security industry. In addition, some words are stop words, idiomatic words, or words without business meaning; they may be new words relative to a particular industry but have no significance for that industry. Therefore, both the industry word bank and the stop-word bank need to be loaded into the word segmentation model so that meaningless words do not affect the result.
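The role of the two word banks can be illustrated with a minimal Python sketch. The function name `is_candidate_new_word` and the sample words are hypothetical and chosen only for illustration; the application does not specify a data structure for the word banks, so plain sets are assumed here.

```python
def is_candidate_new_word(word, base_words, industry_words, stop_bank):
    """Return True if `word` could be a new word for the industry.

    Assumption: word banks are plain Python sets of strings.
    """
    # The industry word bank holds both basic words and industry words.
    industry_bank = base_words | industry_words
    # A word already known to the industry is not new, and a stop word,
    # idiomatic word or word without business meaning is not useful.
    return word not in industry_bank and word not in stop_bank


# Hypothetical sample data.
base_words = {"person", "money"}
industry_words = {"suspect"}
stop_bank = {"the", "um"}
```

In this sketch "suspect" is new relative to the basic vocabulary but not relative to the industry, so it is rejected, which mirrors the word A example above.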
S104, taking the text to be retrieved as the input of the word segmentation model to obtain a word segmentation result.
The word segmentation result comprises words and/or phrases in the text to be retrieved.
Specifically, the word segmentation model analyzes the text to be retrieved according to the vocabulary in the industry word bank and the stop-word bank, optionally with semantic analysis, to obtain the word segmentation result. For example, the text to be retrieved may include "the person is dying", and the word segmentation result may include "the person is" and "is dying". This example is for ease of understanding only and does not limit the specific word segmentation.
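As a rough illustration of lexicon-driven segmentation, the sketch below uses forward maximum matching. This is a simple stand-in technique, not the word segmentation model of the application and not HanLP's algorithm; the function name `segment`, the `max_len` parameter, the choice to drop stop-bank entries during segmentation, and the sample Chinese words are all assumptions.

```python
def segment(text, lexicon, stop_bank, max_len=4):
    """Forward maximum matching over characters.

    At each position the longest lexicon entry (up to max_len characters)
    is taken; an unknown character becomes a single-character token.
    Tokens found in the stop-word bank are dropped (an assumption).
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in lexicon or j == i + 1:
                break
        if piece not in stop_bank:
            tokens.append(piece)
        i = j
    return tokens


# Hypothetical word banks: "WeChat", "payment", "transfer"; stop word "de".
lexicon = {"微信", "支付", "转账"}
stop_bank = {"的"}
tokens = segment("微信支付的转账", lexicon, stop_bank)
```

A production segmenter would normally also use statistics or a trained model to resolve ambiguity; the sketch only shows how the loaded word banks can steer the split.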
S107, analyzing the word segmentation result according to the left-right mutual information entropy algorithm to obtain a result set.
The result set comprises target words in the text to be retrieved, and the information entropy of each target word is greater than or equal to the information entropy threshold. Optionally, the information entropy threshold is 0.2. According to a large number of experiments summarized by the inventors, with an information entropy threshold of 0.2 the target words in the result set are new words with high accuracy, and the probability of missing a new word is low.
Continuing the above example, assuming that "dying" belongs to neither the industry word bank nor the stop-word bank, "dying" is a target word in the result set.
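The entropy-based screening in S107 can be sketched as follows. The application does not give the exact formula of its left-right mutual information entropy algorithm, so this sketch makes several assumptions: it computes only the left and right neighbor (branch) entropy with the natural logarithm, scores a candidate by the smaller of the two, treats runs of one or two adjacent tokens as candidates, and omits any mutual information term. All function names and the sample sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(neighbors):
    """Shannon entropy (natural log) of a Counter of neighboring tokens."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in neighbors.values())


def branch_entropy_scores(sentences, max_n=2):
    """Score each candidate (a tuple of 1..max_n adjacent tokens) by the
    smaller of its left-neighbor and right-neighbor entropies."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                cand = tuple(tokens[i:i + n])
                if i > 0:
                    left[cand][tokens[i - 1]] += 1
                if i + n < len(tokens):
                    right[cand][tokens[i + n]] += 1
    candidates = set(left) | set(right)
    return {c: min(entropy(left[c]), entropy(right[c])) for c in candidates}


def discover(sentences, known_words, threshold=0.2):
    """Keep candidates at or above the threshold that are not already known."""
    scores = branch_entropy_scores(sentences)
    return {c for c, s in scores.items()
            if s >= threshold and " ".join(c) not in known_words}


# Hypothetical segmented sentences.
sentences = [
    ["victim", "joined", "pig", "butchering", "scam"],
    ["suspect", "ran", "pig", "butchering", "scheme"],
    ["police", "warned", "pig", "butchering", "fraud"],
]
```

In this data "pig butchering" is preceded and followed by three different tokens, so both entropies are high, while "pig" alone is always followed by "butchering" and therefore scores zero. A candidate that varies freely on both sides behaves like a self-contained word, which is the intuition behind thresholding on entropy.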
To sum up, in the industry new word discovery method provided in the embodiments of the present application, an industry word bank and a stop-word bank are first loaded into a word segmentation model. A text to be retrieved is then fed to the word segmentation model to obtain a word segmentation result, and the word segmentation result is analyzed with a left-right mutual information entropy algorithm to obtain a result set. Because the word segmentation model screens words against the industry word bank and the stop-word bank, the target words in the result set are vocabulary that is related to the industry and has business significance. This avoids producing meaningless or unrelated new words or phrases, removes interference, ensures that the target words are accurate and valid for the industry, and makes further analysis of the target words by staff easier.
For the word segmentation model, the embodiments of the present application also provide a possible implementation: the outputs of HanLP's built-in NLP word segmentation model and Standard word segmentation model are fused to obtain the word segmentation result.
On the basis of Fig. 2, regarding how to ensure that new words are associated with the industry, the embodiments of the present application further provide a possible implementation. Referring to Fig. 3, the industry new word discovery method further includes:
S106, screening the word segmentation result so that the frequency of each word or phrase in the word segmentation result is greater than or equal to a preset frequency.
Specifically, in one possible implementation, the text to be retrieved may contain some incidental vocabulary that is unrelated to the particular industry. To eliminate interference from such incidental words, the word segmentation result needs to be screened. A word or phrase whose frequency is greater than or equal to the preset frequency is taken not to occur by chance but to have a certain relevance to the text to be retrieved.
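The frequency screening in S106 amounts to counting tokens and keeping those at or above the preset frequency. A minimal sketch, assuming the word segmentation result is a flat list of tokens; the function name `filter_by_frequency` and the sample tokens are hypothetical.

```python
from collections import Counter


def filter_by_frequency(tokens, min_count):
    """Return the distinct tokens that occur at least `min_count` times,
    in order of first appearance."""
    counts = Counter(tokens)
    return [word for word in counts if counts[word] >= min_count]


# Hypothetical word segmentation result.
tokens = ["scam", "scam", "link", "scam", "link", "oops"]
frequent = filter_by_frequency(tokens, min_count=2)
```

Here "oops" appears only once and is treated as incidental. The application does not state a value for the preset frequency, so `min_count` is left to the caller.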
On the basis of Fig. 2, in order to satisfy time and location conditions, the embodiments of the present application further provide a possible implementation. Referring to Fig. 4, the industry new word discovery method further includes:
S105, taking the word segmentation result as the input of a first filtering model to obtain a filtered word segmentation result.
The first filtering model is used for filtering out words or phrases that do not meet the time condition, the location condition, the part-of-speech condition and the scene condition.
Specifically, the required range of new words differs across scenes and times. To keep words that fail the time, location, part-of-speech and scene conditions out of the word segmentation result, the word segmentation result is filtered through the first filtering model, and only vocabulary with actual business meaning is retained as the final word segmentation result.
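One way to picture the first filtering model is as a list of predicates that every candidate must pass. The sketch below is only an illustration: the application does not describe how the first filtering model is built or what metadata each word carries, so the `Candidate` fields, the part-of-speech tags, and the concrete conditions are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    word: str
    pos: str        # part-of-speech tag, e.g. "n" for noun (tag set assumed)
    seen_on: date   # date of the text in which the word appeared
    place: str
    scene: str


Condition = Callable[[Candidate], bool]


def first_filter(candidates: Iterable[Candidate],
                 conditions: List[Condition]) -> List[Candidate]:
    """Keep only the candidates that satisfy every condition."""
    return [c for c in candidates if all(cond(c) for cond in conditions)]


# Hypothetical time, location, part-of-speech and scene conditions.
conditions: List[Condition] = [
    lambda c: c.seen_on >= date(2020, 1, 1),
    lambda c: c.place == "Beijing",
    lambda c: c.pos in {"n", "v"},
    lambda c: c.scene == "telecom fraud",
]
```

Keeping each condition as a separate predicate makes it easy to swap the time window or scene without touching the filtering loop.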
On the basis of Fig. 2, in order to remove duplicate words from the result set, the embodiments of the present application further provide a possible implementation. Referring to Fig. 5, the industry new word discovery method further includes:
S108, taking the result set as the input of a second filtering model to obtain a filtered result set.
The second filtering model is used for filtering out repeated words or words that have an inclusion relation.
For example, the result set may contain both "WeChat" and "WeChat Payment". Because "WeChat Payment" contains "WeChat", such repeated words or words with an inclusion relation can be filtered out through the second filtering model.
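The second filtering step can be sketched with plain string operations. The application does not say which word of an inclusion pair is kept; the sketch below assumes the longer, containing word is kept and the contained word is dropped. The function name `second_filter` is hypothetical.

```python
def second_filter(words):
    """Remove exact duplicates, then drop any word that is contained in
    another, longer word of the result set (assumed policy)."""
    # dict.fromkeys removes duplicates while preserving order.
    unique = list(dict.fromkeys(words))
    return [w for w in unique
            if not any(w != other and w in other for other in unique)]


result_set = ["WeChat", "WeChat Payment", "WeChat", "red envelope"]
filtered = second_filter(result_set)
```

Under the opposite policy (keep the shorter word), the condition would test `other in w` instead.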
On the basis of Fig. 2, regarding how to obtain the industry word bank and the stop-word bank, the embodiments of the present application further provide a possible implementation. Referring to Fig. 6, the industry new word discovery method further includes:
S101, generating the industry word bank according to an industry text corpus and a public basic text corpus.
Specifically, the industry text corpus may be a public security industry corpus, and the basic text corpus may be Wikipedia, Baidu Baike, articles and literature, or other general common knowledge.
S102, generating the stop-word bank according to an idiomatic word text corpus, a text corpus of words without business meaning, and a stop word text corpus.
Optionally, the word banks, including the industry word bank and the stop-word bank, are obtained by training with conventional methods such as word2vec and GloVe.
In a possible implementation, S102 may be performed before S101; the execution order of S101 and S102 is not limited here.
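The corpus-to-word-bank mapping of S101 and S102 can be sketched as below. Note that this sketch replaces the word2vec or GloVe training mentioned above with plain vocabulary collection from already segmented corpora, purely to show which corpora feed which word bank; the function names, the `min_count` parameter and the sample corpora are hypothetical.

```python
from collections import Counter
from itertools import chain


def vocabulary(corpus, min_count=1):
    """Collect the words of a segmented corpus (a list of token lists)
    that occur at least `min_count` times."""
    counts = Counter(chain.from_iterable(corpus))
    return {word for word, n in counts.items() if n >= min_count}


def build_word_banks(industry_corpus, base_corpus,
                     idiom_corpus, no_meaning_corpus, stop_corpus):
    # S101: industry word bank from the industry and basic corpora.
    industry_bank = vocabulary(industry_corpus) | vocabulary(base_corpus)
    # S102: stop-word bank from the three remaining corpora.
    stop_bank = (vocabulary(idiom_corpus)
                 | vocabulary(no_meaning_corpus)
                 | vocabulary(stop_corpus))
    return industry_bank, stop_bank


# Hypothetical, already segmented corpora.
industry_bank, stop_bank = build_word_banks(
    industry_corpus=[["suspect", "arrested"]],
    base_corpus=[["person", "money"]],
    idiom_corpus=[["by", "the", "way"]],
    no_meaning_corpus=[["um"]],
    stop_corpus=[["the", "of"]],
)
```

The two calls are independent, which is consistent with the statement that S101 and S102 may run in either order.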
Referring to Fig. 7, which shows an industry new word discovery apparatus according to an embodiment of the present application. Optionally, the industry new word discovery apparatus is applied to the electronic device described above.
The industry new word discovery apparatus comprises an information loading unit 201 and a processing unit 202.
The information loading unit 201 is used for loading an industry word bank and a stop-word bank into the word segmentation model, wherein the industry word bank comprises basic words and industry words, and the stop-word bank comprises stop words, idiomatic words and words without business meaning. Specifically, the information loading unit 201 may execute S103 described above.
The processing unit 202 is configured to use a text to be retrieved as an input of a word segmentation model to obtain a word segmentation result, where the word segmentation result includes a word and/or a phrase in the text to be retrieved; and the word segmentation result is analyzed according to a left-right mutual information entropy algorithm to obtain a result set, wherein the result set comprises target words in the text to be retrieved, and the information entropy of the target words is greater than or equal to an information entropy threshold. Specifically, the processing unit 202 may execute S104 and S107 described above.
In a possible implementation manner, the processing unit 202 is further configured to filter the segmentation result so that the number of occurrences of a word or a phrase in the segmentation result is greater than or equal to a preset frequency. Specifically, the processing unit 202 may execute S106 described above.
In a possible implementation manner, the processing unit 202 is further configured to use the segmentation result as an input of a first filtering model to obtain a filtered segmentation result, wherein the first filtering model is used for filtering words or phrases that do not satisfy the time condition, the location condition, the part-of-speech condition, and the scene condition. Specifically, the processing unit 202 may execute S105 described above.
It should be noted that the apparatus for discovering new terms in industry provided in this embodiment may execute the method flows shown in the above method flow embodiments to achieve corresponding technical effects. For the sake of brevity, the corresponding contents in the above embodiments may be referred to where not mentioned in this embodiment.
The embodiment of the invention also provides a storage medium, wherein the storage medium stores computer instructions and programs, and the computer instructions and the programs execute the industry new word discovery method of the embodiment when being read and run. The storage medium may include memory, flash memory, registers, or a combination thereof, etc.
The following provides an electronic device, which may be a mobile phone, a computer, or another intelligent terminal. As shown in Fig. 1, the electronic device may implement the above-mentioned industry new word discovery method. Specifically, the electronic device includes a processor 10, a memory 11 and a bus 12. The processor 10 may be a CPU. The memory 11 is used for storing one or more programs, and when the one or more programs are executed by the processor 10, the industry new word discovery method of the above-described embodiment is performed.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portion thereof that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. An industry new word discovery method, comprising:
loading an industry word bank and a disused word bank into a word segmentation model, wherein the industry word bank comprises basic words and industry words, and the disused word bank comprises disused words, habitual words and words without business meaning;
taking a text to be retrieved as an input of the word segmentation model to obtain a word segmentation result, wherein the word segmentation result comprises words and/or phrases in the text to be retrieved; and
analyzing the word segmentation result according to a left-right mutual information entropy algorithm to obtain a result set, wherein the result set comprises target words in the text to be retrieved, and the information entropy of each target word is greater than or equal to an information entropy threshold.
2. The industry new word discovery method of claim 1, wherein prior to analyzing the word segmentation result according to the left-right mutual information entropy algorithm, the method further comprises:
screening the word segmentation result so that the frequency of each word or phrase in the word segmentation result is greater than or equal to a preset frequency.
3. The industry new word discovery method of claim 1, wherein after taking the text to be retrieved as an input of the word segmentation model to obtain the word segmentation result, the method further comprises:
taking the word segmentation result as an input of a first filtering model to obtain a filtered word segmentation result, wherein the first filtering model is used for filtering out words or phrases that do not meet a time condition, a place condition, a part-of-speech condition and a scene condition.
4. The industry new word discovery method of claim 1, wherein after obtaining the result set, the method further comprises:
taking the result set as an input of a second filtering model to obtain a filtered result set, wherein the second filtering model is used for filtering out repeated words or words having inclusion relations.
5. The industry new word discovery method of claim 1, wherein prior to loading the industry word bank and the disused word bank into the word segmentation model, the method further comprises:
generating the industry word bank according to an industry text corpus and a public basic text corpus; and
generating the disused word bank according to a habitual word text corpus, a text corpus of words without business meaning, and a disused word text corpus.
6. An industry new word discovery apparatus, comprising:
an information loading unit, configured to load an industry word bank and a disused word bank into a word segmentation model, wherein the industry word bank comprises basic words and industry words, and the disused word bank comprises disused words, habitual words and words without business meaning; and
a processing unit, configured to take a text to be retrieved as an input of the word segmentation model to obtain a word segmentation result, wherein the word segmentation result comprises words and/or phrases in the text to be retrieved, the processing unit being further configured to analyze the word segmentation result according to a left-right mutual information entropy algorithm to obtain a result set, wherein the result set comprises target words in the text to be retrieved, and the information entropy of each target word is greater than or equal to an information entropy threshold.
7. The industry new word discovery apparatus of claim 6, wherein the processing unit is further configured to screen the word segmentation result so that the number of occurrences of each word or phrase in the word segmentation result is greater than or equal to a preset frequency.
8. The industry new word discovery apparatus of claim 6, wherein the processing unit is further configured to take the word segmentation result as an input of a first filtering model to obtain a filtered word segmentation result, wherein the first filtering model is used for filtering out words or phrases that do not meet a time condition, a place condition, a part-of-speech condition and a scene condition.
9. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
10. An electronic device, comprising: a processor and memory for storing one or more programs; the one or more programs, when executed by the processor, implement the method of any of claims 1-5.
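
As an illustration of the analysis steps recited in claims 1, 2 and 4, the following Python sketch counts token frequencies in an already segmented text, computes the left and right neighbor entropy of each token, keeps tokens whose smaller entropy reaches a threshold, and then drops duplicates and contained words. It is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the `known_words` parameter, the default thresholds, the use of the smaller of the two entropies as the score, and the choice to keep the longer word of an inclusion pair are all hypothetical, and the mutual-information part of the left-right mutual information entropy algorithm is not modeled.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (in nats) of a Counter of neighbor frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def discover_new_words(segmented_sentences, known_words,
                       min_freq=2, entropy_threshold=0.5):
    """Return unknown tokens that are frequent and have varied neighbors.

    segmented_sentences: list of token lists (the word segmentation result).
    known_words: tokens already in a word bank (assumed exclusion rule).
    """
    freq = Counter()
    left = defaultdict(Counter)   # token -> counts of tokens seen on its left
    right = defaultdict(Counter)  # token -> counts of tokens seen on its right
    for tokens in segmented_sentences:
        for i, tok in enumerate(tokens):
            freq[tok] += 1
            if i > 0:
                left[tok][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tok][tokens[i + 1]] += 1

    results = []
    for tok, n in freq.items():
        # Frequency screening (claim 2) and skipping known words (assumption).
        if n < min_freq or tok in known_words:
            continue
        # Assumed scoring rule: the smaller of the left and right entropies.
        score = min(entropy(left[tok]), entropy(right[tok]))
        if score >= entropy_threshold:
            results.append(tok)
    return results


def drop_contained(words):
    """Remove duplicates and words that are substrings of a longer kept word."""
    kept = []
    for w in sorted(set(words), key=len, reverse=True):
        if not any(w in longer for longer in kept):
            kept.append(w)
    return kept
```

Here `segmented_sentences` stands for the word segmentation result, one token list per sentence. `known_words` would typically hold the entries of the industry word bank so that only unseen words are reported, and words from the disused word bank (what NLP toolkits usually call a stop-word list) are assumed to have been removed during segmentation.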
CN202010068920.2A 2020-01-21 2020-01-21 Industry new word discovery method and device, storage medium and electronic equipment Pending CN111274361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010068920.2A CN111274361A (en) 2020-01-21 2020-01-21 Industry new word discovery method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010068920.2A CN111274361A (en) 2020-01-21 2020-01-21 Industry new word discovery method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111274361A (en) 2020-06-12

Family

ID=71003275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010068920.2A Pending CN111274361A (en) 2020-01-21 2020-01-21 Industry new word discovery method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111274361A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832299A (en) * 2020-07-17 2020-10-27 成都信息工程大学 Chinese word segmentation system
CN112183089A (en) * 2020-09-25 2021-01-05 中国建设银行股份有限公司 Corpus analysis method and device, electronic equipment and storage medium
CN113177410A (en) * 2021-05-07 2021-07-27 多点(深圳)数字科技有限公司 Text word segmentation method and device, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
US20160283583A1 (en) * 2014-03-14 2016-09-29 Tencent Technology (Shenzhen) Company Limited Method, apparatus, and storage medium for text information processing
CN108038234A (en) * 2017-12-26 2018-05-15 众安信息技术服务有限公司 A kind of question sentence template automatic generation method and device
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 A kind of Chinese new word discovery method based on novel degree
CN108595433A (en) * 2018-05-02 2018-09-28 北京中电普华信息技术有限公司 A kind of new word discovery method and device
CN109408818A (en) * 2018-10-12 2019-03-01 平安科技(深圳)有限公司 New word identification method, device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832299A (en) * 2020-07-17 2020-10-27 成都信息工程大学 Chinese word segmentation system
CN112183089A (en) * 2020-09-25 2021-01-05 中国建设银行股份有限公司 Corpus analysis method and device, electronic equipment and storage medium
CN113177410A (en) * 2021-05-07 2021-07-27 多点(深圳)数字科技有限公司 Text word segmentation method and device, storage medium and electronic equipment
CN113177410B (en) * 2021-05-07 2023-04-25 多点(深圳)数字科技有限公司 Text word segmentation method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN108763952B (en) Data classification method and device and electronic equipment
CN109544166B (en) Risk identification method and risk identification device
CN111274361A (en) Industry new word discovery method and device, storage medium and electronic equipment
CN109918678B (en) Method and device for identifying field meaning
CN107239694A (en) A kind of Android application permissions inference method and device based on user comment
CN111767713A (en) Keyword extraction method and device, electronic equipment and storage medium
CN111353488A (en) Method, device and equipment for identifying risks in code image
CN110263817B (en) Risk grade classification method and device based on user account
KR102166102B1 (en) Device and storage medium for protecting privacy information
CN108804563B (en) Data labeling method, device and equipment
CN110443291B (en) Model training method, device and equipment
CN116860963A (en) Text classification method, equipment and storage medium
CN111881288A (en) Method and device for judging authenticity of record information, storage medium and electronic equipment
CN115905885A (en) Data identification method, device, storage medium and program product
CN108763209B (en) Method, device and equipment for feature extraction and risk identification
CN110705258A (en) Text entity identification method and device
CN112597287B (en) Statement processing method, statement processing device and intelligent equipment
CN115544214A (en) Event processing method and device and computer readable storage medium
CN113220949B (en) Construction method and device of private data identification system
CN112541357B (en) Entity identification method and device and intelligent equipment
CN111209747B (en) Word vector file loading method and device, storage medium and electronic equipment
CN114706766A (en) False alarm elimination method and device of security function, electronic equipment and storage medium
CN109033070B (en) Data processing method, server and computer readable medium
Yeh et al. A fraud detection system for real-time messaging communication on Android Facebook messenger
CN108536855A (en) Mobile communications device evidence collecting method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200612
