CN105260482A - Network new word discovery device and method based on crowdsourcing technology - Google Patents

Network new word discovery device and method based on crowdsourcing technology

Info

Publication number
CN105260482A
Authority
CN
China
Prior art keywords
neologisms
new word
module
network
word discovery
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510785868.1A
Other languages
Chinese (zh)
Inventor
梁颖红
徐楠
杨荣根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinling Institute of Technology
Original Assignee
Jinling Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-11-16
Filing date
2015-11-16
Publication date
2016-01-20
Application filed by Jinling Institute of Technology
Priority to CN201510785868.1A
Publication of CN105260482A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a network new word discovery device based on crowdsourcing technology. The device comprises a new word acquisition module, a new word screening module, a learning and training module, a network new word corpus module, a new word discovery module and a new word output module. The new word acquisition module reaches the general public through a game client, through which the public takes part in labeling network new words. The new word screening module screens the network new words labeled by the public and removes repeated words and sensitive words. The learning and training module takes the network new words as seeds and inputs them into a support vector machine for incremental learning and training. The network new word corpus module stores and records the network new words and enters them into a network new word corpus. The new word discovery module classifies the new words in the corpus by category and discovers network new words according to a preset algorithm. The new word output module collates, stores and outputs the discovered network new words. The device shortens the new word discovery cycle and reduces labor input, thereby lowering the cost of new word discovery.

Description

Network new word discovery device and method based on crowdsourcing technology
Technical field
The present invention relates to the technical field of network new word discovery, and in particular to a network new word discovery device and method based on crowdsourcing technology.
Background art
With the spread of the Internet and the development of technology, people rely more and more on the network, and online chat and online shopping are growing by the day. A large number of network new words have appeared as a result, so discovering network new words has become an important research topic in natural language research, and the accuracy of network new word recognition affects related research that uses web text as its corpus resource.
However, the network develops rapidly and new words emerge continuously, so having experts search for and label them costs a great deal of manpower and material resources. Although some systems, methods and devices for new word discovery already exist, none is fully satisfactory: they need dedicated staff to collect, edit and process the discovered new words, so the cost is high and the cycle is relatively long.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a network new word discovery device and method based on crowdsourcing technology. Its purpose is to shorten the new word discovery cycle and reduce labor input, thereby lowering the cost of new word discovery.
The technical solution adopted by the invention is a network new word discovery device based on crowdsourcing technology, comprising, connected in sequence, a new word acquisition module, a new word screening module, a learning and training module, a network new word corpus, a new word discovery module and a new word output module, wherein:
The new word acquisition module reaches the general public through a game client, and the public takes part in labeling network new words through the game client;
The new word screening module screens the network new words labeled by the public and removes repeated words and sensitive words;
The learning and training module takes the network new words as seeds and inputs them into a support vector machine classifier for incremental learning and training;
The network new word corpus module stores and records the network new words and enters them into the network new word corpus;
The new word discovery module classifies the new words in the network new word corpus by category and discovers network new words according to a predetermined algorithm;
The new word output module collates and stores the discovered network new words, so that the new words of each category in the new word discovery module form an independent set awaiting invocation, and outputs them.
The device and method use a game client as the collection end of the new word acquisition module, which saves considerable labor cost.
The automatic new word screening module and the learning and training module filter and pre-process the new words submitted by the public, ensuring that the new words are both novel and legitimate.
New words that meet these requirements are entered into the network new word corpus module for storage and recording, and are then collated and classified by the new word discovery module and the new word output module to form an independent product with good market potential.
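As a rough illustration of the screening module, the sketch below removes repeated and sensitive words from a batch of submitted labels. The function name `screen_new_words`, the `sensitive_words` set and the `known_words` dictionary check are assumptions made for the example; the text does not say how the sensitive-word list is obtained or how novelty is checked.

```python
from typing import Iterable, List, Set


def screen_new_words(
    submissions: Iterable[str],
    sensitive_words: Set[str],
    known_words: Set[str],
) -> List[str]:
    """Return submitted words with duplicates, sensitive words and known words removed.

    `known_words` stands in for an existing dictionary. Checking novelty by
    simple membership is an assumption, not something the text specifies.
    """
    seen: Set[str] = set()
    kept: List[str] = []
    for raw in submissions:
        word = raw.strip()
        if not word:
            continue  # ignore empty submissions
        if word in seen:
            continue  # repeated word
        seen.add(word)
        if word in sensitive_words:
            continue  # sensitive word
        if word in known_words:
            continue  # already in the dictionary, so not a new word
        kept.append(word)
    return kept
```

Calling `screen_new_words(["给力", "给力", " 点赞 "], set(), set())` keeps one copy of each word in first-seen order. Order is preserved so that later stages can still tell which words were submitted first.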
In a further improvement, the game client in the new word acquisition module is a web client, an APP client or an application program client. This makes new word collection more engaging and helps motivate the public to submit new words.
In a further improvement, the network new word corpus module comprises a storage submodule for storage and an editing submodule for editing new words. The editing submodule generally comprises an extraction module, a classification module, an input module, a saving module and an output module. This gives the corpus good editability and makes manual intervention easy, which helps ensure the quality of the network new word corpus module.
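The storage and editing submodules can be pictured with a small in-memory sketch. The class and method names below (`NewWordCorpus`, `input_word`, `extract`, `classify`, `save`, `output`) are hypothetical and simply mirror the five editing modules named above; a real corpus would presumably sit on a database rather than a dictionary.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class CorpusEntry:
    word: str
    category: str = "uncategorized"  # hypothetical default label


@dataclass
class NewWordCorpus:
    """Storage submodule (a dict) plus the editing operations named in the text."""

    entries: Dict[str, CorpusEntry] = field(default_factory=dict)

    def input_word(self, word: str, category: str = "uncategorized") -> None:
        # Input module: add a word unless it is already stored.
        self.entries.setdefault(word, CorpusEntry(word, category))

    def extract(self, word: str) -> Optional[CorpusEntry]:
        # Extraction module: look up one entry for manual review.
        return self.entries.get(word)

    def classify(self, word: str, category: str) -> None:
        # Classification module: raises KeyError if the word is not stored.
        self.entries[word].category = category

    def save(self) -> str:
        # Saving module: serialize the whole corpus to a JSON string.
        return json.dumps([asdict(e) for e in self.entries.values()], ensure_ascii=False)

    def output(self, category: str) -> List[str]:
        # Output module: list the stored words of one category.
        return sorted(e.word for e in self.entries.values() if e.category == category)
```

Keeping each editing operation as a separate method makes manual intervention straightforward: a reviewer can extract an entry, reclassify it and save the corpus again without touching the other modules.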
A network new word discovery method based on crowdsourcing technology comprises the following steps:
S1: new word acquisition: reach the general public through a game client, through which the public takes part in labeling network new words;
S2: new word screening: screen the network new words labeled by the public and remove repeated words and sensitive words;
S3: new word learning and training: take the network new words as seeds and use a support vector machine classifier for incremental learning and training;
S4: build the network new word corpus: store and record the network new words and enter them into the network new word corpus;
S5: form the new word discovery module: classify the new words by category, so that the new word discovery module is organized by category;
S6: form the new word output module: collate and store the discovered new words so that the new words of each category form an independent set awaiting invocation, which constitutes the new word output module.
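Step S3 names a support vector machine classifier with incremental learning but gives no details. One possible reading is a linear SVM updated by stochastic sub-gradient steps on the hinge loss, so that each new batch of seed words adjusts the existing weights instead of retraining from scratch. Everything in the sketch below is an assumption made for illustration: the character-based features, the use of ordinary words as negative examples, and the hyperparameters.

```python
from typing import Dict, Iterable, Tuple

Features = Dict[str, float]


def char_features(word: str) -> Features:
    """Hypothetical features: character unigrams and bigrams plus a bias term."""
    feats: Features = {"bias": 1.0}
    for ch in word:
        feats["c=" + ch] = 1.0
    for a, b in zip(word, word[1:]):
        feats["b=" + a + b] = 1.0
    return feats


class IncrementalLinearSVM:
    """Linear SVM trained by sub-gradient descent on the L2-regularized hinge loss."""

    def __init__(self, learning_rate: float = 0.1, reg: float = 0.01) -> None:
        self.w: Dict[str, float] = {}
        self.lr = learning_rate
        self.reg = reg

    def score(self, x: Features) -> float:
        return sum(self.w.get(k, 0.0) * v for k, v in x.items())

    def partial_fit(self, batch: Iterable[Tuple[Features, int]], epochs: int = 20) -> None:
        """Update the existing weights from a new batch; labels are +1 or -1."""
        data = list(batch)
        for _ in range(epochs):
            for x, y in data:
                margin = y * self.score(x)
                # Regularization step: shrink every weight slightly.
                for k in list(self.w):
                    self.w[k] *= 1.0 - self.lr * self.reg
                # Hinge-loss step: move toward the example only if it is
                # misclassified or inside the margin.
                if margin < 1.0:
                    for k, v in x.items():
                        self.w[k] = self.w.get(k, 0.0) + self.lr * y * v

    def predict(self, x: Features) -> int:
        return 1 if self.score(x) >= 0.0 else -1
```

A first batch might be passed as `svm.partial_fit([(char_features("给力"), 1), (char_features("的了"), -1)])`, and later batches of screened words are passed to the same `svm` object. Because the weights persist between calls, earlier seeds continue to influence predictions, which is the incremental behaviour the step describes.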
In a further improvement, the new word discovery module obtained in step S6 is classified according to the national economy industry categories and collated for storage, which simplifies later packaging and use.
In a further improvement, the network new word corpus module comprises a storage submodule and an editing submodule. The editing submodule edits the new words in the network new word corpus module and stores them in the storage submodule, where they await invocation and further editing. This strong editability supports later screening of network new words and further quality control.
In a further improvement, the game client in the new word acquisition module is a web client, an APP client or an application program client. This makes new word collection more engaging and helps motivate the public to submit new words.
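Grouping discovered words by industry category so that each category can be packaged on its own might look like the sketch below. The function name `group_by_category` and the category labels used in the usage note are hypothetical; the text does not list the categories or say how a word is assigned to one.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_category(classified: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (word, category) pairs into one sorted word list per category."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word, category in classified:
        if word not in groups[category]:
            groups[category].append(word)  # keep each word once per category
    # Sort categories and words so the stored output is stable between runs.
    return {cat: sorted(words) for cat, words in sorted(groups.items())}
```

For example, `group_by_category([("秒杀", "retail"), ("众筹", "finance")])` returns one list per category, and each list can then be stored or exported independently.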
Compared with the prior art, the beneficial effects of the invention are as follows. With the spread of the Internet and the development of technology, people rely more and more on the network, and online chat and online shopping are growing by the day. A large number of network new words have appeared as a result, so discovering network new words has become an important research topic in natural language research, and the accuracy of network new word recognition affects related research that uses web text as its corpus resource.
The device and method use a game client as the collection end of the new word acquisition module, which saves considerable labor cost.
The automatic new word screening module and the learning and training module filter and pre-process the new words submitted by the public, ensuring that the new words are both novel and legitimate.
New words that meet these requirements are entered into the network new word corpus module for storage and recording, and are then collated and classified by the new word discovery module and the new word output module to form an independent product with good market potential.
The network new word discovery device and method of the invention shorten the new word discovery cycle and reduce labor input, thereby lowering the cost of new word discovery.
Brief description of the drawings
Fig. 1 is a structural block diagram of an embodiment of the network new word discovery device based on crowdsourcing technology;
Fig. 2 is a flowchart of an embodiment of the network new word discovery method based on crowdsourcing technology.
Detailed description of the embodiments
To deepen understanding of the invention, it is further described below with reference to the drawings and an embodiment. The embodiment serves only to explain the invention and does not limit its scope of protection.
As shown in Fig. 1, the network new word discovery device based on crowdsourcing technology comprises, connected in sequence, a new word acquisition module, a new word screening module, a learning and training module, a network new word corpus, a new word discovery module and a new word output module, wherein:
The new word acquisition module reaches the general public through a game client, and the public takes part in labeling network new words through the game client;
The new word screening module screens the network new words labeled by the public and removes repeated words and sensitive words;
The learning and training module takes the network new words as seeds and inputs them into a support vector machine classifier for incremental learning and training;
The network new word corpus module stores and records the network new words and enters them into the network new word corpus;
The new word discovery module classifies the new words in the network new word corpus by category and discovers network new words according to a predetermined algorithm;
The new word output module collates and stores the discovered network new words, so that the new words of each category in the new word discovery module form an independent set awaiting invocation, and outputs them.
The device and method use a game client as the collection end of the new word acquisition module, which saves considerable labor cost.
The automatic new word screening module and the learning and training module filter and pre-process the new words submitted by the public, ensuring that the new words are both novel and legitimate.
New words that meet these requirements are entered into the network new word corpus module for storage and recording, and are then collated and classified by the new word discovery module and the new word output module to form an independent product with good market potential.
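Because the acquisition module collects labels from many players, the labels usually need to be combined before screening. The sketch below keeps a word only when enough distinct players have labeled it. This agreement threshold (`min_players`) and the `(player_id, word)` record format are assumptions for illustration; the embodiment does not state how individual labels are combined.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def aggregate_annotations(
    annotations: Iterable[Tuple[str, str]],
    min_players: int = 2,
) -> List[str]:
    """Return words labeled as new by at least `min_players` distinct players.

    Each annotation is a hypothetical (player_id, word) pair sent by the
    game client.
    """
    voters: Dict[str, Set[str]] = defaultdict(set)
    for player_id, word in annotations:
        word = word.strip()
        if word:
            voters[word].add(player_id)  # a player counts once per word
    return sorted(w for w, players in voters.items() if len(players) >= min_players)
```

Counting distinct players rather than raw submissions means one enthusiastic player cannot push a word through by submitting it repeatedly.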
In the above embodiment, the game client in the new word acquisition module is a web client, an APP client or an application program client. This makes new word collection more engaging and helps motivate the public to submit new words.
In the above embodiment, the network new word corpus module comprises a storage submodule for storage and an editing submodule for editing new words. The editing submodule generally comprises an extraction module, a classification module, an input module, a saving module and an output module. This gives the corpus good editability and makes manual intervention easy, which helps ensure the quality of the network new word corpus module.
As shown in Fig. 2, the network new word discovery method based on crowdsourcing technology comprises the following steps:
S1: new word acquisition: reach the general public through a game client, through which the public takes part in labeling network new words;
S2: new word screening: screen the network new words labeled by the public and remove repeated words and sensitive words;
S3: new word learning and training: take the network new words as seeds and use a support vector machine classifier for incremental learning and training;
S4: build the network new word corpus: store and record the network new words and enter them into the network new word corpus;
S5: form the new word discovery module: classify the new words by category, so that the new word discovery module is organized by category;
S6: form the new word output module: collate and store the discovered new words so that the new words of each category form an independent set awaiting invocation, which constitutes the new word output module.
In the above embodiment, the new word discovery module obtained in step S6 is classified according to the national economy industry categories and collated for storage, which simplifies later packaging and use.
In the above embodiment, the network new word corpus module comprises a storage submodule and an editing submodule. The editing submodule edits the new words in the network new word corpus module and stores them in the storage submodule, where they await invocation and further editing. This strong editability supports later screening of network new words and further quality control.
In the above embodiment, the game client in the new word acquisition module is a web client, an APP client or an application program client. This makes new word collection more engaging and helps motivate the public to submit new words.
Compared with the prior art, the beneficial effects of the invention are as follows. With the spread of the Internet and the development of technology, people rely more and more on the network, and online chat and online shopping are growing by the day. A large number of network new words have appeared as a result, so discovering network new words has become an important research topic in natural language research, and the accuracy of network new word recognition affects related research that uses web text as its corpus resource.
The device and method use a game client as the collection end of the new word acquisition module, which saves considerable labor cost.
The automatic new word screening module and the learning and training module filter and pre-process the new words submitted by the public, ensuring that the new words are both novel and legitimate.
New words that meet these requirements are entered into the network new word corpus module for storage and recording, and are then collated and classified by the new word discovery module and the new word output module to form an independent product with good market potential.
The network new word discovery device and method of the invention shorten the new word discovery cycle and reduce labor input, thereby lowering the cost of new word discovery.
The embodiments disclosed here are preferred embodiments, but the invention is not limited to them. Those of ordinary skill in the art can readily grasp the spirit of the invention from the above embodiments and make various extensions and changes; as long as these do not depart from the spirit of the invention, they fall within its scope of protection.

Claims (8)

1. A network new word discovery device based on crowdsourcing technology, characterized in that it comprises, connected in sequence, a new word acquisition module, a new word screening module, a learning and training module, a network new word corpus, a new word discovery module and a new word output module, wherein:
the new word acquisition module reaches the general public through a game client, and the public takes part in labeling network new words through the game client;
the new word screening module screens the network new words labeled by the public and removes repeated words and sensitive words;
the learning and training module takes the network new words as seeds and inputs them into a support vector machine classifier for incremental learning and training;
the network new word corpus module stores and records the network new words and enters them into the network new word corpus;
the new word discovery module classifies the new words in the network new word corpus by category and discovers new words according to a predetermined algorithm;
the new word output module collates and stores the discovered new words, so that the new words of each category in the new word discovery module form an independent set awaiting invocation, and outputs the discovered network new words.
2. The network new word discovery device based on crowdsourcing technology according to claim 1, characterized in that the game client in the new word acquisition module is a web client, an APP client or an application program client.
3. The network new word discovery device based on crowdsourcing technology according to claim 1, characterized in that the network new word corpus module comprises a storage submodule and an editing submodule.
4. The network new word discovery device based on crowdsourcing technology according to claim 3, characterized in that the editing submodule comprises an extraction module, a classification module, an input module, a saving module and an output module.
5. A network new word discovery method based on crowdsourcing technology, characterized in that it comprises the following steps:
S1: new word acquisition: reach the general public through a game client, through which the public takes part in labeling network new words;
S2: new word screening: screen the network new words labeled by the public and remove repeated words and sensitive words;
S3: new word learning and training: take the network new words as seeds and use a support vector machine classifier for incremental learning and training;
S4: build the network new word corpus: store and record the network new words and enter them into the network new word corpus;
S5: form the new word discovery module: classify the new words by category, so that the new word discovery module is organized by category;
S6: form the new word output module: collate and store the discovered new words so that the new words of each category form an independent set awaiting invocation, which constitutes the new word output module.
6. The network new word discovery method based on crowdsourcing technology according to claim 5, characterized in that the new word discovery module obtained in step S6 is classified according to the national economy industry categories and collated for storage.
7. The network new word discovery device based on crowdsourcing technology according to claim 5, characterized in that the network new word corpus module comprises a storage submodule and an editing submodule, and the editing submodule edits the new words in the network new word corpus module and stores them in the storage submodule.
8. The network new word discovery device based on crowdsourcing technology according to claim 5, characterized in that the game client in the new word acquisition module is a web client, an APP client or an application program client.
CN201510785868.1A 2015-11-16 2015-11-16 Network new word discovery device and method based on crowdsourcing technology Pending CN105260482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510785868.1A CN105260482A (en) 2015-11-16 2015-11-16 Network new word discovery device and method based on crowdsourcing technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510785868.1A CN105260482A (en) 2015-11-16 2015-11-16 Network new word discovery device and method based on crowdsourcing technology

Publications (1)

Publication Number Publication Date
CN105260482A true CN105260482A (en) 2016-01-20

Family

ID=55100172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510785868.1A Pending CN105260482A (en) 2015-11-16 2015-11-16 Network new word discovery device and method based on crowdsourcing technology

Country Status (1)

Country Link
CN (1) CN105260482A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1637744A (en) * 2004-01-09 2005-07-13 微软公司 Machine-learned approach to determining document relevance for search over large electronic collections of documents
CN101261623A (en) * 2007-03-07 2008-09-10 国际商业机器公司 Word splitting method and device for word border-free mark language based on search

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙晓等: "基于深层结构模型的新词发现与情感倾向判定", 《计算机科学》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291722A (en) * 2016-03-30 2017-10-24 阿里巴巴集团控股有限公司 The sorting technique and equipment of a kind of descriptor
CN107291722B (en) * 2016-03-30 2020-12-04 阿里巴巴集团控股有限公司 Descriptor classification method and device
CN106528523A (en) * 2016-09-22 2017-03-22 中山大学 Network neologism identification method
CN106528523B (en) * 2016-09-22 2019-05-10 中山大学 A kind of network new word identification method
CN111274404A (en) * 2020-02-12 2020-06-12 杭州量知数据科技有限公司 Small sample entity multi-field classification method based on man-machine cooperation

Similar Documents

Publication Publication Date Title
CN106909654B (en) Multi-level classification system and method based on news text information
CN107766371A (en) A kind of text message sorting technique and its device
CN105843965B (en) A kind of Deep Web Crawler form filling method and apparatus based on URL subject classification
CN104346354B (en) It is a kind of that the method and device for recommending word is provided
CN105260482A (en) Network new word discovery device and method based on crowdsourcing technology
CN104317970A (en) Data flow type processing method based on data processing center
CN106033462A (en) Neologism discovering method and system
CN102509001B (en) Method for automatically removing time sequence data outlier point
CN113495959B (en) Financial public opinion identification method and system based on text data
CN104778208A (en) Method and system for optimally grasping search engine SEO (search engine optimization) website data
CN105426358A (en) Automatic disease noun identification method
CN102708164A (en) Method and system for calculating movie expectation
CN103049581A (en) Web text classification method based on consistency clustering
CN107305555A (en) Data processing method and device
CN106227770B (en) A kind of intelligentized news web page information extraction method
CN108304509A (en) A kind of comment spam filter method for indicating mutually to learn based on the multidirectional amount of text
JP2017532675A5 (en)
WO2022262586A1 (en) Method for plant identification, computer system and computer-readable storage medium
CN110599232A (en) Consumption group analysis method based on big data
CN104182387A (en) Text emotional tendency analysis system
CN104573101B (en) A kind of data flow real-time grading method and system of rule-based route
CN104361061A (en) WEB page information sensing and collecting method
CN105956070A (en) Method and system for integrating repetitive records
CN104331507A (en) Method and device for automatically finding and classifying machine data categories
CN108255880B (en) Data processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Liang Yinghong

Inventor before: Liang Yinghong

Inventor before: Xu Nan

Inventor before: Yang Ronggen

COR Change of bibliographic data
RJ01 Rejection of invention patent application after publication

Application publication date: 20160120
