CN105354262A - Message text label extraction method - Google Patents

Info

Publication number
CN105354262A
Authority
CN
China
Prior art keywords
identity
label
label information
note
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510697001.0A
Other languages
Chinese (zh)
Inventor
章宦记
王建
庞彦伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201510697001.0A
Publication of CN105354262A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model

Abstract

The present invention relates to a message text label extraction method. The method comprises: for an existing message text, mining a notification-type message by compiling a regular expression; using mined XX as identity label information of the message text; for a mined notification-type message text identity of this kind, taking identity label information with the highest frequency as final identity label information of a service number in a manner of taking a threshold; and performing time update. The message text label extraction method provided by the present invention can achieve rapid update and iteration.

Description

A method for extracting short message text labels
Technical field
The invention relates to the application of natural language processing to short message (SMS) text. A method for extracting labels from short message text is designed in order to classify short message text.
Background art
In recent years, text analysis methods have emerged one after another in natural language processing, but analyzing text always requires a corpus as a basis: the content of interest is analyzed with the help of a corpus that already carries labels. Before a large amount of text is processed, a certain amount of it is usually labeled manually, for example with its topic or word count. This is a very time-consuming process, and it is common for one or two months of work to label only a very small portion of the data. In particular, classifying short message text requires a large number of labeled messages to train a model, after which test data are used to verify the trained model. The labeled messages used for training are often also obtained by manual labeling, which is time-consuming and laborious.
Summary of the invention
Based on the characteristics of notification-type short message text, the present invention provides a way to label notification-type short message text with an automated script, without manual labeling. The technical solution adopted is:
A method for extracting short message text labels, comprising the following aspects:
Regular expression module: for existing short message texts, first, determine from the service number which messages are notification-type messages, and mine all notification-type messages by writing regular expressions; second, mine the identity according to the position where the identity of a notification-type message appears and according to its text pattern. If the mined identity information is XX, XX is used as the identity label information of the message text; if nothing can be mined, the message has no corresponding identity label information;
Threshold module: for the identities mined from notification-type message texts, apply a threshold and take the identity label information with the highest frequency as the final identity label information of the service number. If no identity label information is mined from any message of a service number, the service number has no corresponding label. If the ratio of the number of messages sent by a service number from which identity label information was mined to the total number of messages it sent is below a certain threshold, the service number is likewise considered to have no corresponding identity label information;
Time update module: at regular intervals, the identity label information extracted for a service number over the most recent period is compared with the service label information saved for the previous period. If the label information extracted in the most recent period is relatively concentrated on one label and that label differs from the label extracted in the previous period, the mined service-number identity label information is updated automatically and the current label becomes the identity label of the service number; otherwise the service-number identity label of the previous period is retained unchanged.
The beneficial effects of the invention are as follows. Based on the characteristics of notification-type text itself, the invention uses regular expressions to mine the text according to the keyword-in-brackets pattern and the message-content pattern, and then extracts the relevant information from the text itself in combination with the historical statistical distribution of the text, which avoids the bias of manual mining. Identity mining is performed on the multiple message texts corresponding to one service number and yields multiple candidates, which avoids the trouble of manual retrieval and checking; the identity with the highest frequency is finally selected as the identity information of the service number. The labeled corpus also provides sufficient material for other downstream corpus-processing applications. Because an automated script is used, rapid update and iteration can be achieved on the product's production line.
Detailed description of the embodiments
Notification-type short messages are generally sent from service numbers beginning with the digits 106. The identity of the sender often appears at the beginning of the message text in square or round brackets, or the message content contains patterns such as "XX reminds you" or "welcome to call XX". By analyzing these cases and mining the bracketed content and the XX in the message text as the label of the notification-type message text, the time and effort of manual labeling can be greatly reduced, and errors caused by subjective human judgment can also be reduced, which improves the accuracy of notification-type message identification. The technical solution of the invention is as follows:
Regular expression module: for existing short message texts, first, determine which messages are notification-type messages. Service messages from numbers beginning with the digits 106 are generally regarded as notification-type messages, and all notification-type messages from numbers beginning with 106 are mined by writing a regular expression. Second, the identity of a notification-type message generally appears in brackets at the beginning or end of the message text. If there are no brackets at either the beginning or the end, check whether the message content contains text patterns such as "XX reminds you" or "XX notifies you", and use the mined XX as the identity label information of the message text. If neither case applies, the message has no corresponding identity information.
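The two steps of the regular expression module can be sketched in Python as follows. This is only an illustration under stated assumptions, not the expressions used by the applicant: the names `is_notification` and `extract_identity` are hypothetical, the set of accepted bracket characters is assumed, and the in-text patterns are English stand-ins for the original Chinese phrases.

```python
import re

# Step 1: notification-type messages come from service numbers starting with 106.
SERVICE_NUMBER_RE = re.compile(r"^106\d+$")

# Step 2a: identity in brackets at the very beginning or very end of the text.
# Assumption: ASCII and full-width square/round brackets are all accepted.
LEADING_BRACKET_RE = re.compile(r"^\s*[\[【(（]([^\]】)）]+)[\]】)）]")
TRAILING_BRACKET_RE = re.compile(r"[\[【(（]([^\]】)）]+)[\]】)）]\s*$")

# Step 2b: in-text patterns, tried only when no bracket is found.
# Assumption: English stand-ins for "XX reminds you" / "welcome to call XX".
TEXT_PATTERNS = [
    re.compile(r"^\s*(.+?) (?:reminds|notifies) you\b"),
    re.compile(r"\bwelcome to call ([^,.;]+)", re.IGNORECASE),
]


def is_notification(service_number: str) -> bool:
    """Return True if the service number looks like a 106 notification number."""
    return SERVICE_NUMBER_RE.match(service_number) is not None


def extract_identity(text: str):
    """Return the mined identity XX, or None if neither case applies."""
    for pattern in (LEADING_BRACKET_RE, TRAILING_BRACKET_RE, *TEXT_PATTERNS):
        match = pattern.search(text)
        if match:
            return match.group(1).strip()
    return None
```

Bracket patterns are tried before the in-text patterns, which mirrors the order given in the description: the text patterns are consulted only when no bracketed identity is found at the beginning or end.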
Threshold module: for the identities mined from notification-type message texts, one 106 service number may correspond to several pieces of identity label information. A threshold is applied, and the identity label information with the highest frequency is taken as the final identity label information of the service number. If no identity label information is mined from any message of a service number, the service number has no corresponding label. If the ratio of the number of messages sent by a 106 service number from which identity information was mined to the total number of messages it sent is below a certain threshold, the 106 service number is likewise considered to have no corresponding label.
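The threshold rule can be sketched as a small Python function. The function name `choose_label` and the default threshold of 0.6 (the value used in the worked example of this description) are assumptions made for illustration; the input is assumed to be one mined identity per message, with `None` where nothing was mined.

```python
from collections import Counter


def choose_label(identities, threshold=0.6):
    """Pick the final identity label of one service number.

    identities: one entry per message sent by the service number,
                None where no identity was mined (assumed input format).
    Returns the most frequent label, or None if the share of messages
    with a mined identity is below the threshold.
    """
    total = len(identities)
    mined = [label for label in identities if label is not None]
    if total == 0 or not mined:
        return None  # nothing mined: the service number has no label
    if len(mined) / total < threshold:
        return None  # too few messages carry an identity
    # Highest-frequency label wins; on ties, Counter.most_common keeps
    # the label that was encountered first.
    return Counter(mined).most_common(1)[0][0]
```

For instance, a number with 50 labeled and 50 unlabeled messages has a ratio of 0.5 and therefore receives no label at a threshold of 0.6, while a number whose 100 messages split 80/10/10 across three labels receives the label with 80 occurrences.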
Time update module: because a 106 service number may be purchased by a different company from time to time, at regular intervals the time update module compares the label information extracted for a service number over the most recent period with the service label information saved for the previous period. If the label information extracted in the most recent period is relatively concentrated on one label and that label differs from the label extracted in the previous period, the time update module automatically updates the mined service-number identity information and the current label becomes the identity label of the service number; otherwise the service-number identity label of the previous period is retained unchanged.
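A minimal sketch of the update rule follows. The description does not quantify "relatively concentrated", so the parameter `min_share` (the share of recent messages that must carry the leading label) is a hypothetical choice, as is the function name `update_label`.

```python
from collections import Counter


def update_label(stored_label, recent_identities, min_share=0.6):
    """Return the label to keep for a service number after one update cycle.

    stored_label:      label saved for the previous period (may be None).
    recent_identities: one mined identity per message of the most recent
                       period, None where nothing was mined.
    min_share:         assumed definition of "relatively concentrated".
    """
    mined = [label for label in recent_identities if label is not None]
    if not mined:
        return stored_label  # nothing new was mined: keep the old label
    candidate, count = Counter(mined).most_common(1)[0]
    concentrated = count / len(recent_identities) >= min_share
    if concentrated and candidate != stored_label:
        return candidate  # the number appears to have changed hands
    return stored_label
```

Keeping the stored label when the recent period is not concentrated on a single label is one conservative reading of "otherwise the label of the previous period is retained".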
The invention is described below with reference to an embodiment.
Suppose a large amount of short message data has been collected in the following forms:
106123456, [Talent Management] invites you to participate in campus recruiting, 2010.05.11. 106123456, [Talent Management] invites you to participate in campus recruiting, 2010.05.11. 106123456, [Talent Management] invites you to participate in campus recruiting, 2010.05.11. 100 messages in total; every message carries the label "Talent Management" in brackets.
10678456, [Top Property] welcomes you home, 2010.05.11. 10678456, Top Property wishes you a safe journey, 2010.05.11 ... 10678456, thank you for visiting our company, 2010.05.11. Of these, 50 messages contain brackets and 50 do not.
1065678, welcome to call Jin Lin Hotel, for details ask the information desk, 2010.05.11. 1065678, welcome to call Jin Lin Hotel, for details please consult the front desk, 2010.05.11. 1065678, welcome to call Jin Lin Hotel, for details call 3344556677, 2010.05.11. 100 messages in total; every message contains the pattern "welcome to call Jin Lin Hotel".
1065678, welcome to call Shanxi Noodle Shop, for details ask the information desk, 2010.06.11. 1065678, welcome to call Shanxi Noodle Shop, for details please consult the front desk, 2010.06.11. 1065678, welcome to call Shanxi Noodle Shop, for details call 3344556677, 2010.06.11. 100 messages in total; every message contains the pattern "welcome to call Shanxi Noodle Shop".
106778899, [Talent Management] invites you to participate in campus recruiting, 2010.06.11. 106778899, [Talent Management] looks forward to your participation in the special job fair, 2010.06.11. 106778899, [Grand Property] asks you to pay your fees on time, 2010.06.11. 106778899, [Disease Center] reminds you to watch the weather conditions, 2010.06.11. Several bracketed identity labels appear: 80 messages are marked "Talent Management", 10 "Grand Property" and 10 "Disease Center", 100 in total.
For the service messages from numbers beginning with 106 above, after passing through the two main modules (regular expression and threshold) and the time update module, each service number can obtain a corresponding label, and the time update module retains the label corresponding to the most recent period for each service number. Service number 106123456 obtains the label "Talent Management". Some of the messages sent by service number 10678456 yield the label "Top Property", but if the threshold is 0.6, at least 60 of the 100 messages must yield a label before the service number is considered to have one. Only 50 of its 100 messages contain brackets, so the threshold is not met and the service number has no corresponding label. For 1065678, the message content pattern matches "Jin Lin Hotel". After some time, the messages sent by 1065678 change from the "Jin Lin Hotel" content sent in May 2010 to the "Shanxi Noodle Shop" content sent in June 2010. The time update module compares the previously retained "Jin Lin Hotel" label with the current "Shanxi Noodle Shop" label; because the June messages with "Shanxi Noodle Shop" content are closer to the current time than the May messages with "Jin Lin Hotel" content, 1065678 is labeled "Shanxi Noodle Shop". For 106778899, although there are three label categories, "Talent Management" accounts for the largest share at 80%, so the final label of 106778899 is also "Talent Management".
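The worked example can be replayed end to end with a compact Python sketch. Everything here is an illustrative assumption: the sample messages are paraphrased English stand-ins with the counts given above, the helper names `mine` and `label_numbers` are hypothetical, and the time update is simplified to "a later period's label replaces the stored one when it passes the threshold".

```python
import re
from collections import Counter, defaultdict

BRACKET_RE = re.compile(r"^\s*\[([^\]]+)\]")  # identity in leading brackets
CALL_RE = re.compile(r"welcome to call ([^,.;]+)", re.IGNORECASE)  # text pattern


def mine(text):
    """Mine an identity from one message, or return None."""
    match = BRACKET_RE.search(text) or CALL_RE.search(text)
    return match.group(1).strip() if match else None


def label_numbers(messages, threshold=0.6):
    """Map each 106 service number in one period to its label (or None)."""
    by_number = defaultdict(list)
    for number, text in messages:
        if number.startswith("106"):
            by_number[number].append(mine(text))
    labels = {}
    for number, identities in by_number.items():
        mined = [i for i in identities if i is not None]
        if mined and len(mined) / len(identities) >= threshold:
            labels[number] = Counter(mined).most_common(1)[0][0]
        else:
            labels[number] = None
    return labels


may = (
    [("106123456", "[Talent Management] invites you to campus recruiting")] * 100
    + [("10678456", "[Top Property] welcomes you home")] * 50
    + [("10678456", "Thank you for visiting our company")] * 50
    + [("1065678", "Welcome to call Jin Lin Hotel, ask the front desk for details")] * 100
)
june = (
    [("1065678", "Welcome to call Shanxi Noodle Shop, ask the front desk for details")] * 100
    + [("106778899", "[Talent Management] invites you to campus recruiting")] * 80
    + [("106778899", "[Grand Property] asks you to pay your fees on time")] * 10
    + [("106778899", "[Disease Center] reminds you to watch the weather")] * 10
)

labels = label_numbers(may)
# Simplified time update: a label mined in the later period replaces the old one.
for number, label in label_numbers(june).items():
    if label is not None:
        labels[number] = label
print(labels)
```

Under these assumptions the sketch arrives at the labels stated in the text: "Talent Management" for 106123456 and 106778899, no label for 10678456, and "Shanxi Noodle Shop" for 1065678 after the June update.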

Claims (1)

1. A method for extracting short message text labels, comprising the following aspects:
Regular expression module: for existing short message texts, first, determine from the service number which messages are notification-type messages, and mine all notification-type messages by writing regular expressions; second, mine the identity according to the position where the identity of a notification-type message appears and according to its text pattern. If the mined identity information is XX, XX is used as the identity label information of the message text; if nothing can be mined, the message has no corresponding identity label information;
Threshold module: for the identities mined from notification-type message texts, apply a threshold and take the identity label information with the highest frequency as the final identity label information of the service number. If no identity label information is mined from any message of a service number, the service number has no corresponding label. If the ratio of the number of messages sent by a service number from which identity label information was mined to the total number of messages it sent is below a certain threshold, the service number is likewise considered to have no corresponding identity label information;
Time update module: at regular intervals, the identity label information extracted for a service number over the most recent period is compared with the service label information saved for the previous period. If the label information extracted in the most recent period is relatively concentrated on one label and that label differs from the label extracted in the previous period, the mined service-number identity label information is updated automatically and the current label becomes the identity label of the service number; otherwise the service-number identity label of the previous period is retained unchanged.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510697001.0A CN105354262A (en) 2015-10-26 2015-10-26 Message text label extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510697001.0A CN105354262A (en) 2015-10-26 2015-10-26 Message text label extraction method

Publications (1)

Publication Number Publication Date
CN105354262A (en) 2016-02-24

Family

ID=55330235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510697001.0A Pending CN105354262A (en) 2015-10-26 2015-10-26 Message text label extraction method

Country Status (1)

Country Link
CN (1) CN105354262A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120322471A1 (en) * 2011-06-16 2012-12-20 Hon Hai Precision Industry Co., Ltd. Mobile phone and method for processing short message
CN103428662A (en) * 2013-07-31 2013-12-04 广州市动景计算机科技有限公司 Short message information processing method and device
CN104301532A (en) * 2014-09-30 2015-01-21 小米科技有限责任公司 Communication message identification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SINCOOW: "【荣组儿】【新功能建议】通知类短信发件人规则自动匹配", 《MIUI米柚HTTP://WWW.MIUI.COM/FORUM.PHP?MOD=VIEWTHREAD&TID=1822143》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095745A (en) * 2016-05-27 2016-11-09 厦门市美亚柏科信息股份有限公司 Transaction record extracting method based on log and system thereof
CN107436922A (en) * 2017-07-05 2017-12-05 北京百度网讯科技有限公司 Text label generation method and device
CN107436922B (en) * 2017-07-05 2021-06-08 北京百度网讯科技有限公司 Text label generation method and device
CN109561402A (en) * 2017-09-26 2019-04-02 中国电信股份有限公司 Information acquisition method, device and mobile terminal
CN108038154A (en) * 2017-12-05 2018-05-15 北京小米移动软件有限公司 Definite method, apparatus, equipment and the storage medium of contact identity information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160224
