CN106407175A - Method and device for processing character strings in new word discovery - Google Patents

Method and device for processing character strings in new word discovery

Info

Publication number
CN106407175A
Authority
CN
China
Prior art keywords
data
character strings
position data
candidate character
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510463437.3A
Other languages
Chinese (zh)
Inventor
何鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510463437.3A
Publication of CN106407175A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for processing character strings in new word discovery. The method comprises: determining a text to be processed, wherein the text to be processed comprises at least one word-forming character string and at least one candidate character string, the word-forming character string being a character string that forms a new word in the text to be processed and the candidate character string being a character string that forms a candidate new word in the text to be processed; obtaining a subordination relation between first position data and second position data, wherein the first position data represents the position of the word-forming character string in the text to be processed and the second position data represents the position of the candidate character string in the text to be processed; and filtering the candidate character strings in the text to be processed according to the subordination relation between the first position data and the second position data. The method and device address the problem in the related art that invalid candidate character strings reduce the accuracy of new word discovery.

Description

Method and device for processing character strings in new word discovery
Technical field
The present invention relates to the technical field of new word discovery, and in particular to a method and device for processing character strings in new word discovery.
Background
New word discovery is a basic, core technology of natural language processing. A typical approach uses statistics computed between characters, such as pointwise mutual information and left and right information entropy, to judge the probability that several consecutive characters form a word, and thereby identifies new words. Candidate character strings must be generated before the statistics are computed; they are combinations of consecutive characters of varying lengths. Once the system judges that a candidate character string satisfies the new-word condition, none of its substrings should continue to take part in the calculation as new words.
For example, suppose the system judges the term "natural language processing" to be a new word. Substrings of this term, such as "natural language Place" and "right Language Processing" (character-level fragments of the original term, as rendered in translation), should all be filtered out as invalid candidate words. In addition, a substring such as "natural language" has a clear meaning and is a word in its own right, but where it occurs as part of "natural language processing" its statistics should also be excluded.
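To make the size of the problem concrete, the following minimal Python sketch enumerates the proper substrings of a recognized new word. The function name `proper_substrings`, the `min_len` parameter and the placeholder string "ABCDEF" are illustrative assumptions and do not come from the patent.

```python
def proper_substrings(word: str, min_len: int = 2) -> list[str]:
    """Return every contiguous substring of `word` that is shorter than
    `word` itself and at least `min_len` characters long."""
    n = len(word)
    result = []
    for start in range(n):
        for end in range(start + min_len, n + 1):
            if end - start < n:  # skip the full word
                result.append(word[start:end])
    return result


# "ABCDEF" is a hypothetical stand-in for a six-character new word.
subs = proper_substrings("ABCDEF")
print(len(subs), subs[:3])
```

Even a short word yields many substrings, which suggests why leaving them unfiltered can swamp the statistics.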
However, new word discovery tasks in the related art contain a large number of invalid candidate character strings. Statistics are computed for all of them, which lowers the efficiency of new word discovery. Moreover, the proportion of non-words in the set of substrings is much larger than for the new word itself, so computing statistics without filtering directly lowers the accuracy of the new word discovery task.
No effective solution has yet been proposed for the problem in the related art that invalid candidate character strings reduce the accuracy of new word discovery.
Summary of the invention
The main object of the present invention is to provide a method and device for processing character strings in new word discovery, so as to solve the problem in the related art that invalid candidate character strings reduce the accuracy of new word discovery.
To achieve this object, according to one aspect of the invention, a method for processing character strings in new word discovery is provided. The method includes: determining a text to be processed, wherein the text to be processed includes at least one word-forming character string and at least one candidate character string, the word-forming character string being a character string that forms a new word in the text to be processed and the candidate character string being a character string that forms a candidate new word in the text to be processed; obtaining a subordination relation between first position data and second position data, wherein the first position data represents the position of the word-forming character string in the text to be processed and the second position data represents the position of the candidate character string in the text to be processed; and filtering the candidate character strings in the text to be processed according to the subordination relation between the first position data and the second position data.
Further, after determining the text to be processed and before obtaining the subordination relation between the first position data and the second position data, the method also includes: obtaining a position list, wherein the position list is a list composed of the first position data and the second position data; sorting the position data in the position list according to a preset condition to obtain a position data set; and obtaining first sorting data for each word-forming character string and second sorting data for each candidate character string, wherein the first sorting data is the start and end positions of the corresponding word-forming character string in the position data set and the second sorting data is the start and end positions of the corresponding candidate character string in the position data set. Obtaining the subordination relation between the first position data and the second position data includes: judging whether each second sorting data is contained in at least one first sorting data. Filtering the candidate character strings in the text to be processed according to the subordination relation includes: filtering the candidate character strings whose second sorting data is contained in a first sorting data.
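As a point of reference, the containment judgment can be sketched in Python as a direct pairwise check, without any sorting. This is only a baseline under stated assumptions: the `Span` type, its field names, the inclusive end index and the function `filter_contained` are hypothetical and are not defined by the patent.

```python
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    start: int  # index of the first character in the text to be processed
    end: int    # index of the last character (inclusive, an assumption)


def filter_contained(words: list[Span], candidates: list[Span]) -> list[Span]:
    """Keep only the candidates that are not contained in any word-forming span."""
    kept = []
    for c in candidates:
        contained = any(w.start <= c.start and c.end <= w.end for w in words)
        if not contained:
            kept.append(c)
    return kept


words = [Span("BCDE", 2, 5)]
candidates = [Span("CD", 3, 4), Span("EF", 5, 6), Span("AB", 0, 1)]
print(filter_contained(words, candidates))
```

The pairwise check compares every candidate with every word-forming string; the sorted position data set described above is what allows the same judgment to be made in one pass.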
Further, after determining the text to be processed and before obtaining the subordination relation between the first position data and the second position data, the method also includes: creating position identifiers, wherein the position identifiers include: a first start identifier for the start position of each word-forming character string in the text to be processed, a first end identifier for the end position of each word-forming character string in the text to be processed, a second start identifier for the start position of each candidate character string in the text to be processed, and a second end identifier for the end position of each candidate character string in the text to be processed.
Further, after the position data in the position list is sorted according to the preset condition to obtain the position data set, the method also includes: detecting the state of a query switch, wherein the query switch is used for querying candidate character strings and its state is either a first query state or a second query state, the first query state indicating that querying of candidate character strings is switched on when a first start identifier is detected, and the second query state indicating that querying of candidate character strings is stopped when a first end identifier is detected; if the query switch is detected to be in the first query state, detecting whether a position identifier exists at first sorted position data, wherein the first sorted position data is the current data in the position data set; when a position identifier is detected and it is a second start identifier, creating a first mark on that second start identifier; when a position identifier is detected and it is a second end identifier, judging whether the start position of the candidate character string corresponding to that second end identifier carries the first mark; if it does, screening out that candidate character string; determining second sorted position data, wherein the second sorted position data is the data following the current data in the position data set; taking the second sorted position data as the current data; and redetermining the first sorted position data from the current data and repeating the step of detecting the state of the query switch until the position data set has been fully traversed.
Further, when detecting whether a position identifier exists at the first sorted position data, the method also includes: when a position identifier is detected and it is a first end identifier, changing the state of the query switch to the second query state; and, when the state of the query switch changes from the first query state to the second query state, if the start position of at least one candidate character string carries the first mark and the second end identifier of that candidate character string has not been detected, changing the first mark of that candidate character string to a second mark, wherein the second mark indicates that the start position of the candidate character string in the text to be processed has not been detected.
Further, after the state of the query switch is detected, the method also includes: if the query switch is detected to be in the second query state, searching the position data set for the first start identifier of the next word-forming character string; when the first start identifier of the next word-forming character string is found, changing the state of the query switch to the first query state; and again detecting whether a position identifier exists at the first sorted position data, until the position data set has been fully traversed.
To achieve this object, according to another aspect of the invention, a device for processing character strings in new word discovery is provided. The device includes: a first determining unit for determining a text to be processed, wherein the text to be processed includes at least one word-forming character string and at least one candidate character string, the word-forming character string being a character string that forms a new word in the text to be processed and the candidate character string being a character string that forms a candidate new word in the text to be processed; a first obtaining unit for obtaining a subordination relation between first position data and second position data, wherein the first position data represents the position of the word-forming character string in the text to be processed and the second position data represents the position of the candidate character string in the text to be processed; and a processing unit for filtering the candidate character strings in the text to be processed according to the subordination relation between the first position data and the second position data.
Further, the device also includes: a second obtaining unit for obtaining a position list, wherein the position list is a list composed of the first position data and the second position data; a sorting unit for sorting the position data in the position list according to a preset condition to obtain a position data set; and a third obtaining unit for obtaining first sorting data for each word-forming character string and second sorting data for each candidate character string, wherein the first sorting data is the start and end positions of the corresponding word-forming character string in the position data set and the second sorting data is the start and end positions of the corresponding candidate character string in the position data set. The first obtaining unit is also used for judging whether each second sorting data is contained in at least one first sorting data, and the processing unit is also used for filtering the candidate character strings whose second sorting data is contained in a first sorting data.
Further, the device also includes: a first creating unit for creating position identifiers, wherein the position identifiers include: a first start identifier for the start position of each word-forming character string in the text to be processed, a first end identifier for the end position of each word-forming character string in the text to be processed, a second start identifier for the start position of each candidate character string in the text to be processed, and a second end identifier for the end position of each candidate character string in the text to be processed.
Further, the device also includes: a first detecting unit for detecting the state of a query switch, wherein the query switch is used for querying candidate character strings and its state is either a first query state or a second query state, the first query state indicating that querying of candidate character strings is switched on when a first start identifier is detected, and the second query state indicating that querying of candidate character strings is stopped when a first end identifier is detected; a second detecting unit for detecting, when the query switch is detected to be in the first query state, whether a position identifier exists at first sorted position data, wherein the first sorted position data is the current data in the position data set; a second creating unit for creating a first mark on a second start identifier when a position identifier is detected and it is a second start identifier; a judging unit for judging, when a position identifier is detected and it is a second end identifier, whether the start position of the candidate character string corresponding to that second end identifier carries the first mark; a screening unit for screening out the candidate character string when the start position of the candidate character string corresponding to the second end identifier carries the first mark; a second determining unit for determining second sorted position data, wherein the second sorted position data is the data following the current data in the position data set; a third determining unit for taking the second sorted position data as the current data; and a fourth determining unit for redetermining the first sorted position data from the current data and repeating the step of detecting the state of the query switch until the position data set has been fully traversed.
In the embodiments of the present invention, the subordination relation between the first position data and the second position data is obtained, and the candidate character strings in the text to be processed are filtered according to that relation. This solves the problem in the related art that invalid candidate character strings reduce the accuracy of new word discovery, and thereby improves the accuracy of new word discovery tasks.
Brief description of the drawings
The accompanying drawings, which form part of this application, provide a further understanding of the present invention. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for processing character strings in new word discovery according to an embodiment of the present invention; and
Fig. 2 is a schematic diagram of a device for processing character strings in new word discovery according to an embodiment of the present invention.
Detailed description
It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are evidently only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second" and the like in the description, claims and drawings of this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate for the embodiments of this application described herein. Furthermore, the terms "comprise" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
According to an embodiment of the present invention, a method for processing character strings in new word discovery is provided.
Fig. 1 is a flowchart of a method for processing character strings in new word discovery according to an embodiment of the present invention. As shown in Fig. 1, the method includes steps S101 to S103:
Step S101: determine the text to be processed.
Specifically, in step S101, the text to be processed includes at least one word-forming character string and at least one candidate character string. A word-forming character string is a character string that forms a new word in the text to be processed, and a candidate character string is a character string that forms a candidate new word in the text to be processed.
It should be noted that before the system has identified any character string as a new word, it cannot distinguish which character strings are word-forming character strings and which are candidate character strings. At the start of the statistics, therefore, all character strings are held in a single character string list. After the system identifies a character string as a new word, the original list is split in two: word-forming character strings and candidate character strings.
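The split of the single character string list can be sketched in Python as follows. The function name `split_string_list` and the use of a set of recognized new words are assumptions made for illustration; the patent does not say how the system records which strings were recognized.

```python
def split_string_list(strings: list[str], new_words: set[str]) -> tuple[list[str], list[str]]:
    """Partition the original string list into word-forming strings
    (those recognized as new words) and the remaining candidate strings."""
    word_forming = [s for s in strings if s in new_words]
    candidates = [s for s in strings if s not in new_words]
    return word_forming, candidates


# Hypothetical data: "ABC" has just been recognized as a new word.
print(split_string_list(["AB", "ABC", "BC"], {"ABC"}))
```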
Step S102: obtain the subordination relation between the first position data and the second position data.
Specifically, in step S102, the first position data represents the position of a word-forming character string in the text to be processed, and the second position data represents the position of a candidate character string in the text to be processed. The relation between the first position data and the second position data is determined from the position data of the word-forming character strings and of the candidate character strings in the text to be processed, and the first position data and second position data that stand in a subordination relation are then identified. This yields the subordination relation between the first position data and the second position data.
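One way to obtain such position data is to record the start and end index of every occurrence of a string in the text. The Python sketch below assumes zero-based indices with an inclusive end position and counts overlapping occurrences; the function name `find_positions` and these conventions are illustrative assumptions, since the patent does not fix a representation.

```python
def find_positions(text: str, s: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, end inclusive, for every occurrence
    of `s` in `text`, including overlapping occurrences."""
    if not s:
        return []
    positions = []
    i = text.find(s)
    while i != -1:
        positions.append((i, i + len(s) - 1))
        i = text.find(s, i + 1)  # advance by one so overlaps are found
    return positions


print(find_positions("abcabcab", "abc"))
```

Applied once to each word-forming character string and once to each candidate character string, this gives the first and second position data respectively.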
Step S103: filter the candidate character strings in the text to be processed according to the subordination relation between the first position data and the second position data.
The method for processing character strings in new word discovery provided by this embodiment of the invention determines a text to be processed, wherein the text to be processed includes at least one word-forming character string and at least one candidate character string, the word-forming character string being a character string that forms a new word in the text to be processed and the candidate character string being a character string that forms a candidate new word in the text to be processed; obtains the subordination relation between first position data and second position data, wherein the first position data represents the position of the word-forming character string in the text to be processed and the second position data represents the position of the candidate character string in the text to be processed; and filters the candidate character strings in the text to be processed according to that subordination relation. This solves the problem in the related art that invalid candidate character strings reduce the accuracy of new word discovery, and thereby improves the accuracy of new word discovery tasks.
Optionally, to improve the efficiency of filtering the candidate character strings in the text to be processed, in the method provided by this embodiment of the invention, after determining the text to be processed and before obtaining the subordination relation between the first position data and the second position data, the method also includes: obtaining a position list, wherein the position list is a list composed of the first position data and the second position data; sorting the position data in the position list according to a preset condition to obtain a position data set; and obtaining first sorting data for each word-forming character string and second sorting data for each candidate character string, wherein the first sorting data is the start and end positions of the corresponding word-forming character string in the position data set and the second sorting data is the start and end positions of the corresponding candidate character string in the position data set. Obtaining the subordination relation between the first position data and the second position data includes: judging whether each second sorting data is contained in at least one first sorting data. Filtering the candidate character strings in the text to be processed according to the subordination relation includes: filtering the candidate character strings whose second sorting data is contained in a first sorting data.
In the method provided by this embodiment of the invention, after determining the text to be processed and before obtaining the subordination relation between the first position data and the second position data, the method also includes: creating position identifiers, wherein the position identifiers include: a first start identifier for the start position of each word-forming character string in the text to be processed, a first end identifier for the end position of each word-forming character string in the text to be processed, a second start identifier for the start position of each candidate character string in the text to be processed, and a second end identifier for the end position of each candidate character string in the text to be processed.
For example, to distinguish word-forming character strings from candidate character strings, each word-forming character string is identified as Y and each candidate character string as X. Specifically, to distinguish them better, the position information of each string is used as well: the start position of each word-forming character string in the text to be processed is identified as Ys and its end position as Ye, and the start position of each candidate character string in the text to be processed is identified as Xs and its end position as Xe.
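A minimal Python sketch of how the Ys, Ye, Xs and Xe identifiers could be attached to positions and sorted into a position data set is given below. The patent only says the list is sorted "according to a preset condition"; the tie-breaking order used here at equal positions (Ys, then Xs, then Xe, then Ye) and the names `KIND_ORDER` and `build_position_set` are assumptions chosen so that a candidate sharing a boundary with a word-forming string still counts as contained.

```python
# Assumed tie-break at equal positions: a word-forming string opens before
# candidates start, and closes after candidates end.
KIND_ORDER = {"Ys": 0, "Xs": 1, "Xe": 2, "Ye": 3}


def build_position_set(words, candidates):
    """Build a sorted list of (position, identifier, index) entries from
    (start, end) pairs with inclusive ends."""
    events = []
    for idx, (start, end) in enumerate(words):
        events.append((start, "Ys", idx))
        events.append((end, "Ye", idx))
    for idx, (start, end) in enumerate(candidates):
        events.append((start, "Xs", idx))
        events.append((end, "Xe", idx))
    events.sort(key=lambda e: (e[0], KIND_ORDER[e[1]], e[2]))
    return events


for event in build_position_set([(2, 5)], [(2, 3), (5, 6)]):
    print(event)
```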
After the position data in the position list is sorted according to the preset condition to obtain the position data set, the method also includes: detecting the state of a query switch, wherein the query switch is used for querying candidate character strings and its state is either a first query state or a second query state, the first query state indicating that querying of candidate character strings is switched on when a first start identifier is detected, and the second query state indicating that querying of candidate character strings is stopped when a first end identifier is detected; if the query switch is detected to be in the first query state, detecting whether a position identifier exists at first sorted position data, wherein the first sorted position data is the current data in the position data set; when a position identifier is detected and it is a second start identifier, creating a first mark on that second start identifier; when a position identifier is detected and it is a second end identifier, judging whether the start position of the candidate character string corresponding to that second end identifier carries the first mark; if it does, screening out that candidate character string; determining second sorted position data, wherein the second sorted position data is the data following the current data in the position data set; taking the second sorted position data as the current data; and redetermining the first sorted position data from the current data and repeating the step of detecting the state of the query switch until the position data set has been fully traversed.
It should be noted that, when detecting whether a position identifier exists at the first sorted position data, if no position identifier exists there, the second sorted position data is determined and taken as the current data; the first sorted position data is then redetermined from the current data and the step of detecting the state of the query switch is repeated until the position data set has been fully traversed.
In addition, screening out candidate character strings may be implemented by the following steps: determining that at least one word-forming character string contains the candidate character string corresponding to at least one second end identifier; and filtering the candidate character string corresponding to that at least one second end identifier.
When detecting whether a position identifier exists at the first sorted position data, the method also includes: when a position identifier is detected and it is a first end identifier, changing the state of the query switch to the second query state; and, when the state of the query switch changes from the first query state to the second query state, if the start position of at least one candidate character string carries the first mark and the second end identifier of that candidate character string has not been detected, changing the first mark of that candidate character string to a second mark, wherein the second mark indicates that the start position of the candidate character string in the text to be processed has not been detected.
After the state of the query switch is detected, the method also includes: if the query switch is detected to be in the second query state, searching the position data set for the first start identifier of the next word-forming character string; when the first start identifier of the next word-forming character string is found, changing the state of the query switch to the first query state; and again detecting whether a position identifier exists at the first sorted position data, until the position data set has been fully traversed.
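The traversal described in the preceding paragraphs can be sketched in Python as a single pass over the sorted position data. This is a reading of the description under stated assumptions, not the patented implementation: positions are (start, end) pairs with inclusive ends, word-forming strings are assumed not to overlap one another, ties at equal positions are ordered Ys, Xs, Xe, Ye, and the names `sweep_filter`, `is_check` and `on_start` are hypothetical.

```python
def sweep_filter(words, candidates):
    """Return the indices of candidates contained in some word-forming span.

    `words` and `candidates` are lists of (start, end) pairs, end inclusive.
    """
    order = {"Ys": 0, "Xs": 1, "Xe": 2, "Ye": 3}  # assumed tie-break
    events = []
    for i, (s, e) in enumerate(words):
        events += [(s, "Ys", i), (e, "Ye", i)]
    for i, (s, e) in enumerate(candidates):
        events += [(s, "Xs", i), (e, "Xe", i)]
    events.sort(key=lambda ev: (ev[0], order[ev[1]]))

    is_check = False   # query switch, initially in the second (off) state
    on_start = set()   # candidates carrying the first mark
    contained = set()  # candidates to screen out
    for _pos, kind, i in events:
        if not is_check:
            # Second query state: only look for the next first start identifier.
            if kind == "Ys":
                is_check = True
            continue
        if kind == "Xs":
            on_start.add(i)          # create the first mark
        elif kind == "Xe":
            if i in on_start:        # start and end both seen inside the word
                contained.add(i)
                on_start.discard(i)
        elif kind == "Ye":
            is_check = False         # switch to the second query state
            on_start.clear()         # unfinished candidates get the second mark
    return contained


words = [(2, 5), (10, 12)]
candidates = [(3, 4), (5, 6), (0, 1), (2, 5), (1, 3), (10, 10), (8, 9)]
print(sorted(sweep_filter(words, candidates)))
```

Because each position entry is visited once after sorting, the cost of the pass is dominated by the sort, in contrast to comparing every candidate with every word-forming string.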
Specifically, the positional relation between the first position data and the second position data takes one of three forms: containment, staggering and disjointness. Filtering the candidate character strings in the text to be processed according to the subordination relation between the first position data and the second position data means filtering the candidate character strings that satisfy the containment relation. The positional relation describes the relation between a word-forming character string and a candidate character string. In the containment relation, the start position of the candidate character string is at or after the start position of the word-forming character string and its end position is at or before the end position of the word-forming character string; that is, both the start and end positions of the candidate character string lie between the start and end positions of the word-forming character string, i.e. "Xs >= Ys && Xe <= Ye". In this case the candidate character string is a substring of the word-forming character string. In the staggered relation, exactly one of the start and end positions of the candidate character string falls between the start and end positions of the word-forming character string, and the other lies outside the range of the word-forming character string, i.e. "(Xs >= Ys && Xe > Ye) || (Xs < Ys && Xe <= Ye)". In this case the candidate character string is not a substring of the word-forming character string. In the disjoint relation, neither the start position nor the end position of the candidate character string lies within the range of the word-forming character string, i.e. "(Xs < Ys && Xe < Ys) || (Xs > Ye && Xe > Ye)". In this case, too, the candidate character string is not a substring of the word-forming character string. In the embodiments of the present invention, a single marking traversal quickly determines, for every candidate character string, whether it stands in the containment relation to a word-forming character string.
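The three conditions can be written as a small Python classifier. Two points in this sketch are assumptions rather than statements from the patent: the disjoint test is applied first, because the staggered expression as printed also matches some disjoint pairs, and a candidate that strictly encloses the word-forming string, which none of the three printed conditions covers, is reported as "other". The function name `classify` is hypothetical.

```python
def classify(xs: int, xe: int, ys: int, ye: int) -> str:
    """Classify a candidate span [xs, xe] against a word-forming span [ys, ye]."""
    if (xs < ys and xe < ys) or (xs > ye and xe > ye):
        return "disjoint"
    if xs >= ys and xe <= ye:
        return "contained"
    if (xs >= ys and xe > ye) or (xs < ys and xe <= ye):
        return "staggered"
    return "other"  # candidate strictly encloses the word-forming span


for span in [(3, 4), (5, 6), (1, 3), (0, 1), (1, 6)]:
    print(span, classify(*span, 2, 5))
```

Only candidates classified as "contained" would be filtered; staggered and disjoint candidates remain in the candidate set.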
A specific method may be as follows. Detect the state of the query switch (IsCheck), which indicates whether the mode for checking candidate character strings is currently on; IsCheck is initialized to false (i.e., the second query state above). Detect the start position match mark OnStart, which records that the start position of a candidate character string has been matched.
Traverse the sorted data (i.e., the data in the position data set above) and perform the following operations:
Detect the state of the query switch, i.e. whether the query switch is in the first query state (IsCheck=true).
If (IsCheck)
{
If the current point (i.e., the first sorted position data above, the current data in the position data set) is marked Xs (i.e., the second start identifier above), the start position of a candidate character string has been found; look up the candidate character string corresponding to the current Xs and set its start position match mark OnStart=true (i.e., the first mark above);
If the current point is marked Xe (i.e., the second end identifier above), the end position of a candidate character string has been found; look up the start position match mark of the candidate character string corresponding to the current Xe. If OnStart=true, the candidate character string has been matched and its relationship is containment; otherwise, if OnStart=false (i.e., the second mark above), its relationship is interleaved;
If the current point is marked Ye (i.e., the first end identifier above), the candidate query for this word-forming character string has ended; set IsCheck to false and reset to false the OnStart of every candidate character string whose OnStart=true. These candidate character strings belong to the interleaved relationship;
}
Else
{
Check whether the current point is marked Ys (i.e., the first start identifier above); if so, the current position is the start of a new word-forming character string, and checking of candidate character strings begins, with IsCheck=true;
}
End if
All candidate character strings not classified by the steps above belong to the disjoint relationship.
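The traversal above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the patented implementation: positions are inclusive integer indices, word-forming strings do not overlap one another, and points at the same position are ordered Ys, Xs, Xe, Ye so that the inclusive comparisons hold (the text does not spell out this tie-breaking rule). The names `filter_contained`, `is_check` and `on_start` are hypothetical, chosen to mirror IsCheck and OnStart.

```python
def filter_contained(words, candidates):
    """Return the indices of candidates contained in some word-forming span.

    words, candidates: lists of (start, end) inclusive spans.
    Assumes word-forming spans do not overlap each other.
    """
    rank = {"Ys": 0, "Xs": 1, "Xe": 2, "Ye": 3}  # assumed tie-break order
    points = []
    for start, end in words:
        points.append((start, rank["Ys"], "Ys", None))
        points.append((end, rank["Ye"], "Ye", None))
    for idx, (start, end) in enumerate(candidates):
        points.append((start, rank["Xs"], "Xs", idx))
        points.append((end, rank["Xe"], "Xe", idx))
    # Sort by position, then by the assumed tag order.
    points.sort(key=lambda p: (p[0], p[1]))

    is_check = False   # query switch, initially the second query state
    on_start = set()   # candidates whose OnStart mark is currently true
    contained = []
    for _pos, _rank, tag, idx in points:
        if is_check:
            if tag == "Xs":
                on_start.add(idx)          # start seen inside the word span
            elif tag == "Xe":
                if idx in on_start:        # start and end both inside
                    contained.append(idx)
                    on_start.discard(idx)
                # otherwise: interleaved, the candidate is kept
            elif tag == "Ye":
                is_check = False           # word span closed
                on_start.clear()           # remaining ones are interleaved
        elif tag == "Ys":
            is_check = True                # a new word span begins
    return contained
```

For example, with one word-forming span (2, 6) and candidates (3, 4), (5, 8), (0, 1), (1, 3), (2, 6), the sketch reports the candidates at indices 0 and 4 as contained. Sorting dominates the cost; the scan itself visits each point once.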
It should be noted that a word-forming character string nested inside another word-forming character string cannot occur during the traversal, i.e. the sequence {Ys1, Ys2, ..., Ye2, Ye1} cannot occur. This is because, during new word discovery, each round of statistical judgment of whether character strings form new words is constrained by word length. That is, in each step of filtering candidate character strings, the word-forming character strings in the word-forming character string list all have the same length, so the above situation does not arise.
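An implementation may want to verify this precondition before running the traversal. The following Python sketch (an illustration only; the function name `has_nested_spans` and the inclusive (start, end) representation are assumptions) reports whether any word-forming span lies inside another. Identical duplicate spans are also reported as nested.

```python
def has_nested_spans(words):
    """Return True if some span in `words` lies inside another span.

    words: list of (start, end) inclusive spans.
    """
    # Sort by start ascending and, for equal starts, by end descending, so
    # that an enclosing span always comes before the spans it encloses.
    ordered = sorted(words, key=lambda w: (w[0], -w[1]))
    max_end = None
    for _start, end in ordered:
        # Every earlier span starts at or before this one; if one of them
        # also ends at or after it, this span is nested.
        if max_end is not None and end <= max_end:
            return True
        max_end = end if max_end is None else max(max_end, end)
    return False
```

When all spans have the same length and distinct start positions, their end positions increase strictly with their start positions, so the function returns False, which matches the argument in the text.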
In the method for processing character strings in new word discovery provided by the embodiment of the present invention, a single traversal completes the relationship matching between n word-forming character strings and m candidate character strings. This quickly and effectively reduces the order of magnitude of candidate character strings in new word discovery and speeds up new word discovery. The interference of contained candidate character strings is filtered out without affecting the possibility that a sub-candidate character string, when it occurs independently as a word, is recognized by the statistics, which improves the accuracy of new word discovery. This solves the problem in the related art that invalid candidate character strings affect the accuracy of new word discovery in new word discovery tasks.
It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
The embodiment of the present invention also provides a device for processing character strings in new word discovery. It should be noted that this device may be used to execute the method for processing character strings in new word discovery provided by the embodiment of the present invention. The device is described below.
Fig. 2 is a schematic diagram of a device for processing character strings in new word discovery according to an embodiment of the present invention. As shown in Fig. 2, the device includes: a first determining unit 10, a first acquiring unit 20 and a processing unit 30.
The first determining unit 10 is configured to determine a to-be-processed text, wherein the to-be-processed text includes at least one word-forming character string and at least one candidate character string, the word-forming character string is a character string in the to-be-processed text that forms a new word, and the candidate character string is a character string in the to-be-processed text that forms a candidate new word.
The first acquiring unit 20 is configured to obtain the subordination relationship between first position data and second position data, wherein the first position data represents the position of the word-forming character string in the to-be-processed text, and the second position data represents the position of the candidate character string in the to-be-processed text.
The processing unit 30 is configured to filter the candidate character strings in the to-be-processed text according to the subordination relationship between the first position data and the second position data.
In the device for processing character strings in new word discovery provided by the embodiment of the present invention, the first determining unit 10 determines the to-be-processed text, which includes at least one word-forming character string and at least one candidate character string; the first acquiring unit 20 obtains the subordination relationship between the first position data, which represents the position of the word-forming character string in the to-be-processed text, and the second position data, which represents the position of the candidate character string in the to-be-processed text; and the processing unit 30 filters the candidate character strings in the to-be-processed text according to that subordination relationship. This solves the problem in the related art that invalid candidate character strings affect the accuracy of new word discovery in new word discovery tasks, and thereby improves the accuracy of new word discovery in such tasks.
Optionally, in order to improve the efficiency of filtering the candidate character strings in the to-be-processed text, the device for processing character strings in new word discovery provided by the embodiment of the present invention further includes: a second acquiring unit, configured to obtain a position list, wherein the position list is a list composed of the first position data and the second position data; a sorting unit, configured to sort the position data in the position list according to a preset condition to obtain a position data set; and a third acquiring unit, configured to obtain first sorted data of each word-forming character string and second sorted data of each candidate character string, wherein the first sorted data is the data of the start and end positions, in the position data set, of the word-forming character string to which it belongs, and the second sorted data is the data of the start and end positions, in the position data set, of the candidate character string to which it belongs. The first acquiring unit is further configured to judge whether each second sorted data is contained in at least one first sorted data, and the processing unit is further configured to filter the candidate character strings corresponding to the second sorted data contained in first sorted data.
Optionally, the device for processing character strings in new word discovery provided by the embodiment of the present invention further includes: a first creating unit, configured to create position identifiers, the position identifiers including: a first start identifier for the start position of each word-forming character string in the to-be-processed text, a first end identifier for the end position of each word-forming character string in the to-be-processed text, a second start identifier for the start position of each candidate character string in the to-be-processed text, and a second end identifier for the end position of each candidate character string in the to-be-processed text.
Optionally, the device for processing character strings in new word discovery provided by the embodiment of the present invention further includes: a first detecting unit, configured to detect the state of a query switch, wherein the query switch is used for querying candidate character strings, the state of the query switch includes a first query state and a second query state, the first query state indicates that querying of candidate character strings is turned on when a first start identifier is detected, and the second query state indicates that querying of candidate character strings is stopped when a first end identifier is detected; a second detecting unit, configured to detect, when the state of the query switch is detected to be the first query state, whether the first sorted position data carries a position identifier, wherein the first sorted position data is the current data in the position data set; a second creating unit, configured to create a first mark on the second start identifier when a position identifier is detected and it is a second start identifier; a judging unit, configured to judge, when a position identifier is detected and it is a second end identifier, whether the start position mark of the candidate character string corresponding to the second end identifier is the first mark; a screening unit, configured to screen out the candidate character string when the start position mark of the candidate character string corresponding to the second end identifier is the first mark; a second determining unit, configured to determine second sorted position data, wherein the second sorted position data is the data following the current data in the position data set; a third determining unit, configured to take the second sorted position data as the current data; and a fourth determining unit, configured to redetermine the first sorted position data according to the current data and to repeat the step of detecting the state of the query switch until the position data set has been fully traversed.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
It should be understood that, in the several embodiments provided in this application, the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented with a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for processing character strings in new word discovery, characterized by comprising:
determining a to-be-processed text, wherein the to-be-processed text includes at least one word-forming character string and at least one candidate character string, the word-forming character string is a character string in the to-be-processed text that forms a new word, and the candidate character string is a character string in the to-be-processed text that forms a candidate new word;
obtaining a subordination relationship between first position data and second position data, wherein the first position data is data representing the position of the word-forming character string in the to-be-processed text, and the second position data is data representing the position of the candidate character string in the to-be-processed text; and
filtering the candidate character string in the to-be-processed text according to the subordination relationship between the first position data and the second position data.
2. The method according to claim 1, characterized in that, after determining the to-be-processed text and before obtaining the subordination relationship between the first position data and the second position data, the method further comprises:
obtaining a position list, wherein the position list is a list composed of the first position data and the second position data;
sorting the position data in the position list according to a preset condition to obtain a position data set;
obtaining first sorted data of each word-forming character string and second sorted data of each candidate character string, wherein the first sorted data is the data of the start and end positions, in the position data set, of the word-forming character string to which it belongs, and the second sorted data is the data of the start and end positions, in the position data set, of the candidate character string to which it belongs,
wherein obtaining the subordination relationship between the first position data and the second position data comprises: judging whether each second sorted data is contained in at least one first sorted data,
and filtering the candidate character string in the to-be-processed text according to the subordination relationship between the first position data and the second position data comprises: filtering the candidate character strings corresponding to the second sorted data contained in first sorted data.
3. The method according to claim 2, characterized in that, after determining the to-be-processed text and before obtaining the subordination relationship between the first position data and the second position data, the method further comprises:
creating position identifiers, the position identifiers including: a first start identifier for the start position of each word-forming character string in the to-be-processed text, a first end identifier for the end position of each word-forming character string in the to-be-processed text, a second start identifier for the start position of each candidate character string in the to-be-processed text, and a second end identifier for the end position of each candidate character string in the to-be-processed text.
4. The method according to claim 3, characterized in that, after sorting the position data in the position list according to the preset condition to obtain the position data set, the method further comprises:
detecting the state of a query switch, wherein the query switch is used for querying the candidate character strings, the state of the query switch includes a first query state and a second query state, the first query state indicates that querying of the candidate character strings is turned on when a first start identifier is detected, and the second query state indicates that querying of the candidate character strings is stopped when a first end identifier is detected;
if the state of the query switch is detected to be the first query state, detecting whether the first sorted position data carries a position identifier, wherein the first sorted position data is the current data in the position data set;
when a position identifier is detected and the position identifier is a second start identifier, creating a first mark on the second start identifier;
when a position identifier is detected and the position identifier is a second end identifier, judging whether the start position mark of the candidate character string corresponding to the second end identifier is the first mark; and if the start position mark of the candidate character string corresponding to the second end identifier is the first mark, screening out the candidate character string;
determining second sorted position data, wherein the second sorted position data is the data following the current data in the position data set;
taking the second sorted position data as the current data; and
redetermining the first sorted position data according to the current data, and repeating the step of detecting the state of the query switch until the position data set has been fully traversed.
5. The method according to claim 4, characterized in that, when detecting whether the first sorted position data carries a position identifier, the method further comprises:
when a position identifier is detected and the position identifier is a first end identifier, changing the state of the query switch to the second query state;
when the state of the query switch is changed from the first query state to the second query state, if the start position of at least one candidate character string carries the first mark and the second end identifier of that candidate character string has not been detected, changing the first mark of that candidate character string to a second mark, wherein the second mark indicates that the start position of the candidate character string has not been detected in the to-be-processed text.
6. The method according to claim 4, characterized in that, after detecting the state of the query switch, the method further comprises:
if the state of the query switch is detected to be the second query state, searching the position data set for the first start identifier of the next word-forming character string;
when the first start identifier of the next word-forming character string is found, changing the state of the query switch to the first query state; and
detecting again whether the first sorted position data carries the position identifier, until the position data set has been fully traversed.
7. A device for processing character strings in new word discovery, characterized by comprising:
a first determining unit, configured to determine a to-be-processed text, wherein the to-be-processed text includes at least one word-forming character string and at least one candidate character string, the word-forming character string is a character string in the to-be-processed text that forms a new word, and the candidate character string is a character string in the to-be-processed text that forms a candidate new word;
a first acquiring unit, configured to obtain a subordination relationship between first position data and second position data, wherein the first position data is data representing the position of the word-forming character string in the to-be-processed text, and the second position data is data representing the position of the candidate character string in the to-be-processed text; and
a processing unit, configured to filter the candidate character string in the to-be-processed text according to the subordination relationship between the first position data and the second position data.
8. The device according to claim 7, characterized in that the device further comprises:
a second acquiring unit, configured to obtain a position list, wherein the position list is a list composed of the first position data and the second position data;
a sorting unit, configured to sort the position data in the position list according to a preset condition to obtain a position data set;
a third acquiring unit, configured to obtain first sorted data of each word-forming character string and second sorted data of each candidate character string, wherein the first sorted data is the data of the start and end positions, in the position data set, of the word-forming character string to which it belongs, and the second sorted data is the data of the start and end positions, in the position data set, of the candidate character string to which it belongs,
wherein the first acquiring unit is further configured to judge whether each second sorted data is contained in at least one first sorted data,
and the processing unit is further configured to filter the candidate character strings corresponding to the second sorted data contained in first sorted data.
9. The device according to claim 8, characterized in that the device further comprises:
a first creating unit, configured to create position identifiers, the position identifiers including: a first start identifier for the start position of each word-forming character string in the to-be-processed text, a first end identifier for the end position of each word-forming character string in the to-be-processed text, a second start identifier for the start position of each candidate character string in the to-be-processed text, and a second end identifier for the end position of each candidate character string in the to-be-processed text.
10. The device according to claim 9, characterized in that the device further comprises:
a first detecting unit, configured to detect the state of a query switch, wherein the query switch is used for querying the candidate character strings, the state of the query switch includes a first query state and a second query state, the first query state indicates that querying of the candidate character strings is turned on when a first start identifier is detected, and the second query state indicates that querying of the candidate character strings is stopped when a first end identifier is detected;
a second detecting unit, configured to detect, when the state of the query switch is detected to be the first query state, whether the first sorted position data carries a position identifier, wherein the first sorted position data is the current data in the position data set;
a second creating unit, configured to create a first mark on the second start identifier when a position identifier is detected and the position identifier is a second start identifier;
a judging unit, configured to judge, when a position identifier is detected and the position identifier is a second end identifier, whether the start position mark of the candidate character string corresponding to the second end identifier is the first mark;
a screening unit, configured to screen out the candidate character string when the start position mark of the candidate character string corresponding to the second end identifier is the first mark;
a second determining unit, configured to determine second sorted position data, wherein the second sorted position data is the data following the current data in the position data set;
a third determining unit, configured to take the second sorted position data as the current data; and
a fourth determining unit, configured to redetermine the first sorted position data according to the current data and to repeat the step of detecting the state of the query switch until the position data set has been fully traversed.
CN201510463437.3A 2015-07-31 2015-07-31 Method and device for processing character strings in new word discovery Pending CN106407175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510463437.3A CN106407175A (en) 2015-07-31 2015-07-31 Method and device for processing character strings in new word discovery

Publications (1)

Publication Number Publication Date
CN106407175A true CN106407175A (en) 2017-02-15

Family

ID=58007938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510463437.3A Pending CN106407175A (en) 2015-07-31 2015-07-31 Method and device for processing character strings in new word discovery

Country Status (1)

Country Link
CN (1) CN106407175A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463682A (en) * 2017-08-08 2017-12-12 深圳市腾讯计算机系统有限公司 A kind of recognition methods of keyword and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464898A (en) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 Method for extracting feature word of text
CN101751386A (en) * 2009-12-28 2010-06-23 华建机器翻译有限公司 Identification method of unknown words
CN101950306A (en) * 2010-09-29 2011-01-19 北京新媒传信科技有限公司 Method for filtering character strings in process of discovering new words
CN102831194A (en) * 2012-08-03 2012-12-19 人民搜索网络股份公司 New word automatic searching system and new word automatic searching method based on query log
CN102955771A (en) * 2011-08-18 2013-03-06 华东师范大学 Technology and system for automatically recognizing Chinese new words in single-word-string mode and affix mode
CN104679738A (en) * 2013-11-27 2015-06-03 北京拓尔思信息技术股份有限公司 Method and device for mining Internet hot words

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170215
