CN106407175A - Method and device for processing character strings in new word discovery - Google Patents
Method and device for processing character strings in new word discovery
- Publication number
- CN106407175A, CN201510463437.3A
- Authority
- CN
- China
- Prior art keywords
- data
- character strings
- position data
- candidate character
- character string
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and device for processing character strings in new word discovery. The method comprises the following steps: determining a to-be-processed text, wherein the to-be-processed text comprises at least one word-forming character string and at least one candidate character string, the word-forming character string being a character string that forms a new word in the to-be-processed text, and the candidate character string being a character string that forms a candidate new word in the to-be-processed text; obtaining a subordination relation between first position data and second position data, wherein the first position data represents the position of the word-forming character string in the to-be-processed text and the second position data represents the position of the candidate character string in the to-be-processed text; and filtering the candidate character strings in the to-be-processed text according to the subordination relation between the first position data and the second position data. The method and device solve the problem in the related art that invalid candidate character strings in new word discovery tasks reduce the accuracy of new word discovery.
Description
Technical field
The present invention relates to the technical field of new word discovery, and in particular to a method and device for processing character strings in new word discovery.
Background art
New word discovery is a basic and core technique of natural language processing. Typically, new word discovery uses statistics such as pointwise mutual information between characters and left and right information entropy to judge the probability that several consecutive characters form a word, and thereby identifies new words. Before these statistics are computed, candidate character strings must first be generated; a candidate character string is a combination of consecutive characters, and candidate character strings vary in length. Once the system has computed and judged that a candidate character string meets the new-word condition, none of the substrings of that candidate character string should continue to take part in the computation as new words.
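The candidate-generation step described above can be sketched in Python as follows. This is only an illustration under assumptions that the source does not state: the function name candidate_strings, the minimum length of 2 and the max_len parameter are hypothetical, and positions are reported as inclusive character offsets.

```python
from typing import Iterator, Tuple


def candidate_strings(text: str, max_len: int = 4) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, substring) for every run of 2..max_len consecutive characters.

    start and end are inclusive character offsets into text (an assumption
    of this sketch; the source does not fix a position convention).
    """
    for start in range(len(text)):
        for length in range(2, max_len + 1):
            stop = start + length          # exclusive slice bound
            if stop > len(text):
                break                      # longer runs would pass the end of the text
            yield start, stop - 1, text[start:stop]
```

For a text of n characters this produces on the order of n * max_len candidates, which is why the number of invalid candidates grows quickly once a long new word has been accepted.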
For example, suppose the system judges the term "natural language processing" (自然语言处理) to be a new word. Substrings of this word such as "自然语言处" and "然语言处理" should all be filtered out as invalid candidate words. In addition, substrings such as "natural language" (自然语言) do have a clear meaning and are words in their own right, but when they occur as part of "natural language processing", the statistics of these substrings should also be discarded.
However, new word discovery tasks in the related art contain a large number of invalid candidate character strings. When statistics are computed, these invalid candidate character strings are counted as well, which reduces the efficiency of new word discovery. Moreover, because the proportion of wrong words in the substring set is much larger than that of the new words themselves, computing statistics without filtering directly lowers the accuracy of the new word discovery task.
No effective solution has yet been proposed for the problem that, in new word discovery tasks in the related art, the presence of invalid candidate character strings reduces the accuracy of new word discovery.
Summary of the invention
The main object of the present invention is to provide a method and device for processing character strings in new word discovery, so as to solve the problem that, in new word discovery tasks in the related art, the presence of invalid candidate character strings reduces the accuracy of new word discovery.
To achieve the above object, according to one aspect of the present invention, a method for processing character strings in new word discovery is provided. The method includes: determining a to-be-processed text, wherein the to-be-processed text includes at least one word-forming character string and at least one candidate character string, a word-forming character string being a character string that forms a new word in the to-be-processed text, and a candidate character string being a character string that forms a candidate new word in the to-be-processed text; obtaining a subordination relation between first position data and second position data, wherein the first position data represents the position of a word-forming character string in the to-be-processed text, and the second position data represents the position of a candidate character string in the to-be-processed text; and filtering the candidate character strings in the to-be-processed text according to the subordination relation between the first position data and the second position data.
Further, after the to-be-processed text is determined and before the subordination relation between the first position data and the second position data is obtained, the method also includes: obtaining a position list, wherein the position list is a list composed of the first position data and the second position data; sorting the position data in the position list according to a preset condition to obtain a position data set; and obtaining first sorting data of each word-forming character string and second sorting data of each candidate character string, wherein the first sorting data is the start and end positions of the corresponding word-forming character string in the position data set, and the second sorting data is the start and end positions of the corresponding candidate character string in the position data set. Obtaining the subordination relation between the first position data and the second position data includes: judging whether each second sorting data is contained in at least one first sorting data. Filtering the candidate character strings in the to-be-processed text according to the subordination relation between the first position data and the second position data includes: filtering out the candidate character strings corresponding to the second sorting data that is contained in the first sorting data.
Further, after the to-be-processed text is determined and before the subordination relation between the first position data and the second position data is obtained, the method also includes: creating position identifiers, the position identifiers including: a first start identifier for the start position of each word-forming character string in the to-be-processed text, a first end identifier for the end position of each word-forming character string in the to-be-processed text, a second start identifier for the start position of each candidate character string in the to-be-processed text, and a second end identifier for the end position of each candidate character string in the to-be-processed text.
Further, after the position data in the position list is sorted according to the preset condition to obtain the position data set, the method also includes: detecting the state of a query switch, wherein the query switch is used for querying candidate character strings, and the state of the query switch includes a first query state and a second query state, the first query state being the state in which querying of candidate character strings is started when a first start identifier is detected, and the second query state being the state in which querying of candidate character strings is stopped when a first end identifier is detected; if the state of the query switch is detected to be the first query state, detecting whether a position identifier exists on first sorting position data, wherein the first sorting position data is the current data in the position data set; when a position identifier is detected and the position identifier is a second start identifier, creating a first mark on the second start identifier; when a position identifier is detected and the position identifier is a second end identifier, judging whether the start-position mark of the candidate character string corresponding to the second end identifier is the first mark; if the start-position mark of the candidate character string corresponding to the second end identifier is the first mark, screening out the candidate character string; determining second sorting position data, wherein the second sorting position data is the data following the current data in the position data set; taking the second sorting position data as the current data; and re-determining the first sorting position data according to the current data and repeating the step of detecting the state of the query switch until the position data set has been fully traversed.
Further, when detecting whether a position identifier exists on the first sorting position data, the method also includes: when a position identifier is detected and the position identifier is a first end identifier, changing the state of the query switch to the second query state; and when the state of the query switch changes from the first query state to the second query state, if the start position of at least one candidate character string is marked with the first mark and the second end identifier of that candidate character string has not been detected, changing the first mark of that candidate character string to a second mark, wherein the second mark indicates that the start position of the candidate character string in the to-be-processed text has not been detected.
Further, after the state of the query switch is detected, the method also includes: if the state of the query switch is detected to be the second query state, searching the position data set for the first start identifier of the next word-forming character string; when the first start identifier of the next word-forming character string is found, changing the state of the query switch to the first query state; and detecting again whether a position identifier exists on the first sorting position data, until the position data set has been fully traversed.
To achieve the above object, according to another aspect of the present invention, a device for processing character strings in new word discovery is provided. The device includes: a first determining unit, configured to determine a to-be-processed text, wherein the to-be-processed text includes at least one word-forming character string and at least one candidate character string, a word-forming character string being a character string that forms a new word in the to-be-processed text, and a candidate character string being a character string that forms a candidate new word in the to-be-processed text; a first obtaining unit, configured to obtain a subordination relation between first position data and second position data, wherein the first position data represents the position of a word-forming character string in the to-be-processed text, and the second position data represents the position of a candidate character string in the to-be-processed text; and a processing unit, configured to filter the candidate character strings in the to-be-processed text according to the subordination relation between the first position data and the second position data.
Further, the device also includes: a second obtaining unit, configured to obtain a position list, wherein the position list is a list composed of the first position data and the second position data; a sorting unit, configured to sort the position data in the position list according to a preset condition to obtain a position data set; and a third obtaining unit, configured to obtain first sorting data of each word-forming character string and second sorting data of each candidate character string, wherein the first sorting data is the start and end positions of the corresponding word-forming character string in the position data set, and the second sorting data is the start and end positions of the corresponding candidate character string in the position data set. The first obtaining unit is further configured to judge whether each second sorting data is contained in at least one first sorting data, and the processing unit is further configured to filter out the candidate character strings corresponding to the second sorting data that is contained in the first sorting data.
Further, the device also includes: a first creating unit, configured to create position identifiers, the position identifiers including: a first start identifier for the start position of each word-forming character string in the to-be-processed text, a first end identifier for the end position of each word-forming character string in the to-be-processed text, a second start identifier for the start position of each candidate character string in the to-be-processed text, and a second end identifier for the end position of each candidate character string in the to-be-processed text.
Further, the device also includes: a first detecting unit, configured to detect the state of a query switch, wherein the query switch is used for querying candidate character strings, and the state of the query switch includes a first query state and a second query state, the first query state being the state in which querying of candidate character strings is started when a first start identifier is detected, and the second query state being the state in which querying of candidate character strings is stopped when a first end identifier is detected; a second detecting unit, configured to detect, when the state of the query switch is detected to be the first query state, whether a position identifier exists on first sorting position data, wherein the first sorting position data is the current data in the position data set; a second creating unit, configured to create a first mark on the second start identifier when a position identifier is detected and the position identifier is a second start identifier; a judging unit, configured to judge, when a position identifier is detected and the position identifier is a second end identifier, whether the start-position mark of the candidate character string corresponding to the second end identifier is the first mark; a screening unit, configured to screen out the candidate character string when the start-position mark of the candidate character string corresponding to the second end identifier is the first mark; a second determining unit, configured to determine second sorting position data, wherein the second sorting position data is the data following the current data in the position data set; a third determining unit, configured to take the second sorting position data as the current data; and a fourth determining unit, configured to re-determine the first sorting position data according to the current data and repeat the step of detecting the state of the query switch until the position data set has been fully traversed.
In the embodiments of the present invention, the subordination relation between the first position data and the second position data is obtained, and the candidate character strings in the to-be-processed text are filtered according to that relation. This solves the problem that, in new word discovery tasks in the related art, the presence of invalid candidate character strings reduces the accuracy of new word discovery, and thereby improves the accuracy of new word discovery in new word discovery tasks.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions serve to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the method for processing character strings in new word discovery according to an embodiment of the present invention; and
Fig. 2 is a schematic diagram of the device for processing character strings in new word discovery according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, provided they do not conflict, the embodiments in this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in this application without creative effort shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of this application described herein can be implemented. In addition, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.
According to an embodiment of the present invention, a method for processing character strings in new word discovery is provided.
Fig. 1 is a flow chart of the method for processing character strings in new word discovery according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps S101 to S103:
Step S101: determine a to-be-processed text.
Specifically, in step S101, a to-be-processed text is determined, wherein the to-be-processed text includes at least one word-forming character string and at least one candidate character string, a word-forming character string being a character string that forms a new word in the to-be-processed text, and a candidate character string being a character string that forms a candidate new word in the to-be-processed text.
It should be noted that, before the system identifies a character string as a new word, it cannot distinguish which character strings are word-forming character strings and which are candidate character strings. Therefore, at the start of the statistics, all character strings are recorded in a single character string list. After the system identifies a character string as a new word, the original character string list is split in two: word-forming character strings and candidate character strings.
Step S102: obtain the subordination relation between the first position data and the second position data.
Specifically, in step S102, the first position data represents the position of a word-forming character string in the to-be-processed text, and the second position data represents the position of a candidate character string in the to-be-processed text. From the position data of the word-forming character strings in the to-be-processed text and the position data of the candidate character strings in the to-be-processed text, the relation between the first position data and the second position data is determined, and the first position data and second position data that stand in a subordination relation are then identified. In this way the subordination relation between the first position data and the second position data is obtained.
Step S103: filter the candidate character strings in the to-be-processed text according to the subordination relation between the first position data and the second position data.
In the method for processing character strings in new word discovery provided by the embodiment of the present invention, a to-be-processed text is determined, wherein the to-be-processed text includes at least one word-forming character string and at least one candidate character string, a word-forming character string being a character string that forms a new word in the to-be-processed text, and a candidate character string being a character string that forms a candidate new word in the to-be-processed text; the subordination relation between the first position data and the second position data is obtained, wherein the first position data represents the position of a word-forming character string in the to-be-processed text, and the second position data represents the position of a candidate character string in the to-be-processed text; and the candidate character strings in the to-be-processed text are filtered according to the subordination relation between the first position data and the second position data. This solves the problem that, in new word discovery tasks in the related art, the presence of invalid candidate character strings reduces the accuracy of new word discovery, and thereby improves the accuracy of new word discovery in new word discovery tasks.
Optionally, in order to improve the efficiency of filtering the candidate character strings in the to-be-processed text, in the method for processing character strings in new word discovery provided by the embodiment of the present invention, after the to-be-processed text is determined and before the subordination relation between the first position data and the second position data is obtained, the method also includes: obtaining a position list, wherein the position list is a list composed of the first position data and the second position data; sorting the position data in the position list according to a preset condition to obtain a position data set; and obtaining first sorting data of each word-forming character string and second sorting data of each candidate character string, wherein the first sorting data is the start and end positions of the corresponding word-forming character string in the position data set, and the second sorting data is the start and end positions of the corresponding candidate character string in the position data set. Obtaining the subordination relation between the first position data and the second position data includes: judging whether each second sorting data is contained in at least one first sorting data. Filtering the candidate character strings in the to-be-processed text according to the subordination relation between the first position data and the second position data includes: filtering out the candidate character strings corresponding to the second sorting data that is contained in the first sorting data.
In the method for processing character strings in new word discovery provided by the embodiment of the present invention, after the to-be-processed text is determined and before the subordination relation between the first position data and the second position data is obtained, the method also includes: creating position identifiers, the position identifiers including: a first start identifier for the start position of each word-forming character string in the to-be-processed text, a first end identifier for the end position of each word-forming character string in the to-be-processed text, a second start identifier for the start position of each candidate character string in the to-be-processed text, and a second end identifier for the end position of each candidate character string in the to-be-processed text.
For example, to distinguish word-forming character strings from candidate character strings, each word-forming character string is identified as Y and each candidate character string is identified as X. More specifically, to distinguish them by position, the position information of each word-forming character string and each candidate character string is identified as follows: the start position of each word-forming character string in the to-be-processed text is identified as Ys, the end position of each word-forming character string in the to-be-processed text is identified as Ye, the start position of each candidate character string in the to-be-processed text is identified as Xs, and the end position of each candidate character string in the to-be-processed text is identified as Xe.
After the position data in the position list is sorted according to the preset condition to obtain the position data set, the method also includes: detecting the state of a query switch, wherein the query switch is used for querying candidate character strings, and the state of the query switch includes a first query state and a second query state, the first query state being the state in which querying of candidate character strings is started when a first start identifier is detected, and the second query state being the state in which querying of candidate character strings is stopped when a first end identifier is detected; if the state of the query switch is detected to be the first query state, detecting whether a position identifier exists on first sorting position data, wherein the first sorting position data is the current data in the position data set; when a position identifier is detected and the position identifier is a second start identifier, creating a first mark on the second start identifier; when a position identifier is detected and the position identifier is a second end identifier, judging whether the start-position mark of the candidate character string corresponding to the second end identifier is the first mark; if the start-position mark of the candidate character string corresponding to the second end identifier is the first mark, screening out the candidate character string; determining second sorting position data, wherein the second sorting position data is the data following the current data in the position data set; taking the second sorting position data as the current data; and re-determining the first sorting position data according to the current data and repeating the step of detecting the state of the query switch until the position data set has been fully traversed.
It should be noted that, when detecting whether a position identifier exists on the first sorting position data, if no position identifier exists on the first sorting position data, the second sorting position data is determined and taken as the current data; the first sorting position data is then re-determined according to the current data, and the step of detecting the state of the query switch is repeated until the position data set has been fully traversed.
In addition, screening out candidate character strings may be implemented by the following steps: determining that at least one word-forming character string contains the candidate character string corresponding to at least one second end identifier; and filtering out the candidate character strings corresponding to the at least one second end identifier.
When detecting whether a position identifier exists on the first sorting position data, the method also includes: when a position identifier is detected and the position identifier is a first end identifier, changing the state of the query switch to the second query state; and when the state of the query switch changes from the first query state to the second query state, if the start position of at least one candidate character string is marked with the first mark and the second end identifier of that candidate character string has not been detected, changing the first mark of that candidate character string to a second mark, wherein the second mark indicates that the start position of the candidate character string in the to-be-processed text has not been detected.
After the state of the query switch is detected, the method also includes: if the state of the query switch is detected to be the second query state, searching the position data set for the first start identifier of the next word-forming character string; when the first start identifier of the next word-forming character string is found, changing the state of the query switch to the first query state; and detecting again whether a position identifier exists on the first sorting position data, until the position data set has been fully traversed.
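The switch-driven traversal described above can be sketched in Python as follows. This is a minimal illustration under several assumptions that the source does not state: positions are inclusive character offsets; entries at the same offset are sorted in the order first start, second start, second end, first end, so that a candidate sharing a boundary with a word-forming string still counts as contained; and word-forming strings do not overlap one another. The names filter_contained, is_check and on_start are illustrative only.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), inclusive offsets (assumption of this sketch)

# Assumed tie-break order at equal offsets: Ys, Xs, Xe, Ye.
_RANK = {"Ys": 0, "Xs": 1, "Xe": 2, "Ye": 3}


def filter_contained(words: List[Span], candidates: List[Span]) -> List[Span]:
    """Return the candidates that are not contained in any word-forming span."""
    events = []
    for start, end in words:
        events.append((start, _RANK["Ys"], "Ys", -1))
        events.append((end, _RANK["Ye"], "Ye", -1))
    for idx, (start, end) in enumerate(candidates):
        events.append((start, _RANK["Xs"], "Xs", idx))
        events.append((end, _RANK["Xe"], "Xe", idx))
    events.sort()  # the sorted position data set

    is_check = False                       # query switch, initially off
    on_start = [False] * len(candidates)   # first mark per candidate
    removed = set()
    for _pos, _rank, tag, idx in events:
        if is_check:
            if tag == "Xs":
                on_start[idx] = True       # candidate starts inside the word
            elif tag == "Xe" and on_start[idx]:
                removed.add(idx)           # start and end both inside: contained
            elif tag == "Ye":
                is_check = False           # word ends: stop querying
                on_start = [False] * len(candidates)  # open candidates only interlace
        elif tag == "Ys":
            is_check = True                # next word starts: resume querying
    return [c for i, c in enumerate(candidates) if i not in removed]
```

After the sort, each position is visited once, so the traversal itself is linear in the number of start and end positions; the per-word reset of on_start is kept simple here and could be replaced by tracking only the candidates currently open.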
Specifically, the positional relation between the first position data and the second position data, which describes the relation between a word-forming character string and a candidate character string, is one of three kinds: containment, interlaced, and disjoint. Filtering the candidate character strings in the to-be-processed text according to the subordination relation between the first position data and the second position data means filtering out the candidate character strings that satisfy the containment relation.
The containment relation means that the start position of the candidate character string is at or after the start position of the word-forming character string and its end position is at or before the end position of the word-forming character string, that is, the start and end positions of the candidate character string both lie between the start and end positions of the word-forming character string: "Xs>=Ys && Xe<=Ye". In this case the candidate character string is a substring of the word-forming character string.
The interlaced relation means that exactly one of the start and end positions of the candidate character string falls between the start and end positions of the word-forming character string, while the other lies outside the range of the word-forming character string: "(Xs>=Ys && Xe>Ye) || (Xs<Ys && Xe<=Ye)". In this case the candidate character string is not a substring of the word-forming character string.
The disjoint relation means that neither the start position nor the end position of the candidate character string lies within the range of the word-forming character string: "(Xs<Ys && Xe<Ys) || (Xs>Ye && Xe>Ye)". In this case the candidate character string is not a substring of the word-forming character string.
In the embodiment of the present invention, a single marking traversal quickly determines whether each candidate character string has a containment-type subordination relation with a word-forming character string.
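As an illustration of the three relations, the following Python sketch classifies one candidate span against one word-forming span. It follows the prose definitions by counting how many of the candidate's two endpoints fall inside the word-forming range, rather than transcribing the inequalities literally, since the interlaced inequality as written is also satisfied by some spans that lie entirely outside the range. The function name classify and the inclusive-offset convention are assumptions of this sketch.

```python
def classify(xs: int, xe: int, ys: int, ye: int) -> str:
    """Classify candidate [xs, xe] against word-forming span [ys, ye].

    Offsets are treated as inclusive (an assumption of this sketch).
    Two endpoints inside the range -> contained, one -> interlaced,
    none -> disjoint.
    """
    inside = sum(ys <= p <= ye for p in (xs, xe))
    return ("disjoint", "interlaced", "contained")[inside]
```

Only candidates classified as contained would be filtered out; interlaced and disjoint candidates are kept.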
A concrete method may be as follows: detect the state of the query switch (IsCheck), which is used to judge whether the mode for checking candidate character strings is currently on; IsCheck is initialized to false (i.e., the second query state above). Detect the start-position match mark OnStart, which is used to mark that the start position of a candidate character string has been matched. Traverse the sorted data (i.e., the data in the position data set above) and perform the following operations:
The state of detection inquiry switch, if state inquiry switch is detected is the first inquiry state, that is,
IsCheck=true (i.e. the first above-mentioned inquiry state).
If(IsCheck)
If { current point (i.e. the first above-mentioned sorting position data, i.e. current data in the data acquisition system of position) is labeled
For Xs (i.e. above-mentioned second starts to identify), then it represents that finding the starting position of candidate character strings, inquiry is current
Candidate character strings corresponding to Xs, and indicate original position matched indicia OnStart=true (i.e. the first above-mentioned labelling);
If current point is marked as Xe (i.e. the second above-mentioned end of identification) then it represents that finding candidate character strings
End position, inquires about the original position matched indicia of the candidate character strings corresponding to current Xe, if OnStart=true,
Then represent and matched this candidate character strings, its relation belongs to inclusion relation;If conversely, OnStart=false is (i.e.
The second above-mentioned labelling) then it represents that relation belongs to false relation;
If current point is marked as Ye (i.e. the first above-mentioned end of identification) then it represents that time for this one-tenth word character string
Select character string poll-final, IsCheck is placed in false state, and close all original position matched indicias
The candidate character strings of OnStart=true, their OnStart is set to false, and these candidate character strings belong to interlock
Relation;
}
Else
{ whether test point is marked as Ys (i.e. above-mentioned first starts to identify), if it is, representing that current location is
The beginning of one new one-tenth word character string, starts to check candidate character strings, IsCheck=true therewith;
}
End if
All candidate character strings not being judged, belong to disjoint relationship.
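The traversal described above can be sketched as a single sweep over sorted boundary events. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `sweep`, the inclusive integer spans, the tie-break order at equal positions (Ys, Xs, Xe, Ye), and the use of a set in place of per-candidate OnStart flags are all choices made here. It also assumes the word-forming spans do not overlap one another.

```python
# Tie-break order for events at the same position (an assumption of this sketch):
# a word-forming start opens before candidates start, and closes after they end.
YS, XS, XE, YE = 0, 1, 2, 3


def sweep(words, candidates):
    """Return the relation of each candidate after one pass over the sorted events.

    words, candidates: lists of (start, end) spans with inclusive offsets.
    """
    events = []
    for start, end in words:
        events.append((start, YS, -1))
        events.append((end, YE, -1))
    for i, (start, end) in enumerate(candidates):
        events.append((start, XS, i))
        events.append((end, XE, i))
    events.sort()  # sort by position, then by tag, then by candidate index

    relation = ["disjoint"] * len(candidates)  # unjudged candidates stay disjoint
    is_check = False   # query switch, initially in the second query state
    on_start = set()   # candidates whose OnStart mark is currently true

    for _pos, tag, idx in events:
        if is_check:
            if tag == XS:
                on_start.add(idx)                 # OnStart = true
            elif tag == XE:
                if idx in on_start:
                    on_start.discard(idx)
                    relation[idx] = "inclusion"   # both ends seen inside the word
                else:
                    relation[idx] = "staggered"   # only the end is inside
            elif tag == YE:
                is_check = False
                for i in on_start:                # started inside, ended outside
                    relation[i] = "staggered"
                on_start.clear()                  # reset OnStart to false
        elif tag == YS:
            is_check = True                       # start checking candidates
    return relation
```

Filtering then amounts to dropping every candidate whose relation is "inclusion". With n word-forming strings and m candidates, the sweep itself is linear in n + m once the events are sorted.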
It should be noted that a word-forming character string nested inside another word-forming character string cannot occur during the traversal, i.e. the sequence {Ys1, Ys2, ..., Ye2, Ye1} cannot occur. This is because, during new word discovery, the statistical judgment of new words in each round is constrained by word length: in each step of filtering candidate character strings, all word-forming character strings in the word-forming character string list have the same length, so the above situation does not arise.
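The no-nesting argument above can be spelled out as a short derivation. It assumes, for illustration only, that positions are inclusive character offsets and that \ell denotes the common length of the word-forming character strings in one filtering round; the symbol \ell does not appear in the original text.

```latex
Y_{e,i} = Y_{s,i} + \ell - 1 \quad (i = 1, 2), \qquad
Y_{s,1} < Y_{s,2}
\;\Longrightarrow\;
Y_{e,1} = Y_{s,1} + \ell - 1 \;<\; Y_{s,2} + \ell - 1 = Y_{e,2}.
```

Nesting would require Y_{s,1} < Y_{s,2} and Y_{e,2} < Y_{e,1} at the same time, which contradicts Y_{e,1} < Y_{e,2}.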
In the method for processing character strings in new word discovery provided by the embodiment of the present invention, a single traversal completes the relation matching between n word-forming character strings and m candidate character strings. This quickly and effectively reduces the order of magnitude of the candidate character strings in new word discovery and speeds up new word discovery. It filters out interfering candidate character strings without affecting the possibility that a sub-candidate character string is statistically identified when it occurs independently as a word, which improves the accuracy of new word discovery. It thereby solves the problem in the related art that invalid candidate character strings in new word discovery tasks affect the accuracy of new word discovery.
It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
An embodiment of the present invention further provides a device for processing character strings in new word discovery. It should be noted that the device of the embodiment of the present invention can be used to execute the method for processing character strings in new word discovery provided by the embodiment of the present invention. The device is described below.
Fig. 2 is a schematic diagram of a device for processing character strings in new word discovery according to an embodiment of the present invention. As shown in Fig. 2, the device includes a first determining unit 10, a first acquiring unit 20 and a processing unit 30.
The first determining unit 10 is configured to determine a to-be-processed text, wherein the to-be-processed text includes at least one word-forming character string and at least one candidate character string, the word-forming character string is a character string used to form a new word in the to-be-processed text, and the candidate character string is a character string used to form a candidate new word in the to-be-processed text.
The first acquiring unit 20 is configured to obtain the membership relation between first position data and second position data, wherein the first position data represents the position of the word-forming character string in the to-be-processed text, and the second position data represents the position of the candidate character string in the to-be-processed text.
The processing unit 30 is configured to filter the candidate character strings in the to-be-processed text according to the membership relation between the first position data and the second position data.
In the device for processing character strings in new word discovery provided by the embodiment of the present invention, the first determining unit 10 determines a to-be-processed text, wherein the to-be-processed text includes at least one word-forming character string and at least one candidate character string, the word-forming character string is a character string used to form a new word in the to-be-processed text, and the candidate character string is a character string used to form a candidate new word in the to-be-processed text; the first acquiring unit 20 obtains the membership relation between first position data and second position data, wherein the first position data represents the position of the word-forming character string in the to-be-processed text and the second position data represents the position of the candidate character string in the to-be-processed text; and the processing unit 30 filters the candidate character strings in the to-be-processed text according to the membership relation between the first position data and the second position data. This solves the problem in the related art that invalid candidate character strings in new word discovery tasks affect the accuracy of new word discovery, and thereby improves the accuracy of new word discovery in such tasks.
Optionally, in order to improve the efficiency of filtering candidate character strings in the to-be-processed text, the device for processing character strings in new word discovery provided by the embodiment of the present invention further includes: a second acquiring unit configured to obtain a position list, wherein the position list is a list composed of the first position data and the second position data; a sorting unit configured to sort the position data in the position list according to a preset condition to obtain a position data set; and a third acquiring unit configured to obtain first sorted data of each word-forming character string and second sorted data of each candidate character string, wherein the first sorted data is the data of the start and end positions of the corresponding word-forming character string in the position data set, and the second sorted data is the data of the start and end positions of the corresponding candidate character string in the position data set. The first acquiring unit is further configured to judge whether each piece of second sorted data is contained in at least one piece of first sorted data, and the processing unit is further configured to filter the candidate character strings corresponding to the second sorted data contained in the first sorted data.
Optionally, the device for processing character strings in new word discovery provided by the embodiment of the present invention further includes: a first creating unit configured to create position identifiers, the position identifiers including: a first start identifier for the start position of each word-forming character string in the to-be-processed text, a first end identifier for the end position of each word-forming character string in the to-be-processed text, a second start identifier for the start position of each candidate character string in the to-be-processed text, and a second end identifier for the end position of each candidate character string in the to-be-processed text.
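The position list, the four position identifiers and the sorting step can be sketched as follows. This is only an illustration under assumptions: the names `Tag`, `Position` and `build_position_set` are hypothetical, positions are taken to be inclusive character offsets, and the "preset condition" is assumed here to be ascending position with ties broken in the order Ys, Xs, Xe, Ye; the original text does not specify the condition.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tag(IntEnum):
    """Position identifiers; the numeric order is a hypothetical tie-break."""
    YS = 0  # first start identifier: start of a word-forming character string
    XS = 1  # second start identifier: start of a candidate character string
    XE = 2  # second end identifier: end of a candidate character string
    YE = 3  # first end identifier: end of a word-forming character string


@dataclass(frozen=True, order=True)
class Position:
    offset: int  # character offset in the to-be-processed text (inclusive)
    tag: Tag     # which kind of boundary this entry marks
    owner: int   # index of the string the boundary belongs to


def build_position_set(words, candidates):
    """Merge word-forming and candidate spans into one sorted list of positions."""
    positions = []
    for i, (start, end) in enumerate(words):
        positions.append(Position(start, Tag.YS, i))
        positions.append(Position(end, Tag.YE, i))
    for i, (start, end) in enumerate(candidates):
        positions.append(Position(start, Tag.XS, i))
        positions.append(Position(end, Tag.XE, i))
    # order=True makes dataclass instances compare as (offset, tag, owner) tuples
    return sorted(positions)
```

With this tie-break, a candidate that starts exactly where a word-forming character string starts is visited after that string has been opened, and a candidate that ends exactly where it ends is visited before it is closed, which matches the inclusive condition "Xs >= Ys && Xe <= Ye".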
Optionally, the device for processing character strings in new word discovery provided by the embodiment of the present invention further includes: a first detecting unit configured to detect the state of a query switch, wherein the query switch is used to query candidate character strings, the state of the query switch includes a first query state and a second query state, the first query state indicates that querying of candidate character strings is turned on when a first start identifier is detected, and the second query state indicates that querying of candidate character strings is stopped when a first end identifier is detected; a second detecting unit configured to detect, when the state of the query switch is detected to be the first query state, whether the first sorted position data has a position identifier, wherein the first sorted position data is the current data in the position data set; a second creating unit configured to create a first mark on the second start identifier when a position identifier is detected and the position identifier is a second start identifier; a judging unit configured to judge, when a position identifier is detected and the position identifier is a second end identifier, whether the start position mark of the candidate character string corresponding to the second end identifier is the first mark; a screening unit configured to screen out the candidate character string when the start position mark of the candidate character string corresponding to the second end identifier is the first mark; a second determining unit configured to determine second sorted position data, wherein the second sorted position data is the data following the current data in the position data set; a third determining unit configured to take the second sorted position data as the current data; and a fourth determining unit configured to redetermine the first sorted position data according to the current data and repeat the step of detecting the state of the query switch until the position data set has been fully traversed.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
It should be understood that the devices disclosed in the several embodiments provided in this application may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented with a general-purpose computing device. They may be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be made into a separate integrated circuit module, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. A method for processing character strings in new word discovery, characterized by comprising:
determining a to-be-processed text, wherein the to-be-processed text comprises at least one word-forming character string and at least one candidate character string, the word-forming character string is a character string used to form a new word in the to-be-processed text, and the candidate character string is a character string used to form a candidate new word in the to-be-processed text;
obtaining a membership relation between first position data and second position data, wherein the first position data is data representing the position of the word-forming character string in the to-be-processed text, and the second position data is data representing the position of the candidate character string in the to-be-processed text; and
filtering the candidate character string in the to-be-processed text according to the membership relation between the first position data and the second position data.
2. The method according to claim 1, characterized in that, after determining the to-be-processed text and before obtaining the membership relation between the first position data and the second position data, the method further comprises:
obtaining a position list, wherein the position list is a list composed of the first position data and the second position data;
sorting the position data in the position list according to a preset condition to obtain a position data set; and
obtaining first sorted data of each word-forming character string and second sorted data of each candidate character string, wherein the first sorted data is the data of the start and end positions of the corresponding word-forming character string in the position data set, and the second sorted data is the data of the start and end positions of the corresponding candidate character string in the position data set,
wherein obtaining the membership relation between the first position data and the second position data comprises: judging whether each piece of second sorted data is contained in at least one piece of first sorted data, and
filtering the candidate character string in the to-be-processed text according to the membership relation between the first position data and the second position data comprises: filtering the candidate character strings corresponding to the second sorted data contained in the first sorted data.
3. The method according to claim 2, characterized in that, after determining the to-be-processed text and before obtaining the membership relation between the first position data and the second position data, the method further comprises:
creating position identifiers, the position identifiers comprising: a first start identifier for the start position of each word-forming character string in the to-be-processed text, a first end identifier for the end position of each word-forming character string in the to-be-processed text, a second start identifier for the start position of each candidate character string in the to-be-processed text, and a second end identifier for the end position of each candidate character string in the to-be-processed text.
4. The method according to claim 3, characterized in that, after sorting the position data in the position list according to the preset condition to obtain the position data set, the method further comprises:
detecting the state of a query switch, wherein the query switch is used to query the candidate character strings, the state of the query switch comprises a first query state and a second query state, the first query state indicates that querying of the candidate character strings is turned on when a first start identifier is detected, and the second query state indicates that querying of the candidate character strings is stopped when a first end identifier is detected;
if the state of the query switch is detected to be the first query state, detecting whether first sorted position data has a position identifier, wherein the first sorted position data is the current data in the position data set;
when a position identifier is detected and the position identifier is a second start identifier, creating a first mark on the second start identifier;
when a position identifier is detected and the position identifier is a second end identifier, judging whether the start position mark of the candidate character string corresponding to the second end identifier is the first mark; if the start position mark of the candidate character string corresponding to the second end identifier is the first mark, screening out the candidate character string;
determining second sorted position data, wherein the second sorted position data is the data following the current data in the position data set;
taking the second sorted position data as the current data; and
redetermining the first sorted position data according to the current data, and repeating the step of detecting the state of the query switch until the position data set has been fully traversed.
5. The method according to claim 4, characterized in that, when detecting whether the first sorted position data has a position identifier, the method further comprises:
when a position identifier is detected and the position identifier is a first end identifier, changing the state of the query switch to the second query state;
when the state of the query switch is changed from the first query state to the second query state, if the start position of at least one candidate character string is marked with the first mark and the second end identifier of that candidate character string has not been detected, changing the first mark of that candidate character string to a second mark, wherein the second mark indicates that the start position of the candidate character string has not been detected in the to-be-processed text.
6. The method according to claim 4, characterized in that, after detecting the state of the query switch, the method further comprises:
if the state of the query switch is detected to be the second query state, searching the position data set for the first start identifier of the next word-forming character string;
when the first start identifier of the next word-forming character string is found, changing the state of the query switch to the first query state; and
detecting again whether the first sorted position data has the position identifier, until the position data set has been fully traversed.
7. A device for processing character strings in new word discovery, characterized by comprising:
a first determining unit configured to determine a to-be-processed text, wherein the to-be-processed text comprises at least one word-forming character string and at least one candidate character string, the word-forming character string is a character string used to form a new word in the to-be-processed text, and the candidate character string is a character string used to form a candidate new word in the to-be-processed text;
a first acquiring unit configured to obtain a membership relation between first position data and second position data, wherein the first position data is data representing the position of the word-forming character string in the to-be-processed text, and the second position data is data representing the position of the candidate character string in the to-be-processed text; and
a processing unit configured to filter the candidate character string in the to-be-processed text according to the membership relation between the first position data and the second position data.
8. The device according to claim 7, characterized in that the device further comprises:
a second acquiring unit configured to obtain a position list, wherein the position list is a list composed of the first position data and the second position data;
a sorting unit configured to sort the position data in the position list according to a preset condition to obtain a position data set; and
a third acquiring unit configured to obtain first sorted data of each word-forming character string and second sorted data of each candidate character string, wherein the first sorted data is the data of the start and end positions of the corresponding word-forming character string in the position data set, and the second sorted data is the data of the start and end positions of the corresponding candidate character string in the position data set,
wherein the first acquiring unit is further configured to judge whether each piece of second sorted data is contained in at least one piece of first sorted data, and
the processing unit is further configured to filter the candidate character strings corresponding to the second sorted data contained in the first sorted data.
9. The device according to claim 8, characterized in that the device further comprises:
a first creating unit configured to create position identifiers, the position identifiers comprising: a first start identifier for the start position of each word-forming character string in the to-be-processed text, a first end identifier for the end position of each word-forming character string in the to-be-processed text, a second start identifier for the start position of each candidate character string in the to-be-processed text, and a second end identifier for the end position of each candidate character string in the to-be-processed text.
10. The device according to claim 9, characterized in that the device further comprises:
a first detecting unit configured to detect the state of a query switch, wherein the query switch is used to query the candidate character strings, the state of the query switch comprises a first query state and a second query state, the first query state indicates that querying of the candidate character strings is turned on when a first start identifier is detected, and the second query state indicates that querying of the candidate character strings is stopped when a first end identifier is detected;
a second detecting unit configured to detect, when the state of the query switch is detected to be the first query state, whether first sorted position data has a position identifier, wherein the first sorted position data is the current data in the position data set;
a second creating unit configured to create a first mark on the second start identifier when a position identifier is detected and the position identifier is a second start identifier;
a judging unit configured to judge, when a position identifier is detected and the position identifier is a second end identifier, whether the start position mark of the candidate character string corresponding to the second end identifier is the first mark;
a screening unit configured to screen out the candidate character string when the start position mark of the candidate character string corresponding to the second end identifier is the first mark;
a second determining unit configured to determine second sorted position data, wherein the second sorted position data is the data following the current data in the position data set;
a third determining unit configured to take the second sorted position data as the current data; and
a fourth determining unit configured to redetermine the first sorted position data according to the current data and repeat the step of detecting the state of the query switch until the position data set has been fully traversed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510463437.3A CN106407175A (en) | 2015-07-31 | 2015-07-31 | Method and device for processing character strings in new word discovery |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510463437.3A CN106407175A (en) | 2015-07-31 | 2015-07-31 | Method and device for processing character strings in new word discovery |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106407175A true CN106407175A (en) | 2017-02-15 |
Family
ID=58007938
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510463437.3A Pending CN106407175A (en) | 2015-07-31 | 2015-07-31 | Method and device for processing character strings in new word discovery |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106407175A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463682A (en) * | 2017-08-08 | 2017-12-12 | 深圳市腾讯计算机系统有限公司 | A kind of recognition methods of keyword and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101464898A (en) * | 2009-01-12 | 2009-06-24 | 腾讯科技(深圳)有限公司 | Method for extracting feature word of text |
CN101751386A (en) * | 2009-12-28 | 2010-06-23 | 华建机器翻译有限公司 | Identification method of unknown words |
CN101950306A (en) * | 2010-09-29 | 2011-01-19 | 北京新媒传信科技有限公司 | Method for filtering character strings in process of discovering new words |
CN102831194A (en) * | 2012-08-03 | 2012-12-19 | 人民搜索网络股份公司 | New word automatic searching system and new word automatic searching method based on query log |
CN102955771A (en) * | 2011-08-18 | 2013-03-06 | 华东师范大学 | Technology and system for automatically recognizing Chinese new words in single-word-string mode and affix mode |
CN104679738A (en) * | 2013-11-27 | 2015-06-03 | 北京拓尔思信息技术股份有限公司 | Method and device for mining Internet hot words |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | C10 | Entry into substantive examination | |
 | SE01 | Entry into force of request for substantive examination | |
 | CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170215 |