CN105988996A - Index file generation method and device - Google Patents
- Publication number
- CN105988996A (application CN201510039519.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an index file generation method and device. The method includes: acquiring business data, the business data including data content and a business type; acquiring a corresponding configuration file according to the business type, the configuration file including a field preprocessing instruction and a word segmentation instruction; preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content; performing word segmentation on the preprocessed data content according to the word segmentation instruction; and performing forward-index processing on the segmented data content to generate an index file in a unified data format. The method and device can build index files for data of multiple business types in a unified way, simplifying the building process and improving efficiency.
Description
Technical field
The invention belongs to the field of communication technology, and in particular relates to an index file generation method and device.
Background
With the rapid development of computer and Internet technology, the amount of information stored on network devices keeps growing. To make this information easier to query, index files are usually built to help users access it.
In the prior art, a forward index file is usually generated for the business type of the data to be retrieved, and the forward index file is then inverted to obtain an inverted index file, so that users can retrieve data of that business type through the inverted index file. Because data of different business types differ in factors such as the keywords involved, the prior art requires a separate index generation system for each business type in order to generate indexes for user retrieval.
In researching and practicing the prior art, the inventors found that an existing index generation system can serve only one business type. Where there are many business types, multiple index generation systems must be built. Building such a system demands a high level of expertise from operators, and the whole process is time-consuming and inefficient.
Summary of the invention
The object of the invention is to provide an index file generation method and device that can build index files for multiple business data types in a unified way, simplifying the building process and improving efficiency.
To solve the above technical problem, embodiments of the invention provide the following technical solutions:
A first aspect of the invention provides an index file generation method, the method comprising:
acquiring business data, the business data including data content and a business type;
acquiring a corresponding configuration file according to the business type, the configuration file including a field preprocessing instruction and a word segmentation instruction;
preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content;
performing word segmentation on the preprocessed data content according to the word segmentation instruction; and
performing forward-index processing on the segmented data content to generate an index file in a unified data format.
To solve the above technical problem, embodiments of the invention further provide the following technical solutions:
A second aspect of the invention provides an index file generation device, the device comprising:
a first acquisition module, configured to acquire business data, the business data including data content and a business type;
a second acquisition module, configured to acquire a corresponding configuration file according to the business type, the configuration file including a field preprocessing instruction and a word segmentation instruction;
a preprocessing module, configured to preprocess the data content according to the field preprocessing instruction to generate preprocessed data content;
a word segmentation module, configured to perform word segmentation on the preprocessed data content according to the word segmentation instruction; and
an index generation module, configured to perform forward-index processing on the segmented data content to generate an index file in a unified data format.
Compared with the prior art, this embodiment acquires the corresponding configuration file according to the business type of each kind of business data, preprocesses the data content of the business data according to the field preprocessing instruction in the configuration file, and then performs word segmentation on the preprocessed data content according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The invention processes data of different business types with their corresponding configuration files and then segments the data content with the same program, normalizing business data in different formats into index data in a unified data format. Index files can therefore be built for multiple business data types in a unified way, which simplifies the building process and improves efficiency.
Brief description of the drawings
The technical solutions and other beneficial effects of the invention will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawings.
Fig. 1 is a schematic flowchart of the index file generation method provided by the first embodiment of the invention;
Fig. 2a is a schematic flowchart of the index file generation method provided by the second embodiment of the invention;
Fig. 2b and Fig. 2c are schematic diagrams of the field configuration interfaces of the index file generation method provided by the invention;
Fig. 3a and Fig. 3b are schematic flowcharts of the index file generation method provided by the third embodiment of the invention;
Fig. 4 is a schematic structural diagram of the index file generation device provided by the fourth embodiment of the invention;
Fig. 5 is a schematic structural diagram of the index file generation device provided by the fifth embodiment of the invention;
Fig. 6 is a schematic structural diagram of the server provided by the sixth embodiment of the invention.
Detailed description
Referring to the drawings, in which like reference numerals denote like components, the principles of the invention are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the invention and should not be construed as limiting other specific embodiments not detailed herein.
In the following description, specific embodiments of the invention are described with reference to steps and symbols performed by one or more computers, unless stated otherwise. These steps and operations are therefore referred to several times as being computer-executed. Computer execution as used herein includes operations by a computer processing unit on electronic signals that represent data in a structured form. Such an operation transforms the data or maintains it at a location in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well known to those skilled in the art. The data structures in which the data is maintained are physical locations in memory that have particular properties defined by the data format. However, describing the principles of the invention in these terms is not meant as a limitation, and those skilled in the art will appreciate that various steps and operations described below may also be implemented in hardware.
The principles of the invention operate with many other general-purpose or special-purpose computing and communication environments or configurations. Examples of well-known computing systems, environments, and configurations suitable for the invention include, but are not limited to, handheld phones, personal computers, servers, multiprocessor systems, microcomputer-based systems, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The term "module" as used herein may be regarded as a software object executed on the computing system. The different components, modules, engines, and services described herein may be regarded as objects implemented on the computing system. The devices and methods described herein are preferably implemented in software, but may also be implemented in hardware; both fall within the scope of protection of the invention.
The word "preferably" as used herein means serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as more advantageous than other aspects or designs. Rather, the word "preferably" is intended to present concepts in a concrete manner. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X uses A or B" means any of the natural inclusive permutations: if X uses A, X uses B, or X uses both A and B, then "X uses A or B" is satisfied in any of these instances.
Moreover, although the disclosure has been shown and described with respect to one or more implementations, equivalent variations and modifications will occur to those skilled in the art upon reading and understanding the specification and drawings. The disclosure includes all such modifications and variations and is limited only by the scope of the appended claims. In particular, with regard to the various functions performed by the above-described components (e.g., elements, resources, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component that performs the specified function of the described component (i.e., that is functionally equivalent), even if it is not structurally equivalent to the disclosed structure that performs the function in the exemplary implementations shown herein. In addition, although a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such a feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes", "having", "contains", or variants thereof are used in the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".
First embodiment
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the index file generation method provided by the first embodiment of the invention. The method includes the following steps:
In step S101, business data is acquired; the business data includes data content and a business type.
The index file generation method is based on a B/S (browser/server) architecture: users work with the system through a browser, and the system supports generating index data in a unified data format for data of multiple business types on the same platform.
In this embodiment, the business type may include video, music, pictures, and so on; correspondingly, the business data may include video data, music data, picture data, and so on, which is not specifically limited here.
It can be understood that the data format of the business data in this embodiment may be divided into two parts: one part carries the business type indication information, and the other part carries the data content corresponding to that business type.
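The two-part business data format described above can be pictured with a minimal Python sketch. The class name `BusinessData`, the function `parse_business_data`, and the keys `"type"` and `"content"` are hypothetical names chosen for illustration; the patent does not specify a wire format.

```python
from dataclasses import dataclass


@dataclass
class BusinessData:
    business_type: str  # business type indication, e.g. "video", "music", "picture"
    content: dict       # data content: field name -> raw value


def parse_business_data(raw: dict) -> BusinessData:
    """Split a raw record into its business type indication and its data content."""
    return BusinessData(business_type=raw["type"], content=raw["content"])


record = parse_business_data(
    {"type": "music", "content": {"title": "Song A", "downloads": "120"}}
)
```

Keeping the type indication separate from the content is what lets a later step pick a configuration file without inspecting the content itself.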
In step S102, a corresponding configuration file is acquired according to the business type; the configuration file includes a field preprocessing instruction and a word segmentation instruction.
It can be understood that each business type corresponds to one configuration file, which the user configures in advance according to the characteristics of the business type in actual operation and stores on the server.
The configuration file contains an instruction for preprocessing the fields of the data content and an instruction for segmenting the fields of the data content. The configuration file is generated from the user's configuration of the fields of each kind of business data; how the fields are configured is not specifically limited here.
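A minimal sketch of step S102, assuming the preconfigured files have already been loaded into an in-memory dictionary keyed by business type. The names `CONFIGS` and `get_config` and the keys `"preprocess"` and `"segmentation"` are assumptions for illustration, not part of the patent.

```python
# Hypothetical preconfigured files, one per business type.
CONFIGS = {
    "music": {"preprocess": ["strip"], "segmentation": "normal"},
    "video": {"preprocess": ["strip", "lowercase"], "segmentation": "prefix"},
}


def get_config(business_type: str) -> dict:
    """Return the configuration registered for a business type."""
    try:
        return CONFIGS[business_type]
    except KeyError:
        raise ValueError(f"no configuration file for business type {business_type!r}")
```

Failing loudly on an unknown business type is a design choice of this sketch; the patent does not say how a missing configuration is handled.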
In step S103, the data content is preprocessed according to the field preprocessing instruction to generate preprocessed data content.
In step S104, word segmentation is performed on the preprocessed data content according to the word segmentation instruction.
In step S105, forward-index processing is performed on the segmented data content to generate an index file in a unified data format.
It can be understood that steps S103 to S105 may specifically proceed as follows:
Because each business type corresponds to one configuration file, each business type has its own field preprocessing instruction. The data content of each business type is preprocessed according to its corresponding instruction, which allows for personalized, differentiated operations across business types. After preprocessing, the data is processed according to the platform's preset word segmentation templates and the preset word segmentation instruction, which is a normalization operation: business data in different formats is sent to the forward-index processing unit (FSU, Forward Sort Unit) for forward index generation and is normalized into a unified data format, yielding normalized forward-index data suited to data retrieval across multiple business types.
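Steps S103 to S105 can be sketched end to end as follows. This is an illustration under stated assumptions, not the patented implementation: whitespace splitting stands in for a real word segmenter, and `PREPROCESS_OPS`, `preprocess`, `segment`, and `generate_index_entry` are hypothetical names.

```python
# Hypothetical preprocessing operations a configuration file may refer to.
PREPROCESS_OPS = {"strip": str.strip, "lowercase": str.lower}


def preprocess(content, steps):
    """S103: apply the configured preprocessing steps to every field value."""
    out = dict(content)
    for step in steps:
        out = {name: PREPROCESS_OPS[step](value) for name, value in out.items()}
    return out


def segment(text):
    """S104: stand-in segmenter that splits on whitespace."""
    return text.split()


def generate_index_entry(doc_id, content, config):
    """S105: emit one forward-index entry in a unified shape for any business type."""
    cleaned = preprocess(content, config["preprocess"])
    segmented = {name: segment(value) for name, value in cleaned.items()}
    return {"doc_id": doc_id, "fields": segmented}


music = generate_index_entry(1, {"title": "  Blue Sky "}, {"preprocess": ["strip"]})
video = generate_index_entry(
    2, {"title": "Evening NEWS"}, {"preprocess": ["strip", "lowercase"]}
)
```

The two calls use different preprocessing configurations but produce entries with the same structure, which is the normalization the paragraph above describes.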
As can be seen from the above, this embodiment acquires the corresponding configuration file according to the business type of each kind of business data, preprocesses the data content of the business data according to the field preprocessing instruction in the configuration file, and then performs word segmentation on the preprocessed data content according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The invention processes data of different business types with their corresponding configuration files and then segments the data content with the same program, normalizing business data in different formats into index data in a unified data format. Index files can therefore be built for multiple business data types in a unified way, which simplifies the building process and improves efficiency.
Second embodiment
Referring to Fig. 2, Fig. 2 is a schematic flowchart of the index file generation method provided by the second embodiment of the invention. The index file generation method provided by the invention is based on a B/S (browser/server) architecture: users work with the system through a browser, and the system supports generating index data in a unified data format for data of multiple business types on the same platform.
This embodiment mainly analyzes the generation of the configuration file, that is, the configuration of field attribute values. The method includes the following steps:
In step S201, configuration files corresponding to the different business types are generated.
It can be understood that each business type corresponds to one configuration file. The business type may include video, music, pictures, and so on; correspondingly, the business data includes video data, music data, and picture data.
In this embodiment, the configuration file is configured in advance by the user according to the characteristics of the business type in actual operation and is stored on the server. The configuration file contains an instruction for preprocessing the fields of the data content and an instruction for segmenting the fields of the data content.
In a preferred embodiment, the configuration file may be obtained through the following steps:
Step (1): acquire the field configuration information corresponding to the business type.
The field configuration information indicates the attribute values of multiple preset fields; the fields include text-domain fields, numeric-domain fields, and classification-domain fields.
It can be understood that the business data of the invention includes data content and a business type. The data content includes multiple documents, and each document is composed of multiple fields, whose types may be preset and include text-domain fields, numeric-domain fields, and classification-domain fields.
Further, a text-domain field is a field of plain text information, for example "I like this singer" or "this song is really pleasant". A numeric-domain field is a field whose digits or words represent a numerical value, for example "1" or "5", or "one" or "five". A classification-domain field is a field that indicates how the data is classified; for example, a song may be classified as "rock" or "jazz", and a video as "film", "variety show", or "news".
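The three field domains, and the idea that a document is a collection of typed fields, can be modeled with a small sketch. `FieldDomain`, `Field`, and `fields_in` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FieldDomain(Enum):
    TEXT = "text"          # plain text information
    NUMERIC = "numeric"    # values that represent a number
    CATEGORY = "category"  # classification labels


@dataclass
class Field:
    name: str
    domain: FieldDomain
    value: str


def fields_in(document, domain):
    """List the names of a document's fields that belong to one domain."""
    return [field.name for field in document if field.domain is domain]


song = [
    Field("comment", FieldDomain.TEXT, "this song is really pleasant"),
    Field("downloads", FieldDomain.NUMERIC, "5"),
    Field("genre", FieldDomain.CATEGORY, "jazz"),
]
```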
In addition, each field has at least one attribute, also called a configuration item. Attribute values are presented as selection boxes for the user to choose and configure.
Step (2): configure the attribute values of the multiple fields according to the field configuration information to obtain the configuration file corresponding to the business type.
That is, the configuration file corresponding to the business type is obtained from the user's configuration of the attribute values of each type of field.
On this basis, in a further preferred embodiment, the attribute values of the multiple fields may be configured in the following specific way; that is, step (2) may specifically include:
Step (21): configure the attribute values of the text-domain field according to the field configuration information to obtain the configured text-domain field.
In this embodiment, a text-domain field mainly contains textual information and is a field that users are expected to search. Its attributes may include one or more of description, data length, primary key, importance, and word segmentation mode.
Refer also to Fig. 2b and Fig. 2c: Fig. 2b is a schematic diagram of the field attribute configuration interface, and Fig. 2c is a schematic diagram of the user-defined field management interface. The meaning of each of the above attributes of the text-domain field is briefly explained below:
A. Description: the meaning of the field. It serves only as a prompt and does not affect search results.
B. Data length: the maximum length of the field's text. Fields are currently divided into two grades according to whether they exceed 256 bytes. A field whose maximum length exceeds 256 bytes is called a long text field, and among all text-domain fields only one may be configured as a long text field.
C. Primary key: the field that uniquely identifies a document, called doc_id. The field must be convertible into a numeric value; specifically, the value of doc_id is a 64-bit integer. Because this value should be uniformly distributed over the uint64_t space, it is preferably produced by a hash or similar means. A hash value is a number computed from the data content by logical operations; different documents yield different hash values, so the hash value serves as each document's identity card.
D. Importance: the degree of importance of the text field, which may be important, general, or unimportant.
E. Word segmentation mode: either normal segmentation or prefix segmentation. Normal segmentation splits text according to natural semantics and can generally be selected by default. Prefix segmentation suits search-box suggestion drop-downs. For example, "internal search platform department" can be split into its successive prefixes ("inner", "internal", "internal search", ...), so that when the user types "internal" in the search box, "internal search platform department" can be suggested.
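The primary-key attribute in item C can be illustrated with a short sketch that derives a 64-bit doc_id from document content. The patent only says that a hash or similar is preferred; the choice of SHA-256 truncated to 8 bytes, and the function name `make_doc_id`, are assumptions of this example.

```python
import hashlib


def make_doc_id(content: str) -> int:
    """Derive a 64-bit unsigned doc_id from the document content."""
    digest = hashlib.sha256(content.encode("utf-8")).digest()
    # Keep the first 8 bytes so the result fits the uint64_t range.
    return int.from_bytes(digest[:8], "big")


doc_id = make_doc_id("this song is really pleasant")
```

A cryptographic hash spreads values roughly uniformly over the 64-bit space. Note that any truncated hash can in principle collide, so a production system would still need a policy for duplicate doc_id values.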
It can be understood that the word segmentation instruction can be derived from the configured word segmentation mode, so that the data content is segmented according to that instruction.
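The prefix segmentation mode described for text-domain fields can be sketched at character level. The function names `prefix_tokens` and `suggest` are hypothetical, and a real system would likely cap the prefix length; this is only an illustration of the idea.

```python
from typing import List


def prefix_tokens(text: str) -> List[str]:
    """Return every leading prefix of the text, shortest first."""
    return [text[:end] for end in range(1, len(text) + 1)]


def suggest(typed: str, titles: List[str]) -> List[str]:
    """Suggest the titles for which the typed text is one of the indexed prefixes."""
    return [title for title in titles if typed in prefix_tokens(title)]


titles = ["internal search platform", "video search"]
matches = suggest("internal", titles)
```

Indexing every prefix trades index size for fast lookups of partially typed queries, which is why the source recommends this mode only for search-box suggestions.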
Step (22): configure the attribute values of the numeric-domain field according to the field configuration information to obtain the configured numeric-domain field.
In this embodiment, the attributes of the numeric-domain field include one or more of description, data type, authority, importance, and primary key.
A numeric-domain field is suited to numeric information such as price or download count. The string value of such a field must be convertible into a number. The meaning of each attribute of the numeric-domain field is briefly explained below:
A. Description: the meaning of the field. It serves only as a prompt and does not affect search results.
B. Data type: in this embodiment, the configuration item may offer the types int8, uint8, int16, uint16, int32, uint32, int64, uint64, and float. The user selects a type according to the largest range the value may take; if an actual value in the supplied data exceeds the configured range, an error occurs.
C. Authority: indicates that the field reflects the authority of the document. For example, for video search, the view count may be chosen as the authority field. Only zero or one numeric-domain field may be designated as the authority field.
D. Importance: the degree of importance of the field, which may be important, general, or unimportant.
E. Primary key: defined in the same way as the primary key of the text-domain field, that is, the field that uniquely identifies a document, called doc_id. The field must be convertible into a numeric value; specifically, the value of doc_id is a 64-bit integer. Because this value should be uniformly distributed over the uint64_t space, it is preferably produced by a hash or similar means.
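The data-type attribute in item B implies a range check when a string value is converted. The sketch below shows one way this could look; `INT_RANGES` and `parse_numeric` are hypothetical names, and raising `ValueError` is an assumption about what "an error occurs" means.

```python
# Value ranges for the integer types listed in the configuration item.
INT_RANGES = {}
for bits in (8, 16, 32, 64):
    INT_RANGES[f"int{bits}"] = (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    INT_RANGES[f"uint{bits}"] = (0, 2 ** bits - 1)


def parse_numeric(value: str, data_type: str):
    """Convert a numeric-domain string and reject values outside the configured type."""
    if data_type == "float":
        return float(value)
    low, high = INT_RANGES[data_type]
    number = int(value)
    if not low <= number <= high:
        raise ValueError(f"{number} does not fit in {data_type}")
    return number
```

For instance, `parse_numeric("255", "uint8")` is accepted, while `"256"` with the same type is rejected.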
Step (23): configure the attributes of the classification-domain field according to the field configuration information to obtain the configured classification-domain field.
In this embodiment, the attributes of the classification-domain field include the category specified for retrieval.
Step (24): generate the configuration file corresponding to the business type from the configured text-domain field, the configured numeric-domain field, and the configured classification-domain field.
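Steps (21) to (24) can be sketched as a function that validates the configured fields and serializes them. JSON is an assumed file format, and `build_config` and the attribute keys (`"long_text"`, `"authority"`, and so on) are hypothetical; the two checks mirror the constraints stated earlier for long text fields and authority fields.

```python
import json


def build_config(business_type, text_fields, numeric_fields, category_fields):
    """Assemble a configuration file for one business type as a JSON string."""
    long_text = [f for f in text_fields if f.get("long_text")]
    authority = [f for f in numeric_fields if f.get("authority")]
    if len(long_text) > 1:
        raise ValueError("at most one text-domain field may be a long text field")
    if len(authority) > 1:
        raise ValueError("at most one numeric-domain field may be the authority field")
    config = {
        "business_type": business_type,
        "text_fields": text_fields,
        "numeric_fields": numeric_fields,
        "category_fields": category_fields,
    }
    return json.dumps(config, ensure_ascii=False, indent=2)


video_config = build_config(
    "video",
    text_fields=[{"name": "title", "segmentation": "normal", "importance": "important"}],
    numeric_fields=[{"name": "views", "data_type": "uint32", "authority": True}],
    category_fields=[{"name": "category"}],
)
```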
In step S202, business data is acquired.
The business data includes data content and a business type. The business type may include video, music, pictures, and so on; correspondingly, the business data may include video data, music data, picture data, and so on, which is not specifically limited here.
It can be understood that, after the configuration files corresponding to the different business types are generated, they are preset on the server. Thereafter, once the user's business data is acquired, the server is triggered to retrieve, according to its business type, the corresponding configuration file from the preset configuration files, and processing is performed according to that configuration file to generate the index file.
In step S203, the corresponding configuration file is acquired according to the business type.
It can be understood that each business type corresponds to one configuration file, which is generated in advance from the user's configuration information in step S201 and stored on the server.
In step S204, the data content is preprocessed according to the field preprocessing instruction to generate preprocessed data content.
Because each business type corresponds to one configuration file, each business type has its own field preprocessing instruction. The data content of each business type is preprocessed according to its corresponding instruction, for example by rewriting certain fields of the data pushed by the business, cleaning the data, or supplementing data labels, which allows for personalized, differentiated operations across business types.
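The three kinds of preprocessing named above (rewriting a field, cleaning data, supplementing a label) can be sketched for one business type. The function names, the `singer` to `artist` rename, and the `tag` label are hypothetical choices for illustration only.

```python
import re


def clean_text(value):
    """Data cleaning: drop HTML-like tags and collapse runs of whitespace."""
    without_tags = re.sub(r"<[^>]+>", "", value)
    return re.sub(r"\s+", " ", without_tags).strip()


def preprocess_music(content):
    """Preprocess one music record according to a music-specific instruction."""
    out = {name: clean_text(value) for name, value in content.items()}
    if "singer" in out:
        out["artist"] = out.pop("singer")  # rewrite a field pushed by the business
    out["tag"] = "music"                   # supplement a data label
    return out


# One preprocessor per business type keeps the differences out of the shared pipeline.
PREPROCESSORS = {"music": preprocess_music}

cleaned = PREPROCESSORS["music"]({"title": " <b>Song</b>  A ", "singer": "X"})
```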
In step S205, the preprocessed data content is analyzed to determine the attribute information of the data content.
In some embodiments, a preset word segmentation template may be acquired, and the preprocessed data content is analyzed against the template to determine the attribute information of the data content. The server is preset with multiple word segmentation templates, which may include data templates for multiple business types. For music data, for example, the data template may include a singer database, a song title database, and a genre database; analyzing the data content against them yields its attribute information. For example, if the data content belongs to the music business type, the attribute information refers to attributes of numeric type such as a song's download count and play count.
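One possible reading of the template analysis in step S205 is a lookup of content values against per-type databases. The sketch below assumes that reading; `MUSIC_TEMPLATE`, its entries, and `analyze` are hypothetical, and the patent does not describe the matching procedure in this detail.

```python
# Hypothetical data template for the music business type.
MUSIC_TEMPLATE = {
    "singer": {"alice"},
    "song": {"blue sky"},
    "genre": {"rock", "jazz"},
}


def analyze(values, template):
    """Label each value with the template database it is found in, if any."""
    attributes = {}
    for value in values:
        for attribute, known in template.items():
            if value.lower() in known:
                attributes[value] = attribute
    return attributes


labels = analyze(["Alice", "Jazz", "unknown"], MUSIC_TEMPLATE)
```

Values that match no database are simply left unlabeled in this sketch.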
In step S206, the preprocessed business data is segmented according to the word segmentation instruction and the attribute information, and forward-index processing is performed on the segmented data content to generate a forward index file in a unified data format.
After preprocessing, the data may be processed according to the attribute information and the preset word segmentation instruction, which is a normalization operation: business data in different formats is normalized into a unified data format, yielding normalized forward-index data suited to data retrieval across multiple business types.
It can be understood that, after preprocessing, the data enters the forward-index processing unit FSU for forward index generation. Using the word segmentation instruction configured in the configuration file and several built-in word segmentation templates, the data is processed to compute the information that search needs, such as wordid and word POS information, and a forward index file in a unified format is finally output.
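A forward (in-line arrangement) index entry carrying a word id and a position for every token might look like the sketch below. It assumes that "POS" means the token's position in the document, which the source does not confirm, and it assigns word ids incrementally from a shared vocabulary; `build_forward_entry` is a hypothetical name.

```python
def build_forward_entry(doc_id, tokens, vocabulary):
    """Map each token to (wordid, position) and record the list under the document."""
    words = []
    for position, token in enumerate(tokens):
        # Reuse the id of a known token, otherwise assign the next free id.
        wordid = vocabulary.setdefault(token, len(vocabulary))
        words.append((wordid, position))
    return {"doc_id": doc_id, "words": words}


vocabulary = {}
entry = build_forward_entry(7, ["blue", "sky", "blue"], vocabulary)
```

Because the vocabulary is shared across documents, entries from every business type refer to the same word ids, which keeps the output format uniform.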
It can be understood that, after the forward index file in a unified data format is generated, the method may further include:
In step S207, the forward index file is converted into an inverted index file, so that users can perform retrieval according to the inverted index file.
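Converting a forward (in-line arrangement) index into an inverted index, as in step S207, can be sketched as follows. The in-memory shapes used here (a dictionary from doc_id to token list, and a dictionary from token to postings) are assumptions for illustration, not the file formats of the patent.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {doc_id: [token, ...]} into {token: [(doc_id, position), ...]}."""
    inverted = defaultdict(list)
    for doc_id, tokens in forward_index.items():
        for position, token in enumerate(tokens):
            inverted[token].append((doc_id, position))
    return dict(inverted)


forward_index = {1: ["blue", "sky"], 2: ["sky"]}
inverted_index = invert(forward_index)
```

A query for a word then becomes a single lookup in the inverted structure instead of a scan over every document.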
As can be seen from the above, this embodiment acquires the corresponding configuration file according to the business type of each kind of business data, preprocesses the data content of the business data according to the field preprocessing instruction in the configuration file, and then performs word segmentation on the preprocessed data content according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The invention processes data of different business types with their corresponding configuration files and then segments the data content with the same program, normalizing business data in different formats into index data in a unified data format. Index files can therefore be built for multiple business data types in a unified way, which simplifies the building process and improves efficiency.
Third embodiment
Referring to Fig. 3a and Fig. 3b, which are schematic flowcharts of the index file generation method provided by the third embodiment of the invention. The index file generation method provided by the invention is based on a B/S (browser/server) architecture: users work with the system through a browser, and the system supports generating index data in a unified data format for data of multiple business types on the same platform.
This embodiment mainly analyzes the process of preprocessing the data content. The method includes the following steps:
In step S301, service data is obtained.
The service data includes data content and a service type. The service type may include video, music, pictures and so on; correspondingly, the service data may include video data, music data, picture data and so on, which is not specifically limited here.
It is understood that, after the configuration files corresponding to the different service types are generated, they are preset in the server. Thereafter, once the service data of a user is obtained, the server is triggered to retrieve, according to the service type, the matching configuration file from the preset configuration files, and to process the data according to that configuration file to generate the index file.
In step S302, the corresponding configuration file is obtained according to the service type.
It is understood that each service type corresponds to one configuration file. The configuration file contains a field preprocessing instruction and a word segmentation instruction; it is configured in advance by the user according to the characteristics of the service type in actual operation and is stored in the server.
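The lookup in step S302 can be pictured as a simple dispatch on the service type. The following is a minimal sketch only: the names `ServiceConfig`, `PRESET_CONFIGS` and `get_config`, the dictionary layout of the service data, and the sample instruction values are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical in-memory form of one preset configuration file."""
    field_preprocessing: Dict[str, str] = field(default_factory=dict)
    word_segmentation: Dict[str, str] = field(default_factory=dict)


# One preset configuration per service type (sample values are illustrative).
PRESET_CONFIGS: Dict[str, ServiceConfig] = {
    "music": ServiceConfig(
        field_preprocessing={"title": "clean,rewrite"},
        word_segmentation={"title": "normal"},
    ),
    "video": ServiceConfig(
        field_preprocessing={"title": "clean"},
        word_segmentation={"title": "prefix"},
    ),
}


def get_config(service_data: dict) -> ServiceConfig:
    """Return the preset configuration matching the record's service type."""
    service_type = service_data["service_type"]
    if service_type not in PRESET_CONFIGS:
        raise KeyError(f"no configuration preset for service type {service_type!r}")
    return PRESET_CONFIGS[service_type]
```

Keeping the mapping keyed by service type means a new kind of service only needs a new configuration entry, which matches the stated goal of handling all service types on one platform.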
Preferably, before the service data is obtained (that is, before step S301), the method may further include generating the configuration files corresponding to the different service types. Specifically, the field configuration information corresponding to a service type may be obtained first, and the attribute values of the multiple fields are then configured as indicated by the field configuration information, yielding the configuration file corresponding to that service type.
In the embodiments of the invention, the fields may include text-domain fields, numeric-domain fields and category-domain fields. Each field has its own attributes, which can be configured as indicated by the configuration information of that field to generate the configuration file. For the details of generating the configuration files corresponding to the different service types, refer to the description of step S201 in the foregoing embodiment, which is not repeated here.
It is understood that the server may include a dynamic library for index data preprocessing. After the configuration file is obtained, this library preprocesses the data content according to the field preprocessing instruction in the configuration file, generating the preprocessed data content.
In this embodiment, preprocessing the data content mainly includes data cleaning and data rewriting. The execution order of the two is not limited: data cleaning may be performed before data rewriting, data rewriting may be performed before data cleaning, or both may be performed simultaneously without affecting each other. The examples here do not limit the invention.
In one implementation, after the configuration file is obtained, step S303A may be performed:
Referring to Fig. 3a, in step S303A, according to the field preprocessing instruction in the configuration file, data cleaning is performed first and data rewriting afterwards. Step S303A may specifically include:
Step A: determine whether a junk field exists in the data content;
according to the determination result, perform step A1 or step A2;
Step A1: if a junk field exists, delete the junk field from the data content, and determine whether the data content after deletion needs to be rewritten;
according to the determination result of step A1, perform step A11 or step A12;
Step A11: if rewriting is needed, rewrite the data content after deletion, and take the rewritten data content as the preprocessed data content;
Step A12: if rewriting is not needed, take the data content after deletion as the preprocessed data content;
Step A2: if no junk field exists, determine whether the data content needs to be rewritten;
according to the determination result of step A2, perform step A21 or step A22;
Step A21: if rewriting is needed, rewrite the data content, and take the rewritten data content as the preprocessed service data;
Step A22: if rewriting is not needed, take the data content as the preprocessed data content.
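The branching in steps A through A22 reduces to "delete junk fields if any exist, then rewrite if needed". A minimal sketch of that flow follows. It assumes the data content is a dictionary of field names to string values, and the three callables (`is_junk`, `needs_rewrite`, `rewrite`) are hypothetical hooks standing in for whatever the field preprocessing instruction specifies; none of these names come from the patent.

```python
from typing import Callable, Dict

Content = Dict[str, str]


def preprocess_clean_then_rewrite(
    content: Content,
    is_junk: Callable[[str, str], bool],
    needs_rewrite: Callable[[Content], bool],
    rewrite: Callable[[Content], Content],
) -> Content:
    """Clean first, then rewrite (the order used in step S303A)."""
    # Steps A / A1: drop every field judged to be junk. If none is junk,
    # the content is carried over unchanged (the step A2 branch).
    cleaned = {name: value for name, value in content.items()
               if not is_junk(name, value)}
    # Steps A11 / A21: rewrite when the (possibly cleaned) content needs it.
    if needs_rewrite(cleaned):
        return rewrite(cleaned)
    # Steps A12 / A22: otherwise the cleaned content is the result.
    return cleaned
```

The rewrite-first variant of step S303B would apply the same two operations in the opposite order.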
In another implementation, after the configuration file is obtained, step S303B may be performed:
Referring to Fig. 3b, in step S303B, according to the field preprocessing instruction in the configuration file, data rewriting is performed first and data cleaning afterwards. Step S303B may specifically include:
Step B: determine whether the data content needs to be rewritten;
according to the determination result, perform step B1 or step B2;
Step B1: if rewriting is needed, rewrite the data content, and determine whether a junk field exists in the rewritten data content;
according to the determination result of step B1, perform step B11 or step B12;
Step B11: if a junk field exists, delete the junk field from the rewritten data content, and take the data content after deletion as the preprocessed data content;
Step B12: if no junk field exists, take the rewritten data content as the preprocessed data content;
Step B2: if rewriting is not needed, determine whether a junk field exists in the data content;
according to the determination result of step B2, perform step B21 or step B22;
Step B21: if a junk field exists, delete the junk field from the data content, and take the data content after deletion as the preprocessed data content;
Step B22: if no junk field exists, take the data content as the preprocessed data content.
Further, in both step S303A and step S303B, the purpose of data cleaning is to remove junk fields, such as punctuation marks, from the data content; such junk content degrades the subsequent retrieval experience and should therefore be removed. The purpose of data rewriting is to apply special handling that some data requires, for example splitting a mixed Chinese-English name in a field into two names. These preprocessing operations therefore need to be performed before the index data is generated.
Further preferably, the server may also include a dynamic library for raw data preprocessing, which processes the original service data; the processed data serves as the input of the preprocessing operations described above. This processing mainly includes data expansion, format checking and the like. Data expansion addresses the case where the data pushed by some services is incomplete and cannot satisfy all of the user's search needs: the service data is supplemented by crawling other resources on the Internet. For video and music search, for example, a large amount of data from resources outside the system itself is supplemented. Format checking verifies the correctness of the data pushed by a service, checking whether it contains data types or fields that do not match the configuration. The invention does not specifically limit how the original service data is processed.
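The format check described above amounts to comparing each pushed record with the configured fields and types. The sketch below is illustrative only: `check_format`, the record layout and the use of Python types as the "configured" types are assumptions, not the patent's actual interface.

```python
from typing import Dict, List


def check_format(record: Dict[str, object],
                 expected_types: Dict[str, type]) -> List[str]:
    """Return a list of mismatches between a record and its configuration."""
    problems: List[str] = []
    for name, expected in expected_types.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(
                f"field {name}: expected {expected.__name__}, got {actual}")
    # Fields that the configuration does not know about are also reported.
    for name in record:
        if name not in expected_types:
            problems.append(f"unexpected field: {name}")
    return problems
```

An empty result means the record matches the configuration; any entry in the list identifies a field that should be rejected or repaired before index generation.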
In step S304, the preprocessed data content is analyzed to determine the attribute information of the data content.
In some implementations, a preset word segmentation template may be obtained, and the preprocessed data content is analyzed according to the template to determine the attribute information of the data content. Multiple word segmentation templates are preset in the server and may include data templates for multiple service types. For music data, for example, the data template may include a singer database, a song title database and a genre database; analyzing the data content against them reveals its attribute information. For example, if the data content belongs to the music service type, the attribute information may refer to numeric attributes such as the download count and play count of a song.
In step S305, the preprocessed service data is segmented according to the word segmentation instruction and the attribute information, and forward-index processing is performed on the segmented data content to generate a forward index file in the unified data format.
After preprocessing, the data can be processed according to the attribute information and the preset word segmentation instruction. This is a normalization operation: service data of different formats is normalized into a unified data format, yielding normalized forward-index data that supports data retrieval for multiple service types.
It is understood that, after preprocessing, the data enters the forward-index processing unit (FSU) for forward index generation. Following the word segmentation instruction configured in the configuration file and using several built-in word segmentation templates, the FSU processes the data, computes the data that search needs to use, such as wordid and word POS information, and finally outputs the forward index file in the unified format.
It is understood that, after the forward index file in the unified data format is generated, the method may further include:
In step S306, the forward index file is converted into an inverted index file, so that a user can perform retrieval according to the inverted index file.
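The conversion in step S306 can be sketched as follows. The patent does not describe the file layouts, so this example assumes the simplest possible in-memory form: a forward index mapping each document id to its list of terms, and an inverted index mapping each term to a sorted list of document ids. The function name `invert` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def invert(forward_index: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Turn a doc_id -> terms mapping into a term -> doc_ids mapping."""
    inverted: Dict[str, List[int]] = defaultdict(list)
    # Visiting documents in ascending id order keeps every posting list sorted.
    for doc_id in sorted(forward_index):
        seen = set()
        for term in forward_index[doc_id]:
            if term not in seen:  # record each document once per term
                seen.add(term)
                inverted[term].append(doc_id)
    return dict(inverted)
```

Sorted posting lists are a common design choice because they make the later intersection of several terms straightforward.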
Building on the above, the index file generated by the method is briefly analyzed below using an application scenario:
It is understood that this generation method is based on a B/S (browser/server) system architecture, and the system supports generating index data in a unified data format for data of multiple service types on the same platform.
First, the platform provides page-based configuration. After service data is connected, the platform must be told which data fields the current service data has, together with the type and attribute values of each field. For details, refer to the description of field configuration in the second embodiment, which is not repeated here.
For example, a novel search service has six fields: four are text-domain fields that need to be indexed, and two are numeric-domain fields provided for relevance scoring. For each field selected for indexing, semantic word segmentation is performed, wordid values are computed, and an inverted index is finally built; these are the fields that users can search.
The word segmentation mode defines how the text in each field is split when the text index is built. Common modes are normal segmentation, prefix segmentation and category index segmentation.
Normal segmentation performs ordinary semantic segmentation of the text; for example, a sentence meaning "the weather is really nice today" is split into four words: today / weather / really / nice. Prefix segmentation instead splits the same sentence into six entries, each a successively longer prefix of the sentence, ending with the whole sentence; this mode is mainly used for the search suggestion function. Category index segmentation is a more advanced usage for text that carries a category, such as novels classified as martial arts, romance or science fiction. Once an index is built with category index segmentation, the service can query by novel category, for example searching for novels titled "Three-Body" whose category is science fiction.
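Prefix segmentation, as described above, emits successively longer prefixes of the text so that a partially typed query can match an index term. A minimal sketch follows; it assumes prefixes are taken per character, and the function name `prefix_segments` is hypothetical.

```python
from typing import List


def prefix_segments(text: str) -> List[str]:
    """Return every non-empty prefix of the text, shortest first."""
    return [text[:end] for end in range(1, len(text) + 1)]
```

A text of n characters therefore produces n index terms, which is why this mode suits short fields such as titles used for search suggestions rather than long bodies of text.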
In a preferred embodiment, retrieval within a specified field can also be provided. For example, if the user only wants books whose title is "Three-Body" and not books whose author name contains "Three-Body", field-restricted retrieval can be used. The platform adds a special character to the index term to mark it as a term for retrieval within a field. When the user issues a request, the corresponding mark is added to the query as well, so that documents that contain the word "Three-Body" only in other fields are excluded directly through index intersection.
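One way to picture field-restricted retrieval is to prefix each index term with its field name and a reserved separator, then intersect the posting sets of the marked terms. The sketch below makes several assumptions that are not in the patent: the separator character, the names `scoped`, `build_index` and `search`, and the use of in-memory sets as posting lists.

```python
from typing import Dict, Iterable, List, Set, Tuple

FIELD_SEP = "\x1f"  # hypothetical reserved character marking a field-scoped term


def scoped(field: str, term: str) -> str:
    """Build the index term used for retrieval within one field."""
    return f"{field}{FIELD_SEP}{term}"


def build_index(docs: Dict[int, Dict[str, List[str]]]) -> Dict[str, Set[int]]:
    """Index every term of every field under its field-scoped form."""
    index: Dict[str, Set[int]] = {}
    for doc_id, fields in docs.items():
        for field, terms in fields.items():
            for term in terms:
                index.setdefault(scoped(field, term), set()).add(doc_id)
    return index


def search(index: Dict[str, Set[int]],
           constraints: Iterable[Tuple[str, str]]) -> Set[int]:
    """Intersect the posting sets of all (field, term) constraints."""
    postings = [index.get(scoped(f, t), set()) for f, t in constraints]
    return set.intersection(*postings) if postings else set()
```

Because the title term and the author term are stored under different keys, a query scoped to the title never touches documents that contain the word only in the author field.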
As described above, in this embodiment a corresponding configuration file is obtained according to the service type of each of several kinds of service data. The data content of the service data is then preprocessed according to the field preprocessing instruction in the configuration file, and the preprocessed data content is segmented according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The invention processes data of different service types with their corresponding configuration files and then segments the data content with the same program, normalizing service data of different formats into index data of a unified data format. Index files can therefore be built uniformly for multiple service data types, which simplifies the building process and improves efficiency.
Fourth embodiment
To facilitate implementing the index file generation method provided by the embodiments of the invention, an embodiment of the invention further provides an index file generation device corresponding to that method. The terms have the same meanings as in the index file generation method above, and implementation details can be found in the description of the method embodiments.
Refer to Fig. 4, which is a schematic structural diagram of the index file generation device provided by the fourth embodiment of the invention. The device is based on a B/S (browser/server) system architecture: users access the system through a browser, and the system supports generating index data in a unified data format for data of multiple service types on the same platform.
The device includes a first acquisition module 401, a second acquisition module 402, a preprocessing module 403, a word segmentation module 404 and an index generation module 405.
The first acquisition module 401 is configured to obtain service data, the service data including data content and a service type. The second acquisition module 402 is configured to obtain a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction.
It is understood that the service type may include video, music, pictures and so on; correspondingly, the service data may include video data, music data, picture data and so on, which is not specifically limited here. In addition, the data format of the service data in this embodiment may be divided into two parts: one part carries information indicating the service type, and the other part carries the data content corresponding to that service type.
Each service type corresponds to one configuration file, which is configured in advance by the user according to the characteristics of the service type in actual operation and is stored in the index file generation device.
Further, the configuration file contains an instruction for preprocessing the fields of the data content and an instruction for segmenting the fields of the data content. The configuration file is generated from the user's configuration of the fields of each kind of service data; the configuration of fields is not specifically limited here.
The preprocessing module 403 is configured to preprocess the data content according to the field preprocessing instruction, generating the preprocessed data content. The word segmentation module 404 is configured to perform word segmentation on the preprocessed data content according to the word segmentation instruction. The index generation module 405 is configured to perform forward-index processing on the segmented data content, generating an index file in the unified data format.
Because each service type corresponds to one configuration file, each service type also has its own field preprocessing instruction. Preprocessing the data content of each service type according to its own instruction accommodates the individual differences between service types. After preprocessing, the data is processed according to the word segmentation templates preset on the platform and the preset word segmentation instruction. This is a normalization operation: service data of different formats is sent to the forward-index processing unit (FSU) for forward index generation and is normalized into a unified data format, yielding normalized forward-index data that supports data retrieval for multiple service types.
As described above, in this embodiment a corresponding configuration file is obtained according to the service type of each of several kinds of service data. The data content of the service data is then preprocessed according to the field preprocessing instruction in the configuration file, and the preprocessed data content is segmented according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The invention processes data of different service types with their corresponding configuration files and then segments the data content with the same program, normalizing service data of different formats into index data of a unified data format. Index files can therefore be built uniformly for multiple service data types, which simplifies the building process and improves efficiency.
Fifth embodiment
Refer to Fig. 5, which is a schematic structural diagram of the index file generation device provided by the fifth embodiment of the invention. It should be noted that the index file generation device provided by the invention is based on a B/S (browser/server) system architecture: users access the system through a browser, and the system supports generating index data in a unified data format for data of multiple service types on the same platform.
The index file generation device may include a first acquisition module 501, a second acquisition module 502, a preprocessing module 503, a word segmentation module 504 and an index generation module 505. It is understood that, in this embodiment, the functions of these modules correspond to those of the first acquisition module 401, the second acquisition module 402, the preprocessing module 403, the word segmentation module 404 and the index generation module 405 in the fourth embodiment; refer to the related description there, which is not repeated here.
The device further includes a configuration file generation module 506, configured to generate the configuration files corresponding to the different service types before the service data is obtained.
It is understood that each service type corresponds to one configuration file. The service type may include video, music, pictures and so on; correspondingly, the service data includes video data, music data and picture data. In this embodiment, the configuration file is configured in advance by the user according to the characteristics of the service type in actual operation and is stored in the server. The configuration file contains an instruction for preprocessing the fields of the data content and an instruction for segmenting the fields of the data content.
In a preferred embodiment, the configuration file may be obtained as follows:
The configuration file generation module 506 includes an acquisition unit 5061 and a configuration unit 5062.
The acquisition unit 5061 is configured to obtain the field configuration information corresponding to the service type. The field configuration information indicates the attribute values of multiple preset fields, the fields including text-domain fields, numeric-domain fields and category-domain fields.
It is understood that the service data of the invention includes data content and a service type. The data content includes multiple documents, and each document consists of multiple fields. The field types can be preset and include text-domain fields, numeric-domain fields and category-domain fields. In addition, each field has at least one attribute, also called a configuration item, whose values are presented as selection boxes for the user to select and configure.
The configuration unit 5062 is configured to configure the attribute values of the multiple fields as indicated by the field configuration information, obtaining the configuration file corresponding to the service type.
On this basis, in a further preferred embodiment, the attribute values of the multiple fields may be configured as follows. The configuration unit 5062 may include a first configuration subunit, a second configuration subunit, a third configuration subunit and a generation subunit.
The first configuration subunit is configured to configure the attribute values of the text-domain field as indicated by the field configuration information, obtaining the configured text-domain field. The attributes of the text-domain field include one or more of description, data length, primary key, importance and word segmentation mode.
The attributes of the text-domain field are briefly described as follows:
Description: the meaning of the field; it serves as a hint only and does not affect search results.
Data length: the maximum length of the field's text. Fields are currently divided into two classes according to whether they exceed 256 bytes; a field whose maximum length exceeds 256 bytes is called a long text field, and only one field in the whole text domain can be configured as a long text field.
Primary key: the field that uniquely identifies a document, referred to as doc_id. The field must be convertible to a numeric value; specifically, the value of doc_id is a 64-bit integer. Because this value should be uniformly distributed over the uint64_t space, it is preferably produced by a hash or a similar method. A hash value is a number obtained from the data content through logical operations; different documents generally yield different hash values, so the hash value serves as the identity of each document.
Importance: the significance of the text field, which can be classed as important, general, unimportant and so on.
Word segmentation mode: either normal segmentation or prefix segmentation. Normal segmentation splits the text according to natural semantics and can usually be selected by default; prefix segmentation suits the scenario of a search box suggestion drop-down.
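The primary key attribute above asks for a 64-bit doc_id that is spread uniformly over the uint64_t range, preferably produced by hashing. The patent does not name a hash function; the sketch below uses BLAKE2b from the Python standard library with an 8-byte digest purely as an illustrative choice, and the function name `doc_id_from_content` is hypothetical.

```python
import hashlib


def doc_id_from_content(content: str) -> int:
    """Derive an unsigned 64-bit document id from the document's content."""
    # An 8-byte digest maps directly onto the uint64_t range.
    digest = hashlib.blake2b(content.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big")
```

The same content always yields the same id, so a document that is pushed again replaces its earlier version rather than creating a duplicate. Collisions between different documents remain possible in principle with any 64-bit hash, although they are rare.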
The second configuration subunit is configured to configure the attribute values of the numeric-domain field as indicated by the field configuration information, obtaining the configured numeric-domain field. The attributes of the numeric-domain field include one or more of description, data type, authority, importance and primary key.
The numeric-domain field suits numeric information such as price or download count. The string value in this field must be convertible to a number. The attributes of the numeric-domain field are briefly described as follows:
Description: the meaning of the field; it serves as a hint only and does not affect search results.
Data type: in this embodiment the configuration item offers int8, uint8, int16, uint16, int32, uint32, int64, uint64 and float. The user selects a type according to the largest range the value may take; if an actual value in the supplied data exceeds the configured range, an error occurs.
Authority: indicates that this field reflects the authority of the document. For video search, for example, the view count can be selected as the authority field. Only zero or one numeric-domain field can be designated as the authority field.
Importance: the significance of the field, which can be classed as important, general, unimportant and so on.
Primary key: defined as for the text-domain field, that is, the field that uniquely identifies a document, referred to as doc_id. The field must be convertible to a numeric value; specifically, the value of doc_id is a 64-bit integer. Because this value should be uniformly distributed over the uint64_t space, it is preferably produced by a hash or a similar method.
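The data type attribute above implies a range check when a string value is converted to a number. A minimal sketch of such a check follows; the type names are those listed in the text, while the function name `parse_numeric` and the choice to raise `ValueError` on overflow are assumptions made for illustration.

```python
from typing import Dict, Tuple, Union

# Inclusive value ranges of the integer types offered by the configuration item.
INT_RANGES: Dict[str, Tuple[int, int]] = {
    "int8": (-2 ** 7, 2 ** 7 - 1),
    "uint8": (0, 2 ** 8 - 1),
    "int16": (-2 ** 15, 2 ** 15 - 1),
    "uint16": (0, 2 ** 16 - 1),
    "int32": (-2 ** 31, 2 ** 31 - 1),
    "uint32": (0, 2 ** 32 - 1),
    "int64": (-2 ** 63, 2 ** 63 - 1),
    "uint64": (0, 2 ** 64 - 1),
}


def parse_numeric(raw: str, data_type: str) -> Union[int, float]:
    """Convert a field's string value, rejecting values outside the type's range."""
    if data_type == "float":
        return float(raw)
    low, high = INT_RANGES[data_type]
    value = int(raw)  # raises ValueError if the string is not an integer
    if not low <= value <= high:
        raise ValueError(f"{value} is out of range for {data_type}")
    return value
```

Checking at conversion time surfaces a badly chosen data type as soon as the first oversized value arrives, instead of silently truncating it in the index.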
The third configuration subunit is configured to configure the attributes of the category-domain field as indicated by the field configuration information, obtaining the configured category-domain field; the attributes of the category-domain field include specified-category retrieval. The generation subunit is configured to generate the configuration file corresponding to the service type from the configured text-domain field, the configured numeric-domain field and the configured category-domain field.
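The three field kinds and their attributes can be modelled as a small data structure with a validation step for the constraints stated in the text (at most one long text field, at most one authority field). This is a sketch only: every class, attribute and message name is hypothetical, and the rule that exactly one field acts as primary key is an assumption that the text does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextField:
    name: str
    description: str = ""
    long_text: bool = False       # maximum length exceeds 256 bytes
    primary_key: bool = False
    importance: str = "general"
    segmentation: str = "normal"  # "normal" or "prefix"


@dataclass
class NumericField:
    name: str
    description: str = ""
    data_type: str = "int32"
    authority: bool = False
    importance: str = "general"
    primary_key: bool = False


@dataclass
class CategoryField:
    name: str


@dataclass
class ServiceConfigFile:
    service_type: str
    text_fields: List[TextField] = field(default_factory=list)
    numeric_fields: List[NumericField] = field(default_factory=list)
    category_fields: List[CategoryField] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return the violated constraints; an empty list means valid."""
        errors: List[str] = []
        if sum(f.long_text for f in self.text_fields) > 1:
            errors.append("only one long text field is allowed")
        if sum(f.authority for f in self.numeric_fields) > 1:
            errors.append("only one authority field is allowed")
        keys = (sum(f.primary_key for f in self.text_fields)
                + sum(f.primary_key for f in self.numeric_fields))
        if keys != 1:  # assumption: exactly one primary key per service type
            errors.append("exactly one primary key field is expected")
        return errors
```

Validating at configuration time lets the page-based configuration reject an invalid setup before any service data is processed.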
Further preferably, the preprocessing module 503 may preprocess the data content according to the field preprocessing instruction in the configuration file. Preprocessing the data content mainly includes data cleaning and data rewriting. The execution order of the two is not limited: data cleaning may be performed before data rewriting, data rewriting may be performed before data cleaning, or both may be performed simultaneously without affecting each other. The examples here do not limit the invention.
On this basis, in one implementation, the preprocessing module 503 performs data cleaning first and data rewriting afterwards. The preprocessing module 503 may include a first determination unit 5031, a first processing unit 5032 and a second processing unit 5033.
The first determination unit 5031 is configured to determine whether a junk field exists in the data content.
The first processing unit 5032 is configured to, if a junk field exists, delete the junk field from the data content and determine whether the data content after deletion needs to be rewritten; if rewriting is needed, rewrite the data content after deletion and take the rewritten data content as the preprocessed data content; if rewriting is not needed, take the data content after deletion as the preprocessed data content.
The second processing unit 5033 is configured to, if no junk field exists, determine whether the data content needs to be rewritten; if rewriting is needed, rewrite the data content and take the rewritten data content as the preprocessed service data; if rewriting is not needed, take the data content as the preprocessed data content.
In another implementation, the preprocessing module 503 performs data rewriting first and data cleaning afterwards. The preprocessing module 503 may include a second determination unit 5034, a third processing unit 5035 and a fourth processing unit 5036.
The second determination unit 5034 is configured to determine whether the data content needs to be rewritten.
The third processing unit 5035 is configured to, if rewriting is needed, rewrite the data content and determine whether a junk field exists in the rewritten data content; if a junk field exists, delete the junk field from the rewritten data content and take the data content after deletion as the preprocessed data content; if no junk field exists, take the rewritten data content as the preprocessed data content.
The fourth processing unit 5036 is configured to, if rewriting is not needed, determine whether a junk field exists in the data content; if a junk field exists, delete the junk field from the data content and take the data content after deletion as the preprocessed data content; if no junk field exists, take the data content as the preprocessed data content.
Further, the word segmentation module 504 may include an attribute information determination unit, configured to analyze the preprocessed data content to determine the attribute information of the data content, and a segmentation unit, configured to segment the preprocessed service data according to the word segmentation instruction and the attribute information, generating the segmented data content.
In some implementations, the attribute information determination unit may include an acquisition subunit, configured to obtain a preset word segmentation template, and a determination subunit, configured to analyze the preprocessed data content according to the word segmentation template and determine the attribute information of the data content. Multiple word segmentation templates are preset in the server and may include data templates for multiple service types. For music data, for example, the data template may include a singer database, a song title database and a genre database; analyzing the data content against them reveals its attribute information.
After preprocessing, the data can be processed according to the attribute information and the preset word segmentation instruction. This is a normalization operation: service data of different formats is normalized into a unified data format, yielding normalized forward-index data that supports data retrieval for multiple service types.
It is understood that, after preprocessing, the data enters the forward-index processing unit (FSU) for forward index generation. Following the word segmentation instruction configured in the configuration file and using several built-in word segmentation templates, the FSU processes the data, computes the data that search needs to use, such as wordid and word POS information, and finally outputs the forward index file in the unified format.
It can be understood that, after the forward index file in the unified data format is generated, the device may further include a conversion module 507, configured to convert the forward index file into an inverted index file so that users can search according to the inverted index file.
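The conversion performed by such a module can be sketched as below. The data layout (a dictionary from document ID to its list of terms) and the function names are assumptions chosen for illustration; the source does not specify the file format.

```python
from collections import defaultdict


def to_inverted_index(forward_index):
    """Turn {doc_id: [term, ...]} into {term: [doc_id, ...]} with sorted, unique postings."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for term in forward_index[doc_id]:
            postings = inverted[term]
            # Documents are visited in ascending order, so a repeat can only be the last entry.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(inverted)


def search(inverted_index, term):
    """Return the IDs of all documents containing the term."""
    return inverted_index.get(term, [])


inverted = to_inverted_index({1: ["hello", "adele"], 2: ["adele", "live"]})
print(search(inverted, "adele"))
```

The forward index answers "which terms does this document contain", while the inverted index answers "which documents contain this term", which is the lookup a user query needs.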
As can be seen from the above, in this embodiment a corresponding configuration file is obtained according to the service type of each kind of service data. The data content of the service data is then preprocessed according to the field preprocessing instruction in the configuration file, and the preprocessed data content is segmented according to the word segmentation instruction in the configuration file, so that an index file in a unified data format is generated. The present invention processes data of different service types with their corresponding configuration files and then uses the same program to segment the data content, normalizing service data in different formats into index data in a unified data format. Index files can therefore be built in a unified way for multiple types of service data, which simplifies the building process and improves efficiency.
Sixth embodiment
An embodiment of the present invention also provides a server in which the index file generation device of the embodiments of the present invention may be integrated. Figure 6 shows a schematic structural diagram of the server involved in the embodiment of the present invention. Specifically:
The server may include a processor 601 with one or more processing cores, a memory 602 with one or more computer-readable storage media, a radio frequency (RF) circuit 603, a power supply 604, an input unit 605, a display unit 606, and other components. Those skilled in the art will understand that the server structure shown in Figure 6 does not limit the server, which may include more or fewer components than illustrated, combine some components, or arrange the components differently. Specifically:
The processor 601 is the control center of the server. It connects all parts of the server through various interfaces and lines, and performs the server's functions and processes data by running or executing the software programs and/or modules stored in the memory 602 and calling the data stored in the memory 602, thereby monitoring the server as a whole. Optionally, the processor 601 may include one or more processing cores. Preferably, the processor 601 may integrate an application processor, which mainly handles the operating system, user interface, application programs, and so on, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 601.
The memory 602 can be used to store software programs and modules. The processor 601 performs various functional applications and data processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area can store data created through use of the server. In addition, the memory 602 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602.
The RF circuit 603 can be used to receive and send signals in the course of sending and receiving information. In particular, after receiving downlink information from a base station, it hands the information to one or more processors 601 for processing; it also sends uplink data to the base station. Generally, the RF circuit 603 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low-noise amplifier (LNA), a duplexer, and so on. In addition, the RF circuit 603 can communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, and Short Messaging Service (SMS).
The server also includes a power supply 604 (such as a battery) that powers the components. Preferably, the power supply is logically connected to the processor 601 through a power management system, so that functions such as charging, discharging, and power consumption management are handled by the power management system. The power supply 604 may also include any of one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and similar components.
The server may also include an input unit 605, which can be used to receive input digits or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
The server may also include a display unit 606, which can be used to display information entered by the user, information provided to the user, and the various graphical user interfaces of the server. These graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof. The display unit 606 may include a display panel, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
Specifically, in this embodiment, the processor 601 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and runs the application programs stored in the memory 602 to implement various functions, as follows:
obtaining service data, the service data including data content and a service type; obtaining a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction; preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content; performing word segmentation on the preprocessed data content according to the word segmentation instruction; and performing forward-index processing on the segmented data content to generate an index file in a unified data format.
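The sequence of functions listed above can be sketched end to end as follows. Everything concrete here is an assumption made for illustration: the `CONFIGS` table, the field names, the reduction of the field preprocessing instruction to a list of fields to drop, and the use of whitespace splitting as the only supported segmentation mode.

```python
# Hypothetical per-service-type configuration files, held in memory as dictionaries.
CONFIGS = {
    "music": {"drop_fields": ["ad_banner"], "segmenter": "whitespace"},
    "news": {"drop_fields": ["tracking_id"], "segmenter": "whitespace"},
}


def preprocess(content, config):
    """Apply the field preprocessing instruction (here: drop the configured fields)."""
    return {k: v for k, v in content.items() if k not in config["drop_fields"]}


def segment(content, config):
    """Apply the word segmentation instruction to every remaining field."""
    if config["segmenter"] != "whitespace":
        raise ValueError("only the whitespace segmenter is sketched here")
    return {field: str(value).lower().split() for field, value in content.items()}


def generate_index_record(service_data):
    """Run the pipeline: look up config by service type, preprocess, segment, emit."""
    config = CONFIGS[service_data["service_type"]]
    cleaned = preprocess(service_data["content"], config)
    tokens = segment(cleaned, config)
    # The same record shape is produced for every service type.
    return {"service_type": service_data["service_type"], "fields": tokens}


record = generate_index_record(
    {"service_type": "music", "content": {"title": "Hello World", "ad_banner": "buy now"}}
)
print(record)
```

Only the configuration differs between service types; the program that walks the steps stays the same, which is the point the embodiment makes about simplifying index construction.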
Preferably, the processor 601 is further configured to: generate configuration files corresponding to the different service types respectively.
Further, obtain the field configuration information corresponding to a service type, the field configuration information indicating the attribute values of multiple preset fields, the fields including a text field, a numeric field, and a category field; and configure the attribute values of the multiple fields as indicated by the field configuration information, to obtain the configuration file corresponding to the service type.
Preferably, the processor 601 is further configured to: determine whether a junk field exists in the data content;
if a junk field exists, delete the junk field from the data content and determine whether the data content after deletion needs to be rewritten; if it does, rewrite the data content after deletion and use the rewritten data content as the preprocessed data content; if it does not, use the data content after deletion as the preprocessed data content;
if no junk field exists, determine whether the data content needs to be rewritten; if it does, rewrite the data content and use the rewritten data content as the preprocessed service data; if it does not, use the data content as the preprocessed data content.
Preferably, the processor 601 is further configured to: determine whether the data content needs to be rewritten;
if it does, rewrite the data content and determine whether a junk field exists in the rewritten data content; if a junk field exists, delete the junk field from the rewritten data content and use the data content after deletion as the preprocessed data content; if no junk field exists, use the rewritten data content as the preprocessed data content;
if it does not, determine whether a junk field exists in the data content; if a junk field exists, delete the junk field from the data content and use the data content after deletion as the preprocessed data content; if no junk field exists, use the data content as the preprocessed data content.
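The two preprocessing orders described above (delete junk fields then rewrite, or rewrite then delete junk fields) can be sketched with one function and a flag. The representation of data content as a dictionary, the `junk_fields` set, and the per-field `rewrite_rules` are assumptions; this section does not define what a rewrite consists of, so the sketch treats it as normalizing a field's value.

```python
def preprocess(content, junk_fields, rewrite_rules, delete_first=True):
    """Apply junk-field deletion and rewriting in either order, each only when needed."""

    def delete_step(data):
        # "Does a junk field exist?" If so, remove every junk field.
        if any(name in junk_fields for name in data):
            return {k: v for k, v in data.items() if k not in junk_fields}
        return data

    def rewrite_step(data):
        # "Does the content need rewriting?" If so, apply the rule for each affected field.
        if any(name in rewrite_rules for name in data):
            return {k: rewrite_rules[k](v) if k in rewrite_rules else v for k, v in data.items()}
        return data

    steps = (delete_step, rewrite_step) if delete_first else (rewrite_step, delete_step)
    for step in steps:
        content = step(content)
    return content


raw = {"title": "  Hello ", "debug": "x"}
print(preprocess(raw, {"debug"}, {"title": str.strip}))
print(preprocess(raw, {"debug"}, {"title": str.strip}, delete_first=False))
```

In this simplified model both orders give the same result; the order would matter only if a rewrite could create or remove junk fields, which is presumably why the source describes both variants.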
Preferably, the processor 601 is further configured to:
configure the attribute values of the attributes of the text field as indicated by the field configuration information, to obtain the configured text field, the attributes of the text field including one or more of description, data length, primary key, importance, and word segmentation mode;
configure the attribute values of the attributes of the numeric field as indicated by the field configuration information, to obtain the configured numeric field, the attributes of the numeric field including one or more of description, data type, permission, importance, and primary key;
configure the attributes of the category field as indicated by the field configuration information, to obtain the configured category field, the attributes of the category field including a retrieval-specified category;
generate the configuration file corresponding to the service type from the configured text field, the configured numeric field, and the configured category field.
Preferably, the processor 601 is further configured to: analyze the preprocessed data content to determine the attribute information of the data content; and segment the preprocessed service data according to the word segmentation instruction and the attribute information, generating the segmented data content.
Further, convert the forward index file into an inverted index file so that users can search according to the inverted index file.
Preferably, the processor 601 is further configured to: obtain a preset word segmentation template; and analyze the preprocessed data content according to the word segmentation template to determine the attribute information of the data content.
It can be understood that the descriptions of the above embodiments each have their own emphasis. For parts not described in detail in a given embodiment, refer to the detailed description of the corresponding index file generation method above; they are not repeated here.
As can be seen from the above, the server provided in this embodiment obtains a corresponding configuration file according to the service type of each kind of service data, then preprocesses the data content of the service data according to the field preprocessing instruction in the configuration file, and segments the preprocessed data content according to the word segmentation instruction in the configuration file, so that an index file in a unified data format is generated. The present invention processes data of different service types with their corresponding configuration files and then uses the same program to segment the data content, normalizing service data in different formats into index data in a unified data format. Index files can therefore be built in a unified way for multiple types of service data, which simplifies the building process and improves efficiency.
The index file generation device provided by the embodiments of the present invention is, for example, a computer, a tablet computer, or a mobile phone with a touch function. The index file generation device and the index file generation method in the foregoing embodiments belong to the same concept: the device can run any of the methods provided in the index file generation method embodiments. For the specific implementation process, refer to the corresponding method embodiments; it is not repeated here.
It should be noted that, for the index file generation method of the present invention, those of ordinary skill in the art will understand that all or part of the flow of the method described in the embodiments can be completed by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium, for example in the memory of a terminal, and executed by at least one processor in the terminal; its execution may include the flow of the embodiments of the index file generation method. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
For the index file generation device of the embodiments of the present invention, the functional modules may be integrated into one processing chip, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The index file generation method and device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. For those skilled in the art, the specific implementations and the scope of application may change in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (18)
1. An index file generation method, characterized in that the method comprises:
obtaining service data, the service data including data content and a service type;
obtaining a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction;
preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content;
performing word segmentation on the preprocessed data content respectively according to the word segmentation instruction;
performing forward-index processing on the segmented data content to generate an index file in a unified data format.
2. The index file generation method according to claim 1, characterized in that, before the obtaining of service data, the method further comprises:
generating configuration files corresponding to different service types respectively.
3. The index file generation method according to claim 2, characterized in that the generating of configuration files corresponding to different service types respectively comprises:
obtaining field configuration information corresponding to a service type, the field configuration information indicating the attribute values of multiple preset fields, the fields including a text field, a numeric field, and a category field;
configuring the attribute values of the multiple fields as indicated by the field configuration information, to obtain the configuration file corresponding to the service type.
4. The index file generation method according to any one of claims 1 to 3, characterized in that the preprocessing of the data content according to the field preprocessing instruction to generate preprocessed data content comprises:
determining whether a junk field exists in the data content;
if a junk field exists, deleting the junk field from the data content and determining whether the data content after deletion needs to be rewritten; if it does, rewriting the data content after deletion and using the rewritten data content as the preprocessed data content; if it does not, using the data content after deletion as the preprocessed data content;
if no junk field exists, determining whether the data content needs to be rewritten; if it does, rewriting the data content and using the rewritten data content as the preprocessed service data; if it does not, using the data content as the preprocessed data content.
5. The index file generation method according to any one of claims 1 to 3, characterized in that the preprocessing of the data content according to the field preprocessing instruction to generate preprocessed data content comprises:
determining whether the data content needs to be rewritten;
if it does, rewriting the data content and determining whether a junk field exists in the rewritten data content; if a junk field exists, deleting the junk field from the rewritten data content and using the data content after deletion as the preprocessed data content; if no junk field exists, using the rewritten data content as the preprocessed data content;
if it does not, determining whether a junk field exists in the data content; if a junk field exists, deleting the junk field from the data content and using the data content after deletion as the preprocessed data content; if no junk field exists, using the data content as the preprocessed data content.
6. The index file generation method according to claim 3, characterized in that the configuring of the attribute values of the multiple fields as indicated by the field configuration information, to obtain the configuration file corresponding to the service type, comprises:
configuring the attribute values of the attributes of the text field as indicated by the field configuration information, to obtain the configured text field, the attributes of the text field including one or more of description, data length, primary key, importance, and word segmentation mode;
configuring the attribute values of the attributes of the numeric field as indicated by the field configuration information, to obtain the configured numeric field, the attributes of the numeric field including one or more of description, data type, permission, importance, and primary key;
configuring the attributes of the category field as indicated by the field configuration information, to obtain the configured category field, the attributes of the category field including a retrieval-specified category;
generating the configuration file corresponding to the service type from the configured text field, the configured numeric field, and the configured category field.
7. The index file generation method according to any one of claims 1 to 3, characterized in that the step of performing word segmentation on the preprocessed data content respectively according to the word segmentation instruction comprises:
analyzing the preprocessed data content to determine the attribute information of the data content;
segmenting the preprocessed service data according to the word segmentation instruction and the attribute information, generating the segmented data content.
8. The index file generation method according to claim 7, characterized in that, after the performing of forward-index processing on the segmented data content to generate a forward index file in a unified data format, the method further comprises:
converting the forward index file into an inverted index file so that users can search according to the inverted index file.
9. The index file generation method according to claim 7, characterized in that the analyzing of the preprocessed data content to determine the attribute information of the data content comprises:
obtaining a preset word segmentation template;
analyzing the preprocessed data content according to the word segmentation template to determine the attribute information of the data content.
10. An index file generation device, characterized in that the device comprises:
a first acquisition module, configured to obtain service data, the service data including data content and a service type;
a second acquisition module, configured to obtain a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction;
a preprocessing module, configured to preprocess the data content according to the field preprocessing instruction to generate preprocessed data content;
a word segmentation module, configured to perform word segmentation on the preprocessed data content respectively according to the word segmentation instruction;
an index generation module, configured to perform forward-index processing on the segmented data content to generate an index file in a unified data format.
11. The index file generation device according to claim 10, characterized in that the device further comprises: a configuration file generation module, configured to generate, before the service data is obtained, configuration files corresponding to different service types respectively.
12. The index file generation device according to claim 11, characterized in that the configuration file generation module comprises:
an acquisition unit, configured to obtain field configuration information corresponding to a service type, the field configuration information indicating the attribute values of multiple preset fields, the fields including one or more of a text field, a numeric field, and a category field;
a configuration unit, configured to configure the attribute values of the multiple fields as indicated by the field configuration information, to obtain the configuration file corresponding to the service type.
13. The index file generation device according to any one of claims 10 to 12, characterized in that the preprocessing module comprises:
a first determining unit, configured to determine whether a junk field exists in the data content;
a first processing unit, configured to, if a junk field exists, delete the junk field from the data content and determine whether the data content after deletion needs to be rewritten; if it does, rewrite the data content after deletion and use the rewritten data content as the preprocessed data content; if it does not, use the data content after deletion as the preprocessed data content;
a second processing unit, configured to, if no junk field exists, determine whether the data content needs to be rewritten; if it does, rewrite the data content and use the rewritten data content as the preprocessed service data; if it does not, use the data content as the preprocessed data content.
14. The index file generation device according to any one of claims 10 to 12, characterized in that the preprocessing module comprises:
a second determining unit, configured to determine whether the data content needs to be rewritten;
a third processing unit, configured to, if rewriting is needed, rewrite the data content and determine whether a junk field exists in the rewritten data content; if a junk field exists, delete the junk field from the rewritten data content and use the data content after deletion as the preprocessed data content; if no junk field exists, use the rewritten data content as the preprocessed data content;
a fourth processing unit, configured to, if rewriting is not needed, determine whether a junk field exists in the data content; if a junk field exists, delete the junk field from the data content and use the data content after deletion as the preprocessed data content; if no junk field exists, use the data content as the preprocessed data content.
15. The index file generation device according to claim 12, characterized in that the configuration unit comprises:
a first configuration subunit, configured to configure the attribute values of the attributes of the text field as indicated by the field configuration information, to obtain the configured text field, the attributes of the text field including one or more of description, data length, primary key, importance, and word segmentation mode;
a second configuration subunit, configured to configure the attribute values of the attributes of the numeric field as indicated by the field configuration information, to obtain the configured numeric field, the attributes of the numeric field including one or more of description, data type, permission, importance, and primary key;
a third configuration subunit, configured to configure the attributes of the category field as indicated by the field configuration information, to obtain the configured category field, the attributes of the category field including a retrieval-specified category;
a generation subunit, configured to generate the configuration file corresponding to the service type from the configured text field, the configured numeric field, and the configured category field.
16. The index file generation device according to any one of claims 10 to 12, characterized in that the word segmentation module comprises:
an attribute information determining unit, configured to analyze the preprocessed data content to determine the attribute information of the data content;
a word segmentation unit, configured to segment the preprocessed service data according to the word segmentation instruction and the attribute information, generating the segmented data content.
17. The index file generation device according to claim 16, characterized in that the device further comprises:
a conversion module, configured to convert the forward index file into an inverted index file so that users can search according to the inverted index file.
18. The index file generation device according to claim 16, characterized in that the attribute information determining unit comprises:
an acquisition subunit, configured to obtain a preset word segmentation template;
a determining subunit, configured to analyze the preprocessed data content according to the word segmentation template to determine the attribute information of the data content.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510039519.5A CN105988996B (en) | 2015-01-27 | 2015-01-27 | Index file generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105988996A true CN105988996A (en) | 2016-10-05 |
CN105988996B CN105988996B (en) | 2020-04-10 |
Family
ID=57034424
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510039519.5A Active CN105988996B (en) | 2015-01-27 | 2015-01-27 | Index file generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105988996B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107179953A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of index file generation method, apparatus and system |
CN107256206A (en) * | 2017-05-24 | 2017-10-17 | 北京京东尚科信息技术有限公司 | The method and apparatus of character stream format conversion |
CN108062297A (en) * | 2017-11-22 | 2018-05-22 | 万兴科技股份有限公司 | A kind of creation method, creating device and the terminal device of pdf document textview field |
CN108241713A (en) * | 2016-12-27 | 2018-07-03 | 南京烽火软件科技有限公司 | A kind of inverted index search method based on polynary cutting |
CN109241098A (en) * | 2018-08-08 | 2019-01-18 | 南京中新赛克科技有限责任公司 | A kind of enquiring and optimizing method of distributed data base |
CN109327321A (en) * | 2017-08-01 | 2019-02-12 | 中兴通讯股份有限公司 | Network model business executes method, apparatus, SDN controller and readable storage medium storing program for executing |
CN109783444A (en) * | 2018-12-26 | 2019-05-21 | 亚信科技(中国)有限公司 | Multichannel file index method, device, computer equipment and storage medium |
CN110427368A (en) * | 2019-07-12 | 2019-11-08 | 深圳绿米联创科技有限公司 | Data processing method, device, electronic equipment and storage medium |
CN110489417A (en) * | 2019-07-25 | 2019-11-22 | 深圳壹账通智能科技有限公司 | Data processing method and related equipment |
CN110990126A (en) * | 2019-12-12 | 2020-04-10 | 北京明略软件系统有限公司 | Method and device for realizing shortcut front-end service page based on js |
CN113468393A (en) * | 2021-06-09 | 2021-10-01 | 北京达佳互联信息技术有限公司 | Index generation method and device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567418A (en) * | 2010-12-23 | 2012-07-11 | 北大方正集团有限公司 | Methods and devices for integrating and searching data |
US20140032703A1 (en) * | 2008-05-30 | 2014-01-30 | Matthew A. Wormley | System and method for an expandable computer storage system |
CN103823799A (en) * | 2012-11-16 | 2014-05-28 | 镇江诺尼基智能技术有限公司 | New-generation industry knowledge full-text search method |
CN104199977A (en) * | 2014-09-24 | 2014-12-10 | 浪潮软件股份有限公司 | Method for creating information search based on data in database |
- 2015-01-27: CN application CN201510039519.5A, patent CN105988996B (en), status Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140032703A1 (en) * | 2008-05-30 | 2014-01-30 | Matthew A. Wormley | System and method for an expandable computer storage system |
CN102567418A (en) * | 2010-12-23 | 2012-07-11 | 北大方正集团有限公司 | Methods and devices for integrating and searching data |
CN103823799A (en) * | 2012-11-16 | 2014-05-28 | 镇江诺尼基智能技术有限公司 | New-generation industry knowledge full-text search method |
CN104199977A (en) * | 2014-09-24 | 2014-12-10 | 浪潮软件股份有限公司 | Method for creating information search based on data in database |
Non-Patent Citations (1)
Title |
---|
Xu Shuzhen (徐树振) et al.: "Research on Enterprise Unstructured Data Retrieval" (企业非结构化数据检索研究), Information Technology (《信息技术》) * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108241713B (en) * | 2016-12-27 | 2021-12-28 | 南京烽火星空通信发展有限公司 | Inverted index retrieval method based on multi-element segmentation |
CN108241713A (en) * | 2016-12-27 | 2018-07-03 | 南京烽火软件科技有限公司 | Inverted index retrieval method based on multi-element segmentation |
CN107179953B (en) * | 2017-03-31 | 2020-04-03 | 北京奇艺世纪科技有限公司 | Index file generation method, device and system |
CN107179953A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | Index file generation method, device and system |
CN107256206A (en) * | 2017-05-24 | 2017-10-17 | 北京京东尚科信息技术有限公司 | Method and device for converting character stream format |
CN107256206B (en) * | 2017-05-24 | 2021-04-30 | 北京京东尚科信息技术有限公司 | Method and device for converting character stream format |
CN109327321B (en) * | 2017-08-01 | 2021-10-15 | 中兴通讯股份有限公司 | Network model service execution method and device, SDN controller and readable storage medium |
CN109327321A (en) * | 2017-08-01 | 2019-02-12 | 中兴通讯股份有限公司 | Network model service execution method and device, SDN controller and readable storage medium |
CN108062297A (en) * | 2017-11-22 | 2018-05-22 | 万兴科技股份有限公司 | PDF file text field creating method and device and terminal equipment |
CN108062297B (en) * | 2017-11-22 | 2021-06-15 | 深圳市亿图软件有限公司 | PDF file text field creating method and device and terminal equipment |
CN109241098A (en) * | 2018-08-08 | 2019-01-18 | 南京中新赛克科技有限责任公司 | Query optimization method for distributed database |
CN109241098B (en) * | 2018-08-08 | 2022-02-18 | 南京中新赛克科技有限责任公司 | Query optimization method for distributed database |
CN109783444A (en) * | 2018-12-26 | 2019-05-21 | 亚信科技(中国)有限公司 | Multichannel file index method, device, computer equipment and storage medium |
CN110427368A (en) * | 2019-07-12 | 2019-11-08 | 深圳绿米联创科技有限公司 | Data processing method, device, electronic equipment and storage medium |
CN110427368B (en) * | 2019-07-12 | 2022-07-12 | 深圳绿米联创科技有限公司 | Data processing method and device, electronic equipment and storage medium |
WO2021012553A1 (en) * | 2019-07-25 | 2021-01-28 | 深圳壹账通智能科技有限公司 | Data processing method and related device |
CN110489417A (en) * | 2019-07-25 | 2019-11-22 | 深圳壹账通智能科技有限公司 | Data processing method and related equipment |
CN110489417B (en) * | 2019-07-25 | 2023-03-28 | 深圳壹账通智能科技有限公司 | Data processing method and related equipment |
CN110990126A (en) * | 2019-12-12 | 2020-04-10 | 北京明略软件系统有限公司 | Method and device for realizing shortcut front-end service page based on js |
CN113468393A (en) * | 2021-06-09 | 2021-10-01 | 北京达佳互联信息技术有限公司 | Index generation method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN105988996B (en) | 2020-04-10 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN105988996A (en) | Index file generation method and device | |
CN103353899B (en) | Accurate search method for integrated information | |
US20110179061A1 (en) | Extraction and Publication of Reusable Organizational Knowledge | |
CN101578617A (en) | Method, apparatus and computer program product for making semantic annotations for easy file organization and search | |
CN104516902A (en) | Semantic information acquisition method and corresponding keyword extension method and search method | |
CN102665231B (en) | Method of automatically generating parameter configuration file for LTE (Long Term Evolution) system | |
CN106201890B (en) | Performance optimization method and server for an application | |
CN107305527B (en) | Code file processing method and device | |
CN111651453B (en) | User history behavior query method and device, electronic equipment and storage medium | |
CN107391509A (en) | Label recommendation method and device | |
CN101281430A (en) | Apparatus with expression symbol associating input function and associating input method | |
CN110390569A (en) | Content promotion method, device and storage medium | |
CN105677148A (en) | Terminal application searching method and device | |
CN110928917A (en) | Target user determination method and device, computing equipment and medium | |
CN104834759A (en) | Realization method and device for electronic design | |
CN104412262A (en) | Method and apparatus for providing task-based service recommendations | |
CN110069769A (en) | Using label generating method, device and storage equipment | |
CN107423291A (en) | Data translation method and client device | |
CN109003012B (en) | Goods location recommendation link information acquisition method, goods location recommendation method, device and system | |
CN106201198B (en) | Terminal application search method and device, and mobile terminal | |
CN110489032B (en) | Dictionary query method for electronic book and electronic equipment | |
US10289740B2 (en) | Computer systems to outline search content and related methods therefor | |
CN104424300A (en) | Personalized search suggestion method and device | |
CN105991312B (en) | A kind of rearrangement and device of Internet resources | |
CN110399337B (en) | File automation service method and system based on data driving |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||