CN105988996A - Index file generation method and device - Google Patents


Info

Publication number
CN105988996A
Authority
CN
China
Prior art keywords
field
data content
data
configuration
index file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510039519.5A
Other languages
Chinese (zh)
Other versions
CN105988996B (en)
Inventor
朱锴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201510039519.5A
Publication of CN105988996A
Application granted
Publication of CN105988996B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an index file generation method and device. The method includes: acquiring service data, the service data including data content and a service type; acquiring a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction; preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content; performing word segmentation on the preprocessed data content according to the word segmentation instruction; and performing forward-index processing on the segmented data content to generate an index file in a unified data format. The method and device can build index files uniformly for data of multiple service types, which simplifies the build process and improves efficiency.

Description

Index file generation method and device
Technical field
The present invention belongs to the field of communication technology, and in particular relates to an index file generation method and device.
Background
With the rapid development of computer and Internet technology, the amount of information stored on network devices keeps growing. To make this information easier to query, index files are commonly built to help users access it.
In the prior art, a forward index file is usually generated for the service type of the data to be retrieved, and the forward index file is then inverted to obtain an inverted index file, so that users can retrieve data of that service type through the inverted index file. Because data of different service types differ in factors such as the keywords involved, the prior art requires a separate index generation system to be set up for each service type in order to generate indexes for user retrieval.
In researching and practicing the prior art, the inventor found that an existing index generation system can serve only one service type. Where there are many service types, multiple index generation systems must be built; building such a system demands a high level of expertise from operators, and the whole process is time-consuming and inefficient.
Summary of the invention
An object of the present invention is to provide an index file generation method and device that can build index files for multiple types of service data, simplifying the build process and improving efficiency.
To solve the above technical problem, embodiments of the present invention provide the following technical solutions:
A first aspect of the present invention provides an index file generation method, the method comprising:
acquiring service data, the service data including data content and a service type;
acquiring a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction;
preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content;
performing word segmentation on the preprocessed data content according to the word segmentation instruction;
performing forward-index processing on the segmented data content to generate an index file in a unified data format.
To solve the above technical problem, embodiments of the present invention provide the following technical solutions:
A second aspect of the present invention provides an index file generation device, the device comprising:
a first acquisition module, configured to acquire service data, the service data including data content and a service type;
a second acquisition module, configured to acquire a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction;
a preprocessing module, configured to preprocess the data content according to the field preprocessing instruction to generate preprocessed data content;
a word segmentation module, configured to perform word segmentation on the preprocessed data content according to the word segmentation instruction;
an index generation module, configured to perform forward-index processing on the segmented data content to generate an index file in a unified data format.
Compared with the prior art, in this embodiment a corresponding configuration file is acquired according to the service type of each kind of service data; the data content of the service data is then preprocessed according to the field preprocessing instruction in the configuration file, and the preprocessed data content is segmented according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The invention processes data of different service types with their corresponding configuration files and then segments the data content with the same program, normalizing service data of different formats into index data in a unified data format. Index files can therefore be built uniformly for multiple types of service data, which simplifies the build process and improves efficiency.
Brief description of the drawings
The technical solutions and other beneficial effects of the present invention will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawings.
Fig. 1 is a schematic flowchart of the index file generation method provided by the first embodiment of the present invention;
Fig. 2a is a schematic flowchart of the index file generation method provided by the second embodiment of the present invention;
Fig. 2b and Fig. 2c are schematic diagrams of the field configuration interfaces of the index file generation method provided by the present invention;
Fig. 3a and Fig. 3b are schematic flowcharts of the index file generation method provided by the third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the index file generation device provided by the fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the index file generation device provided by the fifth embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the server provided by the sixth embodiment of the present invention.
Detailed description
Refer to the drawings, in which identical reference numerals denote identical components. The principles of the present invention are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the invention and should not be construed as limiting other specific embodiments not detailed herein.
In the following description, specific embodiments of the invention are described with reference to steps and symbols performed by one or more computers, unless otherwise stated. These steps and operations, which are at times referred to as being computer-executed, include the manipulation by the computer's processing unit of electronic signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well understood by those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the data format. However, although the principles of the invention are described in the foregoing terms, this is not meant to be limiting, and those skilled in the art will appreciate that various steps and operations described below may also be implemented in hardware.
The principles of the present invention operate with numerous other general-purpose or special-purpose computing or communication environments or configurations. Examples of well-known computing systems, environments, and configurations suitable for use with the invention include, but are not limited to, hand-held phones, personal computers, servers, multiprocessor systems, microprocessor-based systems, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The term "module" as used herein may be regarded as a software object executing on the computing system. The different components, modules, engines, and services described herein may be regarded as implementation objects on the computing system. The device and method described herein are preferably implemented in software, but may of course also be implemented in hardware, all within the scope of protection of the present invention.
The word "preferred" as used herein means serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X uses A or B" means any of the natural inclusive permutations: if X uses A, X uses B, or X uses both A and B, then "X uses A or B" is satisfied in any of these cases.
Moreover, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to those skilled in the art based on a reading and understanding of this specification and the drawings. The disclosure includes all such modifications and alterations and is limited only by the scope of the following claims. In particular, with regard to the various functions performed by the above-described components (such as elements and resources), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component that performs the specified function of the described component (that is, one that is functionally equivalent), even if it is not structurally equivalent to the disclosed structure that performs the function in the exemplary implementations of the disclosure shown herein. In addition, although a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such a feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes", "having", "containing", or variants thereof are used in the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".
First embodiment
Refer to Fig. 1, which is a schematic flowchart of the index file generation method provided by the first embodiment of the present invention. The method includes the following steps:
In step S101, service data is acquired, the service data including data content and a service type.
The index file generation method is based on a B/S (browser/server) architecture: the user uses the system through a browser, and the system supports generating index data in a unified data format for data of multiple service types on the same platform.
In this embodiment, the service types may include video, music, pictures, and so on; correspondingly, the service data may include video data, music data, picture data, and so on, which is not specifically limited here.
It can be understood that the data format of the service data in this embodiment can be divided into two parts: one part carries service type indication information, and the other part carries the data content corresponding to that service type.
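As an illustration of the two-part data format described above, the following minimal Python sketch models one item of service data. The class name `ServiceData`, the function `parse_service_data`, and the dictionary keys are hypothetical and chosen only for this example; the patent does not prescribe a concrete encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceData:
    """One item of service data: a service type indicator plus its data content."""
    service_type: str   # e.g. "video", "music", "picture"
    content: dict       # field name -> raw field value


def parse_service_data(record: dict) -> ServiceData:
    # Assumed input layout: {"service_type": ..., "content": {...}}
    return ServiceData(record["service_type"], dict(record["content"]))


item = parse_service_data(
    {"service_type": "music", "content": {"title": "Some Song", "downloads": "120"}}
)
```

Keeping the service type separate from the content is what lets a later stage choose the matching configuration file without inspecting the content itself.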
In step S102, a corresponding configuration file is acquired according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction.
It can be understood that each service type corresponds to one configuration file, which the user configures in advance according to the characteristics of the service type in actual operation and stores on the server.
The configuration file contains an instruction for preprocessing the fields of the data content and an instruction for segmenting the fields of the data content. It is generated from the user's configuration of the fields of each kind of service data; the configuration of the fields is not specifically limited here.
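The lookup of a configuration file by service type can be sketched as a simple registry. This is a minimal illustration under stated assumptions: the dictionary `CONFIG_FILES`, its keys, and the function `get_config` are hypothetical stand-ins for configuration files stored on the server, not a format defined by the patent.

```python
# Hypothetical in-memory stand-in for the configuration files stored on the server.
CONFIG_FILES = {
    "music": {
        "field_preprocessing": {"title": ["strip"]},
        "word_segmentation": {"title": "normal"},
    },
    "video": {
        "field_preprocessing": {"name": ["strip"]},
        "word_segmentation": {"name": "prefix"},
    },
}


def get_config(service_type: str) -> dict:
    """Return the configuration registered for a service type."""
    try:
        return CONFIG_FILES[service_type]
    except KeyError:
        raise ValueError(
            f"no configuration file for service type {service_type!r}"
        ) from None
```

Supporting a new service type then amounts to adding one more entry rather than building a separate index generation system.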
In step S103, the data content is preprocessed according to the field preprocessing instruction to generate preprocessed data content.
In step S104, word segmentation is performed on the preprocessed data content according to the word segmentation instruction.
In step S105, forward-index processing is performed on the segmented data content to generate an index file in a unified data format.
It can be understood that steps S103 to S105 may specifically include:
Because each service type corresponds to one configuration file, each service type has its own field preprocessing instruction, and the data content of each service type is preprocessed according to that instruction, which reflects the individualized differences between service types. After preprocessing, processing proceeds according to the platform's preset segmentation templates and the preset word segmentation instruction; that is, a normalization operation is performed. The service data of different formats is sent to the forward-index processing unit (FSU, Forward Sort Unit) for forward index generation and normalized to a unified data format, yielding normalized forward-index data that supports data retrieval across multiple service types.
As can be seen from the above, in this embodiment a corresponding configuration file is acquired according to the service type of each kind of service data; the data content of the service data is then preprocessed according to the field preprocessing instruction in the configuration file, and the preprocessed data content is segmented according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The invention processes data of different service types with their corresponding configuration files and then segments the data content with the same program, normalizing service data of different formats into index data in a unified data format. Index files can therefore be built uniformly for multiple types of service data, which simplifies the build process and improves efficiency.
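Steps S103 to S105 can be sketched end to end for a single document as follows. This is an illustrative sketch only: the function names, the configuration keys, the two preprocessing operations, and the whitespace segmenter are all assumptions made for the example, and a real system would substitute a proper word segmenter.

```python
def preprocess(content: dict, instructions: dict) -> dict:
    """Apply the per-field preprocessing operations named in the configuration."""
    operations = {"strip": str.strip, "lower": str.lower}
    result = dict(content)
    for field_name, op_names in instructions.items():
        for op_name in op_names:
            result[field_name] = operations[op_name](result[field_name])
    return result


def segment(text: str) -> list:
    # Placeholder segmenter: splits on whitespace. A real system would use a
    # proper word segmenter, especially for Chinese text.
    return text.split()


def build_forward_record(service_type: str, content: dict, config: dict) -> dict:
    """Steps S103 to S105 for one document, producing a record in one unified shape."""
    cleaned = preprocess(content, config["field_preprocessing"])
    terms = {name: segment(cleaned[name]) for name in config["word_segmentation"]}
    return {"service_type": service_type, "terms": terms}


config = {"field_preprocessing": {"title": ["strip", "lower"]},
          "word_segmentation": ["title"]}
record = build_forward_record("music", {"title": "  Hello World "}, config)
```

Only the configuration differs between service types; the segmentation and record-building code is shared, which is the normalization the embodiment describes.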
Second embodiment
Refer to Fig. 2a, which is a schematic flowchart of the index file generation method provided by the second embodiment of the present invention. The index file generation method provided by the invention is based on a B/S (browser/server) architecture: the user uses the system through a browser, and the system supports generating index data in a unified data format for data of multiple service types on the same platform.
This embodiment focuses on the generation of the configuration file, that is, the configuration of field attribute values. The method includes the following steps:
In step S201, configuration files corresponding to the different service types are generated.
It can be understood that each service type corresponds to one configuration file. The service types may include video, music, pictures, and so on; correspondingly, the service data includes video data, music data, and picture data.
In this embodiment, the configuration file is configured in advance by the user according to the characteristics of the service type in actual operation and stored on the server. The configuration file contains an instruction for preprocessing the fields of the data content and an instruction for segmenting the fields of the data content.
In a preferred embodiment, the configuration file may be obtained through the following steps:
Step (1): acquire field configuration information corresponding to the service type.
The field configuration information indicates the attribute values of multiple preset fields, the fields including text-domain fields, numeric-domain fields, and category-domain fields.
It can be understood that the service data of the present invention includes data content and a service type. The data content includes multiple documents, and each document consists of multiple fields. The field types can be preset and include text-domain fields, numeric-domain fields, and category-domain fields.
Further, a text-domain field is a field of plain text information, for example "I like this singer" or "this song is really pleasant to listen to". A numeric-domain field is a field of digits or characters that represent a numeric value, for example "1", "5", "one", or "five". A category-domain field is a field that indicates how the data is classified; for example, a song may be classified as "rock" or "jazz", and a video may be classified as "film", "variety show", or "news".
In addition, each field has at least one attribute, also called a configuration item. The attribute values are presented as selection boxes for the user to choose from and configure.
Step (2): configure the attribute values of the multiple fields as indicated by the field configuration information to obtain the configuration file corresponding to the service type.
That is, the configuration file corresponding to the service type is obtained from the user's configuration of the attribute values of each type of field.
On this basis, in a further preferred embodiment, the attribute values of the multiple fields may be configured in the following specific manner; that is, step (2) may specifically include:
Step (21): configure the attribute values of the text-domain field as indicated by the field configuration information to obtain the configured text-domain field.
In this embodiment, a text-domain field mainly contains textual information and is a field that the user wants to be searchable. The attributes of a text-domain field may include one or more of description, data length, primary key, importance, and segmentation mode.
Refer also to Fig. 2b and Fig. 2c: Fig. 2b is a schematic diagram of the field attribute configuration interface, and Fig. 2c is a schematic diagram of the user-defined field management interface. The meaning of each attribute of the text-domain field is briefly explained below:
A. Description: the meaning of the field. It serves only as a hint and does not affect search results.
B. Data length: the maximum length of the field's text. Fields are currently divided into two tiers according to whether they exceed 256 bytes; a field whose maximum length exceeds 256 bytes is called a long text field. Among all text-domain fields, only one can be configured as a long text field.
C. Primary key: the field used to uniquely identify a document, referred to as doc_id. The field must hold a value that can be converted to a number; specifically, the value of doc_id is a 64-bit integer. Because this value should be uniformly distributed over the uint64_t space, it is preferably produced by a hash value or similar. A hash value is a number computed from the data content by logical operations; different documents yield different hash values (barring rare collisions), so the hash value serves as each document's identity.
D. Importance: the degree of importance of the text field, which may be "important", "normal", or "unimportant".
E. Segmentation mode: either normal segmentation or prefix segmentation. Normal segmentation splits the text according to its natural semantics and is usually the default choice. Prefix segmentation suits the suggestion drop-down of a search box. For example, the phrase "internal search platform department" can be split into successively longer prefixes such as "internal", "internal search", and so on, so that when the user types "internal" in the search box, "internal search platform department" can be suggested.
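Prefix segmentation as described in item E can be sketched as follows. This is a minimal illustration that assumes character-level prefixes; the function names `prefix_segment` and `suggest` and the sample phrases are hypothetical, and the granularity a production segmenter would use is not specified in the text.

```python
def prefix_segment(text: str) -> list:
    """Return every leading prefix of text, shortest first (character-level assumption)."""
    return [text[:end] for end in range(1, len(text) + 1)]


def suggest(typed: str, phrases: list) -> list:
    """Return the phrases whose prefix tokens contain what the user typed."""
    return [phrase for phrase in phrases if typed in prefix_segment(phrase)]


phrases = ["internal search platform", "international news"]
```

With these sample phrases, `suggest("intern", phrases)` matches both entries, while `suggest("internal", phrases)` matches only the first, which mirrors the search-box suggestion behavior described above.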
It can be understood that the word segmentation instruction can be obtained from the configuration of this segmentation mode, so that the data content is segmented according to the word segmentation instruction.
Step (22): configure the attribute values of the numeric-domain field as indicated by the field configuration information to obtain the configured numeric-domain field.
In this embodiment, the attributes of a numeric-domain field include one or more of description, data type, authority, importance, and primary key.
A numeric-domain field is suitable for numeric information such as price or download count. The string value in this field must be convertible to a number. The meaning of each attribute of the numeric-domain field is briefly explained below:
A. Description: the meaning of the field. It serves only as a hint and does not affect search results.
B. Data type: in this embodiment, the configuration item offers int8, uint8, int16, uint16, int32, uint32, int64, uint64, and float. The user chooses according to the largest range the value may take; if an actual value in the supplied data exceeds the configured range, an error occurs.
C. Authority: indicates that the field reflects the authority of the document. For example, for video search, the view count may be chosen as the authority field. Only zero or one numeric-domain field can be designated as the authority field.
D. Importance: the degree of importance of the field, which may be "important", "normal", or "unimportant".
E. Primary key: defined in the same way as the primary key of a text-domain field, that is, the field used to uniquely identify a document, referred to as doc_id. The field must hold a value that can be converted to a number; specifically, the value of doc_id is a 64-bit integer. Because this value should be uniformly distributed over the uint64_t space, it is preferably produced by a hash value or similar.
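Two points from the numeric-domain attributes above lend themselves to a short sketch: rejecting values that fall outside the configured data type, and deriving a 64-bit doc_id from a hash. The function names are hypothetical, and the choice of SHA-256 truncated to 8 bytes is an assumption made for illustration; the text only says that a hash value is preferred.

```python
import hashlib

# Inclusive value ranges of the integer types listed above.
INTEGER_RANGES = {
    "int8": (-2**7, 2**7 - 1), "uint8": (0, 2**8 - 1),
    "int16": (-2**15, 2**15 - 1), "uint16": (0, 2**16 - 1),
    "int32": (-2**31, 2**31 - 1), "uint32": (0, 2**32 - 1),
    "int64": (-2**63, 2**63 - 1), "uint64": (0, 2**64 - 1),
}


def convert_numeric(raw: str, data_type: str):
    """Convert a string field value and reject values outside the configured type."""
    if data_type == "float":
        return float(raw)  # note: Python floats are double precision
    low, high = INTEGER_RANGES[data_type]
    value = int(raw)
    if not low <= value <= high:
        raise OverflowError(f"{value} does not fit in {data_type}")
    return value


def make_doc_id(key: str) -> int:
    """Derive a 64-bit doc_id from the first 8 bytes of a SHA-256 digest (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

A cryptographic digest is used here only because its output is spread evenly over the 64-bit space; collisions remain possible in principle, so a system relying on doc_id uniqueness may still want to detect them.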
Step (23): configure the attributes of the category-domain field as indicated by the field configuration information to obtain the configured category-domain field.
In this embodiment, the attributes of a category-domain field include the category specified for retrieval.
Step (24): generate the configuration file corresponding to the service type from the configured text-domain field, the configured numeric-domain field, and the configured category-domain field.
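Steps (21) to (24) can be sketched as collecting the configured fields, checking the constraints stated above (at most one long text field, at most one authority field), and serializing the result. This is a sketch under stated assumptions: the `FieldConfig` class, the attribute keys such as `long_text` and `authority`, and the JSON layout are hypothetical, since the patent does not define the configuration file format.

```python
import json
from dataclasses import asdict, dataclass, field

DOMAINS = {"text", "numeric", "category"}


@dataclass
class FieldConfig:
    name: str
    domain: str                                   # one of DOMAINS
    attributes: dict = field(default_factory=dict)


def build_config_file(service_type: str, fields: list) -> str:
    """Validate the configured fields and serialize them as a JSON configuration file."""
    for item in fields:
        if item.domain not in DOMAINS:
            raise ValueError(f"unknown field domain: {item.domain}")
    long_text = [f for f in fields
                 if f.domain == "text" and f.attributes.get("long_text")]
    authority = [f for f in fields
                 if f.domain == "numeric" and f.attributes.get("authority")]
    if len(long_text) > 1 or len(authority) > 1:
        raise ValueError("at most one long text field and one authority field are allowed")
    return json.dumps(
        {"service_type": service_type, "fields": [asdict(f) for f in fields]},
        ensure_ascii=False, indent=2,
    )


music_config = build_config_file("music", [
    FieldConfig("title", "text", {"segmentation": "normal", "importance": "important"}),
    FieldConfig("downloads", "numeric", {"data_type": "uint32", "authority": True}),
    FieldConfig("genre", "category"),
])
```

Validating at configuration time means a misconfigured service type is rejected before any service data is processed.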
In step S202, service data is acquired.
The service data includes data content and a service type. The service types may include video, music, pictures, and so on; correspondingly, the service data may include video data, music data, picture data, and so on, which is not specifically limited here.
It can be understood that after the configuration files corresponding to the different service types are generated, they are preset on the server. Thereafter, once the user's service data is acquired, the server is triggered to retrieve, according to the service type, the matching configuration file from the preset configuration files, and then processes the data according to that configuration file to generate the index file.
In step S203, a corresponding configuration file is acquired according to the service type.
It can be understood that each service type corresponds to one configuration file, which is generated in advance from the user's configuration information in step S201 and stored on the server.
In step S204, the data content is preprocessed according to the field preprocessing instruction to generate preprocessed data content.
Because each service type corresponds to one configuration file, each service type has its own field preprocessing instruction, and the data content of each service type is preprocessed according to that instruction, for example by rewriting certain fields of the pushed service data, cleaning the data, or supplementing data labels. This reflects the individualized differences between service types.
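The three kinds of preprocessing mentioned above (field rewriting, data cleaning, and supplementing labels) can be sketched for one service type as follows. The function names, the field names `downloads` and `tags`, and the specific rules are hypothetical examples; the actual operations would come from each service type's configuration file.

```python
import re


def clean_text(value: str) -> str:
    """Data cleaning: collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


def preprocess_music(content: dict) -> dict:
    result = {name: clean_text(value) for name, value in content.items()}
    # Field rewrite: drop thousands separators so the value converts to a number.
    result["downloads"] = result.get("downloads", "0").replace(",", "")
    # Supplementary label: tag the document with its service type.
    result.setdefault("tags", "music")
    return result


# One preprocessing routine per service type (only music is sketched here).
PREPROCESSORS = {"music": preprocess_music}

cleaned = PREPROCESSORS["music"]({"title": "  My   Song ", "downloads": "1,200"})
```

Dispatching through a per-service-type table keeps the type-specific logic isolated, so everything after preprocessing can be shared across service types.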
In step S205, the preprocessed data content is analyzed to determine attribute information of the data content.
In some embodiments, a preset segmentation template may be acquired, and the preprocessed data content analyzed according to the segmentation template to determine the attribute information of the data content. The server is preconfigured with multiple segmentation templates, which may include data templates for multiple service types. For music data, for example, the data template may include a singer database, a song title database, and a genre database; analyzing the data content against them reveals its attribute information. For example, if the data content belongs to the music service type, the attribute information refers to numeric attributes such as the song's download count and play count.
In step S206, the preprocessed service data is segmented according to the word segmentation instruction and the attribute information, and forward-index processing is performed on the segmented data content to generate a forward index file in a unified data format.
After preprocessing, processing proceeds according to the attribute information and the preset word segmentation instruction; that is, a normalization operation is performed. The service data of different formats is normalized to a unified data format, yielding normalized forward-index data that supports data retrieval across multiple service types.
It can be understood that after preprocessing, the data enters the forward-index processing unit (FSU) for forward index generation. Using the word segmentation instruction configured in the configuration file and the several built-in segmentation templates, the data is processed to compute the information that search needs, such as wordid and word POS information, and a forward index file in a unified format is finally output.
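Forward index generation can be sketched as mapping each document to the list of its terms, each with a wordid and a position. This sketch makes two loud assumptions: "word POS information" is interpreted here as the position of the word in the document (it could also mean part of speech), and wordids are assigned sequentially from a vocabulary, which the text does not specify. The function name `build_forward_index` is hypothetical.

```python
def build_forward_index(documents: dict) -> tuple:
    """Map doc_id -> list of (wordid, position) and return it with the vocabulary."""
    vocabulary = {}        # term -> wordid, assigned in order of first appearance
    forward_index = {}
    for doc_id, terms in documents.items():
        entries = []
        for position, term in enumerate(terms):
            word_id = vocabulary.setdefault(term, len(vocabulary))
            entries.append((word_id, position))
        forward_index[doc_id] = entries
    return forward_index, vocabulary


forward_index, vocabulary = build_forward_index(
    {101: ["rock", "song", "rock"], 102: ["jazz", "song"]}
)
```

Because every document ends up as the same kind of entry list regardless of its service type, the output has one unified format.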
It can be understood that after the forward index file in a unified data format is generated, the method may further include:
In step S207, the forward index file is converted into an inverted index file so that the user can perform retrieval according to the inverted index file.
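The conversion in step S207 can be sketched as follows: a forward index maps each document to its terms, and inverting it maps each term to the documents (and positions) where it occurs. The function name `invert` and the in-memory dictionary layout are assumptions for illustration; the on-disk file format is not described in the text.

```python
from collections import defaultdict


def invert(forward_index: dict) -> dict:
    """Turn doc_id -> [terms] into term -> [(doc_id, position), ...]."""
    inverted = defaultdict(list)
    for doc_id, terms in forward_index.items():
        for position, term in enumerate(terms):
            inverted[term].append((doc_id, position))
    return dict(inverted)


inverted_index = invert({1: ["rock", "song"], 2: ["song"]})
```

A query for a term then becomes a single lookup in the inverted index instead of a scan over every document.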
As can be seen from the above, in this embodiment a corresponding configuration file is acquired according to the service type of each kind of service data; the data content of the service data is then preprocessed according to the field preprocessing instruction in the configuration file, and the preprocessed data content is segmented according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The invention processes data of different service types with their corresponding configuration files and then segments the data content with the same program, normalizing service data of different formats into index data in a unified data format. Index files can therefore be built uniformly for multiple types of service data, which simplifies the build process and improves efficiency.
3rd embodiment
Refer to Fig. 3a and Fig. 3b, which are schematic flowcharts of the index file generation method provided by the third embodiment of the present invention. The index file generation method provided by the present invention is based on a B/S (browser/server) architecture: the user uses the system through a browser, and the system supports generating index data in a unified data format for data of multiple service types on the same platform.
This embodiment of the present invention mainly analyzes the process of preprocessing the data content. The method includes the following steps:
In step S301, service data is obtained.
The service data includes data content and a service type. The service type may include video, music, pictures and so on; correspondingly, the service data may include video data, music data, picture data and so on, which is not specifically limited here.
It can be understood that, once the configuration files corresponding to the different service types have been generated, they are preset in the server. Thereafter, when the user's service data is received, the server is triggered to call up, according to the service type, the matching configuration file from the preset configuration files, and to process the data according to that configuration file to generate the index file.
In step S302, the corresponding configuration file is obtained according to the service type.
It can be understood that each service type corresponds to one configuration file. The configuration file includes a field preprocessing instruction and a word segmentation instruction; it is configured in advance by the user according to the characteristics of the service type in actual operation and is stored in the server.
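As an illustration of step S302, the lookup of a preset configuration by service type can be sketched as follows. This is a minimal sketch under stated assumptions: the `ServiceConfig` class, the `PRESET_CONFIGS` registry, the dictionary keys and the sample instruction values are all hypothetical and do not come from the patent, which does not specify how configuration files are stored.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceConfig:
    """Hypothetical in-memory form of one per-service-type configuration file."""
    preprocess: dict = field(default_factory=dict)    # field preprocessing instruction
    segmentation: dict = field(default_factory=dict)  # word segmentation instruction


# Assumed registry of preset configurations, keyed by service type.
PRESET_CONFIGS = {
    "music": ServiceConfig({"strip_punctuation": True}, {"title": "normal"}),
    "video": ServiceConfig({"strip_punctuation": True}, {"title": "prefix"}),
}


def get_config(service_data: dict) -> ServiceConfig:
    """Select the configuration matching the service type carried by the data."""
    service_type = service_data["service_type"]
    try:
        return PRESET_CONFIGS[service_type]
    except KeyError:
        raise ValueError(f"no configuration preset for service type {service_type!r}") from None
```

Keeping the registry keyed by service type means that supporting a new service type only requires adding a configuration entry; the processing code stays the same.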
More preferably, before the service data is obtained (i.e. before step S301), the method may further include generating a configuration file for each of the different service types. Specifically, the field configuration information corresponding to the service type may first be obtained, and the attribute values of the plurality of fields are then configured as the field configuration information indicates, yielding the configuration file corresponding to the service type.
In this embodiment of the present invention, the fields may include text-domain fields, numeric-domain fields and category-domain fields. Each field has its own attributes, which can be configured as the configuration information of that field indicates, thereby generating the configuration file. It can be understood that generating the configuration files corresponding to the different service types can be implemented as described for step S201 in the above embodiment, and the details are not repeated here.
It can be understood that the server may include a dynamic library for index data preprocessing. After the configuration file is obtained, this library preprocesses the data content according to the field preprocessing instruction in the configuration file, generating the preprocessed data content.
In this embodiment, preprocessing the data content mainly includes data cleaning and data rewriting. The order in which the two are executed is not limited: data cleaning may be performed first and data rewriting second, data rewriting may be performed first and data cleaning second, or both may be performed at the same time, since neither affects the other. The examples here do not constitute a limitation of the present invention.
In one embodiment, after the configuration file is obtained, step S303A may be performed:
Refer to Fig. 3a. In step S303A, according to the field preprocessing instruction in the configuration file, data cleaning is performed first and data rewriting second. Step S303A may specifically include:
Step A: judge whether a garbage field exists in the data content;
Depending on the result, step A1 or step A2 is performed;
Step A1: if a garbage field exists, delete the garbage field from the data content, and judge whether the data content after deletion needs to be rewritten;
Depending on the result of step A1, step A11 or step A12 is performed;
Step A11: if rewriting is needed, rewrite the data content after deletion, and take the rewritten data content as the preprocessed data content;
Step A12: if rewriting is not needed, take the data content after deletion as the preprocessed data content;
Step A2: if no garbage field exists, judge whether the data content needs to be rewritten;
Depending on the result of step A2, step A21 or step A22 is performed;
Step A21: if rewriting is needed, rewrite the data content, and take the rewritten data content as the preprocessed data content;
Step A22: if rewriting is not needed, take the data content as the preprocessed data content.
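The branching of steps A through A22 can be condensed into a short sketch. The function name, the representation of data content as a dictionary of field names to strings, and the three caller-supplied callables are assumptions made for illustration; the patent leaves the concrete garbage test and rewrite rule to the configuration file.

```python
from typing import Callable, Dict


def preprocess_clean_then_rewrite(
    content: Dict[str, str],
    is_garbage: Callable[[str, str], bool],
    needs_rewrite: Callable[[Dict[str, str]], bool],
    rewrite: Callable[[Dict[str, str]], Dict[str, str]],
) -> Dict[str, str]:
    """Clean first, then rewrite, mirroring steps A, A1/A2 and A11..A22."""
    # Step A: judge whether any garbage field exists.
    garbage = [name for name, value in content.items() if is_garbage(name, value)]
    if garbage:
        # Step A1: delete the garbage fields (a new dict, the input is untouched).
        content = {k: v for k, v in content.items() if k not in garbage}
    # Steps A11/A21: rewrite when needed; steps A12/A22: otherwise pass through.
    if needs_rewrite(content):
        content = rewrite(content)
    return content
```

Both branches of step A end in the same rewrite check, which is why the four leaf steps collapse into a single `if` here.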
In another embodiment, after the configuration file is obtained, step S303B may be performed:
Refer to Fig. 3b. In step S303B, according to the field preprocessing instruction in the configuration file, data rewriting is performed first and data cleaning second. Step S303B may specifically include:
Step B: judge whether the data content needs to be rewritten;
Depending on the result, step B1 or step B2 is performed;
Step B1: if rewriting is needed, rewrite the data content, and judge whether a garbage field exists in the rewritten data content;
Depending on the result of step B1, step B11 or step B12 is performed;
Step B11: if a garbage field exists, delete the garbage field from the rewritten data content, and take the data content after deletion as the preprocessed data content;
Step B12: if no garbage field exists, take the rewritten data content as the preprocessed data content;
Step B2: if rewriting is not needed, judge whether a garbage field exists in the data content;
Depending on the result of step B2, step B21 or step B22 is performed;
Step B21: if a garbage field exists, delete the garbage field from the data content, and take the data content after deletion as the preprocessed data content;
Step B22: if no garbage field exists, take the data content as the preprocessed data content.
Further, as steps S303A and S303B show, the purpose of data cleaning is to remove garbage fields from the data content, such as punctuation marks; such garbage content degrades the later retrieval experience and should therefore be removed. The purpose of data rewriting is to give the data the special handling it needs, for example separating a mixed Chinese-English name in a field into two names. These preprocessing operations therefore need to be carried out before the index data is generated.
More preferably, the server may also include a dynamic library for raw data preprocessing, which mainly processes the original service data; the data it produces serves as the input of the preprocessing operations described above. This processing mainly includes data expansion, format checking and the like. Data expansion addresses the case where the data pushed by some services is not comprehensive enough to meet all of the user's retrieval needs: other resources are crawled from the Internet to supplement the service's data. For video and music search, for example, a large amount of data about resources that do not belong to the preset system itself has been supplemented in this way. Format checking verifies the correctness of the data pushed by a service, checking whether any pushed data types or fields are inconsistent with the configuration. The present invention does not specifically limit how the original service data is processed.
In step S304, the preprocessed data content is analyzed to determine the attribute information of the data content;
In some embodiments, a preset word segmentation template may be obtained, and the preprocessed data content is analyzed according to the word segmentation template to determine the attribute information of the data content. Multiple word segmentation templates are preset in the server and may include data templates for multiple service types. For music data, for example, the data template may include a singer database, a song-title database and a genre database; analyzing the data content against them reveals its attribute information. For example, if the data content belongs to the music service type, the attribute information may refer to numeric attributes such as a song's download count and play count.
In step S305, the preprocessed data content is segmented according to the word segmentation instruction and the attribute information, and forward-index processing is performed on the segmented data content to generate a forward index file in a unified data format.
After preprocessing, the data can be processed according to the attribute information and the preset word segmentation instruction. This is a normalization step: service data in different formats is normalized into a unified data format, and the resulting normalized forward-index data supports data retrieval across multiple service types.
It can be understood that, after preprocessing, the data enters the forward-index processing unit (FSU), where the forward index is generated. Following the word segmentation instruction configured in the configuration file and several built-in word segmentation templates, the unit processes the data, computes the wordid, word POS information and other data that search requires, and finally outputs a forward index file in a unified format.
It can be understood that, after the forward index file in the unified data format is generated, the method may further include:
In step S306, the forward index file is converted into an inverted index file, so that the user can perform retrieval according to the inverted index file.
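The conversion in step S306 can be illustrated with a small sketch. It assumes the forward index is a mapping from document id to the list of terms produced by segmentation, and the inverted index a mapping from term to sorted document ids; the patent does not describe the on-disk layout of either file, so these in-memory shapes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def invert(forward_index: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Turn doc_id -> terms into term -> sorted list of doc_ids."""
    postings = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            postings[term].add(doc_id)  # a set avoids duplicate ids per term
    return {term: sorted(ids) for term, ids in postings.items()}
```

Keeping each posting list sorted makes it straightforward to intersect lists later when a query contains several terms.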
In combination with the above, the index files generated by the method are briefly analyzed below using an application scenario:
It can be understood that the generation method is based on a B/S (browser/server) architecture, and the system supports generating index data in a unified data format for data of multiple service types on the same platform. First, the platform provides page-based configuration: when service data is connected, the platform must be told which data fields the current service data has, along with the type and attribute values of each field. For details, refer to the description of field configuration in the second embodiment, which is not repeated here.
For example, a novel search service has six fields: four of them are text-domain fields that need to be indexed, and two are numeric-domain fields used for relevance scoring. Each field selected for indexing is semantically segmented, its wordid values are computed, and an inverted index is finally built; these are the fields that users can search.
The segmentation mode defines how the text in each field is cut into words when the text index is built. Common modes are normal segmentation, prefix segmentation and category-index segmentation.
Normal segmentation segments the text by its natural semantics. For example, the sentence "今天天气真好" ("the weather is really nice today") is divided into four words: 今天 / 天气 / 真 / 好 (today / weather / really / nice). Prefix segmentation divides the same sentence into six words, one for each leading substring: 今 / 今天 / 今天天 / 今天天气 / 今天天气真 / 今天天气真好. This mode is mainly used for the query suggestion function. Category-index segmentation is a more advanced usage for text that has categories. Novels, for example, may be classified as martial arts, romance, science fiction and so on; once the index has been built with category-index segmentation, the service can query by novel category, for example searching for a novel titled "Three-Body" whose category is science fiction.
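Prefix segmentation as described above amounts to emitting every leading substring of the text as an index term, which is what lets a partially typed query match. A minimal sketch follows; the function name is hypothetical, and splitting on characters rather than on bytes or words is an assumption.

```python
from typing import List


def prefix_segments(text: str) -> List[str]:
    """Return every non-empty leading substring of `text`, shortest first."""
    return [text[:i] for i in range(1, len(text) + 1)]
```

A six-character sentence therefore yields six index terms, consistent with the six words stated for the prefix mode above. The index grows with text length, which suggests using this mode only on short fields that feed the suggestion drop-down.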
In a preferred embodiment, a function for retrieval within a specified domain may also be provided. For example, a user may want only books whose title is "Three-Body", not books whose author name contains "Three-Body"; retrieval within a specified domain serves this case. The platform adds a special character to the index entry to mark it as an entry for in-domain retrieval. When the user issues a request, the corresponding mark is added to the query as well, so that documents that hit the word "Three-Body" only in other domains can be excluded directly by intersecting index entries.
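Retrieval within a specified domain can be sketched by storing, next to each plain index entry, a second entry whose key carries the domain name and a marker character. The separator `\x1f`, the function names and the sample documents are assumptions for illustration; the patent only says that a special character is added to the index entry.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set

FIELD_SEP = "\x1f"  # hypothetical special character marking an in-domain entry


def scoped_term(field_name: str, word: str) -> str:
    return f"{field_name}{FIELD_SEP}{word}"


def build_index(docs: Dict[int, Dict[str, List[str]]]) -> Dict[str, Set[int]]:
    """Index every word both globally and under its domain-scoped key."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field_name, words in fields.items():
            for word in words:
                index[word].add(doc_id)
                index[scoped_term(field_name, word)].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], words: Iterable[str],
           field_name: Optional[str] = None) -> Set[int]:
    """Intersect the posting sets of all query words, optionally within one domain."""
    keys = [scoped_term(field_name, w) if field_name else w for w in words]
    postings = [index.get(key, set()) for key in keys]
    return set.intersection(*postings) if postings else set()
```

With this layout a title-only query never touches the author entries, so documents that mention the word only in another domain drop out during the intersection without any post-filtering.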
As can be seen from the above, in this embodiment a corresponding configuration file is obtained according to the service type of each of several kinds of service data. The data content of the service data is then preprocessed according to the field preprocessing instruction in the configuration file, and the preprocessed data content is segmented according to the word segmentation instruction in the configuration file, producing an index file in a unified data format. The present invention processes data of different service types with their corresponding configuration files and then segments the data content with the same program, normalizing service data of different formats into index data of a unified data format. Index files can therefore be built uniformly for multiple service data types, which simplifies the build process and improves efficiency.
4th embodiment
To facilitate implementing the index file generation method provided by the embodiments of the present invention, an embodiment of the present invention also provides an index file generation device corresponding to that method. Terms have the same meanings as in the index file generation method above, and implementation details can be found in the description of the method embodiments.
Refer to Fig. 4, which is a schematic structural diagram of the index file generation device provided by the fourth embodiment of the present invention. The device is based on a B/S (browser/server) architecture: the user uses the system through a browser, and the system supports generating index data in a unified data format for data of multiple service types on the same platform.
The device includes: a first acquisition module 401, a second acquisition module 402, a preprocessing module 403, a word segmentation module 404 and an index generation module 405.
The first acquisition module 401 is configured to obtain service data, the service data including data content and a service type. The second acquisition module 402 is configured to obtain a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction.
It can be understood that the service type may include video, music, pictures and so on; correspondingly, the service data may include video data, music data, picture data and so on, which is not specifically limited here. In addition, the data format of the service data in this embodiment can be divided into two parts: one part carries the information identifying the service type, and the other part carries the data content corresponding to that service type.
Each service type corresponds to one configuration file, which is configured in advance by the user according to the characteristics of the service type in actual operation and is stored in the index file generation device.
Further, the configuration file contains an instruction for preprocessing the fields of the data content and an instruction for segmenting the fields of the data content. The configuration file is generated from the user's configuration of the fields of each kind of service data; the configuration of fields is not specifically limited here.
The preprocessing module 403 is configured to preprocess the data content according to the field preprocessing instruction and generate the preprocessed data content. The word segmentation module 404 is configured to segment the preprocessed data content according to the word segmentation instruction. The index generation module 405 is configured to perform forward-index processing on the segmented data content and generate an index file in a unified data format.
Because each service type corresponds to one configuration file, each service type also corresponds to one field preprocessing instruction. Preprocessing the data content of each service type according to its own field preprocessing instruction reflects the individualized, differentiated operations of the different service types. After preprocessing, the data can be processed according to the word segmentation templates preset on the platform and the preset word segmentation instruction. This is a normalization step: service data in different formats is sent to the forward-index processing unit (FSU) for forward index generation and is normalized into a unified data format, yielding normalized forward-index data that supports data retrieval across multiple service types.
As can be seen from the above, in this embodiment a corresponding configuration file is obtained according to the service type of each of several kinds of service data. The data content of the service data is then preprocessed according to the field preprocessing instruction in the configuration file, and the preprocessed data content is segmented according to the word segmentation instruction in the configuration file, producing an index file in a unified data format. The present invention processes data of different service types with their corresponding configuration files and then segments the data content with the same program, normalizing service data of different formats into index data of a unified data format. Index files can therefore be built uniformly for multiple service data types, which simplifies the build process and improves efficiency.
5th embodiment
Refer to Fig. 5, which is a schematic structural diagram of the index file generation device provided by the fifth embodiment of the present invention. It should be noted that the index file generation device provided by the present invention is based on a B/S (browser/server) architecture: the user uses the system through a browser, and the system supports generating index data in a unified data format for data of multiple service types on the same platform.
The index file generation device may include: a first acquisition module 501, a second acquisition module 502, a preprocessing module 503, a word segmentation module 504 and an index generation module 505. It can be understood that, in this embodiment, the functions of these modules correspond to those described for the first acquisition module 401, the second acquisition module 402, the preprocessing module 403, the word segmentation module 404 and the index generation module 405 in the fourth embodiment, and are not repeated here.
The device further includes a configuration file generation module 506, configured to generate, before the service data is obtained, a configuration file for each of the different service types.
It can be understood that each service type corresponds to one configuration file. The service type may include video, music, pictures and so on; correspondingly, the service data includes video data, music data and picture data. In this embodiment, the configuration file is configured in advance by the user according to the characteristics of the service type in actual operation and is stored in the server. The configuration file contains an instruction for preprocessing the fields of the data content and an instruction for segmenting the fields of the data content.
In a preferred embodiment, the configuration file may be obtained in the following manner:
The configuration file generation module 506 includes an acquisition unit 5061 and a configuration unit 5062;
The acquisition unit 5061 is configured to obtain the field configuration information corresponding to the service type. The field configuration information indicates the attribute values of a plurality of preset fields, and the fields include text-domain fields, numeric-domain fields and category-domain fields.
It can be understood that the service data of the present invention includes data content and a service type. The data content includes multiple documents, and each document consists of multiple fields. The field types can be preset and include text-domain fields, numeric-domain fields and category-domain fields. In addition, each field has at least one attribute, also called a configuration item; the attribute values are presented as selection boxes for the user to choose from and configure.
The configuration unit 5062 is configured to configure the attribute values of the plurality of fields as the field configuration information indicates, obtaining the configuration file corresponding to the service type.
On this basis, in a further preferred embodiment, the attribute values of the plurality of fields may be configured in the following specific manner. The configuration unit 5062 may include a first configuration subunit, a second configuration subunit, a third configuration subunit and a generation subunit;
The first configuration subunit is configured to configure the attribute values of the text-domain field as the field configuration information indicates, obtaining the configured text-domain field. The attributes of the text-domain field include one or more of description, data length, primary key, importance and segmentation mode;
The attributes of the text-domain field are briefly explained as follows:
Description states what the field means. It serves only as a hint and has no effect on search results;
Data length is the maximum length of the text in the field. Fields are currently divided into two grades according to whether they exceed 256 bytes: a field whose maximum length exceeds 256 bytes is called a long text field, and only one field in the entire text domain can be configured as a long text field;
Primary key marks the field that uniquely identifies a document, called doc_id. This field must be a value that can be converted into a number. Specifically, the value of doc_id is a 64-bit integer; because this value should be uniformly distributed over the uint64_t space, it is preferably produced by a hash value or the like. A hash value is a number computed from the data content by logical operations; different documents yield different hash values, so the hash value serves as the identity card of each document;
Importance indicates how important the text field is and can be graded as important, normal, unimportant and so on;
Segmentation mode is either normal segmentation or prefix segmentation. Normal segmentation segments the text by its natural semantics and can generally be selected by default; prefix segmentation suits the scenario of a suggestion drop-down under a search box.
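The primary-key attribute above asks for a 64-bit doc_id spread uniformly over the uint64_t range and suggests producing it with a hash. One way to sketch this is shown below; the choice of BLAKE2b with an 8-byte digest is an assumption made for illustration, since the patent says only "hash value or the like" and names no algorithm.

```python
import hashlib


def make_doc_id(document_key: str) -> int:
    """Derive a 64-bit unsigned doc_id from a document's identifying text."""
    # digest_size=8 gives exactly 64 bits; BLAKE2b is an assumed choice.
    digest = hashlib.blake2b(document_key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big")
```

The same key always maps to the same doc_id, so a document pushed again is recognized as the same document. With only 64 bits, two different keys can in principle collide, so a production system would need a policy for that case.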
The second configuration subunit is configured to configure the attribute values of the numeric-domain field as the field configuration information indicates, obtaining the configured numeric-domain field. The attributes of the numeric-domain field include one or more of description, data type, authority, importance and primary key;
The numeric-domain field suits information of a numeric type, such as price or download count. The string value in this field must be convertible into a number. The attributes of the numeric-domain field are briefly explained as follows:
Description states what the field means. It serves only as a hint and has no effect on search results;
Data type: in this embodiment the configuration item offers int8, uint8, int16, uint16, int32, uint32, int64, uint64 and float. The user chooses according to the largest range the value may take; if an actual value in the supplied data exceeds the configured range, an error occurs;
Authority indicates that the field can reflect the authority of the document. For video search, for example, the view count may be chosen as the authority field. Only zero or one numeric-domain field can be designated as the authority field;
Importance indicates how important the field is and can be graded as important, normal, unimportant and so on;
Primary key is defined as for the text-domain field: it marks the field that uniquely identifies a document, called doc_id. This field must be a value that can be converted into a number. Specifically, the value of doc_id is a 64-bit integer; because this value should be uniformly distributed over the uint64_t space, it is preferably produced by a hash value or the like.
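The data-type attribute above implies a range check when a numeric-domain value is read: the string must convert to a number, and the number must fit the configured type. A minimal sketch follows; the function name and the use of `ValueError` are assumptions, since the patent says only that out-of-range data causes an error.

```python
from typing import Dict, Tuple, Union

# Inclusive value ranges for the integer types listed as configuration choices.
INT_RANGES: Dict[str, Tuple[int, int]] = {}
for bits in (8, 16, 32, 64):
    INT_RANGES[f"int{bits}"] = (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    INT_RANGES[f"uint{bits}"] = (0, 2 ** bits - 1)


def parse_numeric(value: str, data_type: str) -> Union[int, float]:
    """Convert a numeric-domain string and verify it fits the configured type."""
    if data_type == "float":
        return float(value)  # raises ValueError if the string is not numeric
    low, high = INT_RANGES[data_type]
    number = int(value)      # raises ValueError if the string is not an integer
    if not low <= number <= high:
        raise ValueError(f"{number} is outside the range of {data_type}")
    return number
```

Running such a check before indexing surfaces a mismatch between pushed data and the configuration early, rather than letting a silently truncated value distort relevance scoring.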
The third configuration subunit is configured to configure the attributes of the category-domain field as the field configuration information indicates, obtaining the configured category-domain field; the attributes of the category-domain field include the category specified for retrieval. The generation subunit is configured to generate the configuration file corresponding to the service type from the configured text-domain field, the configured numeric-domain field and the configured category-domain field.
More preferably, the preprocessing module 503 may preprocess the data content according to the field preprocessing instruction in the configuration file. Preprocessing the data content mainly includes data cleaning and data rewriting. The order in which the two are executed is not limited: data cleaning may be performed first and data rewriting second, data rewriting may be performed first and data cleaning second, or both may be performed at the same time, since neither affects the other. The examples here do not constitute a limitation of the present invention.
On this basis, in one embodiment the preprocessing module 503 performs data cleaning first and data rewriting second, and may include a first judging unit 5031, a first processing unit 5032 and a second processing unit 5033;
The first judging unit 5031 is configured to judge whether a garbage field exists in the data content;
The first processing unit 5032 is configured to, if a garbage field exists, delete the garbage field from the data content and judge whether the data content after deletion needs to be rewritten; if rewriting is needed, to rewrite the data content after deletion and take the rewritten data content as the preprocessed data content; and if rewriting is not needed, to take the data content after deletion as the preprocessed data content;
The second processing unit 5033 is configured to, if no garbage field exists, judge whether the data content needs to be rewritten; if rewriting is needed, to rewrite the data content and take the rewritten data content as the preprocessed data content; and if rewriting is not needed, to take the data content as the preprocessed data content.
In another embodiment, the preprocessing module 503 performs data rewriting first and data cleaning second, and may include a second judging unit 5034, a third processing unit 5035 and a fourth processing unit 5036;
The second judging unit 5034 is configured to judge whether the data content needs to be rewritten;
The third processing unit 5035 is configured to, if rewriting is needed, rewrite the data content and judge whether a garbage field exists in the rewritten data content; if a garbage field exists, to delete the garbage field from the rewritten data content and take the data content after deletion as the preprocessed data content; and if no garbage field exists, to take the rewritten data content as the preprocessed data content;
The fourth processing unit 5036 is configured to, if rewriting is not needed, judge whether a garbage field exists in the data content; if a garbage field exists, to delete the garbage field from the data content and take the data content after deletion as the preprocessed data content; and if no garbage field exists, to take the data content as the preprocessed data content.
Further, the word segmentation module 504 may include: an attribute information determination unit, configured to analyze the preprocessed data content to determine the attribute information of the data content; and a segmentation unit, configured to segment the preprocessed data content according to the word segmentation instruction and the attribute information, generating the segmented data content.
In some embodiments, the attribute information determination unit may include: an acquisition subunit, configured to obtain a preset word segmentation template; and a determination subunit, configured to analyze the preprocessed data content according to the word segmentation template and determine the attribute information of the data content. Multiple word segmentation templates are preset in the server and may include data templates for multiple service types. For music data, for example, the data template may include a singer database, a song-title database and a genre database; analyzing the data content against them reveals its attribute information.
After pretreatment, can indicate according to described attribute information and the word segmentation processing preset and process, i.e. It is normalized operation, by the business datum of different-format, is normalized to unified data form, obtains In-line arrangement data after normalization, to adapt to the data retrieval of multiple business type.
It is understood that, after preprocessing, the data can enter the forward-index processing unit FSU for forward index generation. Following the word segmentation instruction configured in the configuration file and the several built-in word segmentation templates, the unit processes the data, computes the data information that search needs, such as wordid and word POS information, and finally outputs a forward index file in a unified format.
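As a rough illustration of forward index generation, the sketch below assigns each distinct word an integer id and records it with its position in the document. Two assumptions are made here that the text does not confirm: "word POS information" is read as the word's position, and whitespace splitting stands in for the configured word segmentation.

```python
from typing import Dict, List, Tuple

ForwardIndex = Dict[int, List[Tuple[int, int]]]


def build_forward_index(docs: Dict[int, str]) -> Tuple[Dict[str, int], ForwardIndex]:
    """Return (vocabulary, forward index).

    The forward index maps a document id to a list of (word id, position)
    pairs, in the order the words occur in the document.
    """
    vocab: Dict[str, int] = {}
    forward: ForwardIndex = {}
    for doc_id, text in docs.items():
        entries: List[Tuple[int, int]] = []
        for position, token in enumerate(text.lower().split()):
            # A new word receives the next unused id; a known word keeps its id.
            word_id = vocab.setdefault(token, len(vocab))
            entries.append((word_id, position))
        forward[doc_id] = entries
    return vocab, forward


vocab, forward = build_forward_index({1: "hello world", 2: "world music"})
```

Because every document ends up in the same (word id, position) layout regardless of its service type, this is one simple way to picture the "unified format" of the output.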
It is understood that, after the forward index file in the unified data format is generated, the device may further include: a conversion module 507, configured to convert the forward index file into an inverted index file, so that users can search according to the inverted index file.
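Converting a forward index into an inverted index amounts to regrouping the same entries by word instead of by document. The sketch below assumes a forward index that maps a document id to (word id, position) pairs; this layout is an illustrative assumption, not the file format used in the patent.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def invert(forward: Dict[int, List[Tuple[int, int]]]) -> Dict[int, List[Tuple[int, int]]]:
    """Turn doc id -> [(word id, position)] into word id -> [(doc id, position)]."""
    inverted: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for doc_id in sorted(forward):  # sorted so posting lists are ordered by doc id
        for word_id, position in forward[doc_id]:
            inverted[word_id].append((doc_id, position))
    return dict(inverted)


forward_index = {1: [(0, 0), (1, 1)], 2: [(1, 0), (2, 1)]}
inverted_index = invert(forward_index)
```

A query for word id 1 can then read its posting list directly, without scanning every document.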
As can be seen from the above, in this embodiment the corresponding configuration file is obtained according to the service type of each of multiple kinds of service data; the data content of the service data is then preprocessed according to the field preprocessing instruction in the configuration file, and the preprocessed data content is segmented according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The present invention processes data of different service types with their corresponding configuration files and then uses the same program to segment the data content, normalizing service data in different formats into index data in a unified data format. Index files can therefore be built uniformly for multiple service data types, which simplifies the building process and improves efficiency.
Sixth embodiment
An embodiment of the present invention further provides a server in which the index file generation device of the embodiments of the present invention can be integrated. Figure 6 shows a schematic structural diagram of the server involved in the embodiments of the present invention. Specifically:
The server may include a processor 601 with one or more processing cores, a memory 602 comprising one or more computer-readable storage media, a radio frequency (RF) circuit 603, a power supply 604, an input unit 605, a display unit 606 and other components. Those skilled in the art will understand that the server structure shown in Figure 6 does not limit the server, which may include more or fewer components than illustrated, combine some components, or arrange the components differently. Specifically:
The processor 601 is the control center of the server. It connects the parts of the whole server through various interfaces and lines, and performs the server's functions and processes data by running or executing the software programs and/or modules stored in the memory 602 and calling the data stored in the memory 602, thereby monitoring the server as a whole. Optionally, the processor 601 may include one or more processing cores. Preferably, the processor 601 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs and the like, and the modem processor mainly handles wireless communication. It is understood that the modem processor need not be integrated into the processor 601.
The memory 602 can be used to store software programs and modules; the processor 601 performs various functional applications and data processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the server. In addition, the memory 602 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602.
The RF circuit 603 can be used to receive and send signals while information is being sent and received. In particular, after receiving downlink information from a base station, it hands the information to one or more processors 601 for processing; it also sends uplink data to the base station. Typically, the RF circuit 603 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer and the like. In addition, the RF circuit 603 can communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS) and the like.
The server also includes a power supply 604 (such as a battery) that powers the components. Preferably, the power supply may be logically connected to the processor 601 through a power management system, so that charging, discharging, power consumption management and similar functions are handled by the power management system. The power supply 604 may also include any components such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
The server may also include an input unit 605, which can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
The server may also include a display unit 606, which can be used to display information entered by the user or provided to the user, as well as the various graphical user interfaces of the server; these graphical user interfaces may be composed of graphics, text, icons, video and any combination thereof. The display unit 606 may include a display panel, which may optionally be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display or the like.
Specifically, in this embodiment, the processor 601 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and runs the application programs stored in the memory 602 to implement various functions, as follows:
Obtaining service data, the service data including data content and a service type; obtaining a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction; preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content; performing word segmentation on the preprocessed data content according to the word segmentation instruction; and performing forward-index processing on the data content after word segmentation to generate an index file in a unified data format.
Preferably, the processor 601 is further configured to: generate configuration files corresponding to the different service types respectively.
Further: obtaining field configuration information corresponding to the service type, the field configuration information indicating attribute values of multiple preset fields, the fields including text fields, numeric fields and category fields; and configuring the attribute values of the multiple fields as indicated by the field configuration information to obtain the configuration file corresponding to the service type.
Preferably, the processor 601 is further configured to: judge whether a garbage field exists in the data content;
If a garbage field exists, delete the garbage field from the data content and judge whether the data content after deletion needs to be rewritten; if rewriting is needed, rewrite the data content after deletion and use the rewritten data content as the preprocessed data content; if rewriting is not needed, use the data content after deletion as the preprocessed data content;
If no garbage field exists, judge whether the data content needs to be rewritten; if rewriting is needed, rewrite the data content and use the rewritten data content as the preprocessed service data; if rewriting is not needed, use the data content as the preprocessed data content.
Preferably, the processor 601 is further configured to: judge whether the data content needs to be rewritten;
If rewriting is needed, rewrite the data content and judge whether a garbage field exists in the rewritten data content; if a garbage field exists, delete the garbage field from the rewritten data content and use the data content after deletion as the preprocessed data content; if no garbage field exists, use the rewritten data content as the preprocessed data content;
If rewriting is not needed, judge whether a garbage field exists in the data content; if a garbage field exists, delete the garbage field from the data content and use the data content after deletion as the preprocessed data content; if no garbage field exists, use the data content as the preprocessed data content.
Preferably, the processor 601 is further configured to:
Configure the attribute values of the attributes of the text field as indicated by the field configuration information to obtain the configured text field, the attributes of the text field including one or more of description, data length, primary key, importance and word segmentation mode;
Configure the attribute values of the attributes of the numeric field as indicated by the field configuration information to obtain the configured numeric field, the attributes of the numeric field including one or more of description, data type, authority, importance and primary key;
Configure the attributes of the category field as indicated by the field configuration information to obtain the configured category field, the attributes of the category field including the categories specified for retrieval;
Generate the configuration file corresponding to the service type from the configured text field, the configured numeric field and the configured category field.
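A per-service-type configuration file of this kind could be sketched as below. The JSON layout, the class and attribute names, and the example `music` fields are all hypothetical; the patent names the attributes but does not define a file format. Only a subset of the listed attributes is modeled.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TextField:
    name: str
    description: str = ""
    data_length: int = 256
    primary_key: bool = False
    importance: int = 1
    segmentation_mode: str = "default"


@dataclass
class NumericField:
    name: str
    description: str = ""
    data_type: str = "int32"
    primary_key: bool = False
    importance: int = 1


@dataclass
class CategoryField:
    name: str
    retrieval_categories: List[str] = field(default_factory=list)


def build_config(service_type: str,
                 text_fields: List[TextField],
                 numeric_fields: List[NumericField],
                 category_fields: List[CategoryField]) -> str:
    """Serialize the configured fields of one service type as JSON text."""
    config = {
        "service_type": service_type,
        "text_fields": [asdict(f) for f in text_fields],
        "numeric_fields": [asdict(f) for f in numeric_fields],
        "category_fields": [asdict(f) for f in category_fields],
    }
    return json.dumps(config, indent=2, sort_keys=True)


music_config = build_config(
    "music",
    [TextField("song_title", importance=3)],
    [NumericField("song_id", data_type="int64", primary_key=True)],
    [CategoryField("genre", ["pop", "rock"])],
)
```

Keeping one such file per service type is what lets a single indexing program handle differently shaped data: only the configuration changes, not the code.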
Preferably, the processor 601 is further configured to: analyze the preprocessed data content to determine the attribute information of the data content; and segment the preprocessed service data according to the word segmentation instruction and the attribute information, generating the data content after word segmentation.
Further, convert the forward index file into an inverted index file, so that users can search according to the inverted index file.
Preferably, the processor 601 is further configured to: obtain a preset word segmentation template; and analyze the preprocessed data content according to the word segmentation template to determine the attribute information of the data content.
It is understood that, in the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the detailed description of the index file generation method above, which is not repeated here.
As can be seen from the above, the server provided by this embodiment obtains the corresponding configuration file according to the service type of each of multiple kinds of service data, then preprocesses the data content of the service data according to the field preprocessing instruction in the configuration file, and segments the preprocessed data content according to the word segmentation instruction in the configuration file, thereby generating an index file in a unified data format. The present invention processes data of different service types with their corresponding configuration files and then uses the same program to segment the data content, normalizing service data in different formats into index data in a unified data format. Index files can therefore be built uniformly for multiple service data types, which simplifies the building process and improves efficiency.
The index file generation device provided by the embodiments of the present invention is, for example, a computer, a tablet computer, or a mobile phone with a touch function. The index file generation device and the index file generation method in the foregoing embodiments belong to the same concept, and the device can run any method provided in the index file generation method embodiments. For its specific implementation, see the corresponding index file generation method embodiments, which are not repeated here.
It should be noted that, for the index file generation method of the present invention, a person of ordinary skill in the art will understand that all or part of the flow of the index file generation method described in the embodiments of the present invention can be completed by a computer program controlling the relevant hardware. The computer program can be stored in a computer-readable storage medium, for example in the memory of a terminal, and executed by at least one processor in the terminal; its execution may include the flow of the embodiments of the index file generation method. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
For the index file generation device of the embodiments of the present invention, its functional modules may be integrated in one processing chip, each module may exist physically on its own, or two or more modules may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disc.
The index file generation method and device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. For those skilled in the art, both the specific implementation and the scope of application may change according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (18)

1. An index file generation method, characterized in that the method comprises:
Obtaining service data, the service data including data content and a service type;
Obtaining a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction;
Preprocessing the data content according to the field preprocessing instruction to generate preprocessed data content;
Performing word segmentation on the preprocessed data content according to the word segmentation instruction;
Performing forward-index processing on the data content after word segmentation to generate an index file in a unified data format.
2. The index file generation method according to claim 1, characterized in that, before the obtaining of service data, the method further comprises:
Generating configuration files corresponding to the different service types respectively.
3. The index file generation method according to claim 2, characterized in that the generating of configuration files corresponding to the different service types respectively comprises:
Obtaining field configuration information corresponding to the service type, the field configuration information indicating attribute values of multiple preset fields, the fields including text fields, numeric fields and category fields;
Configuring the attribute values of the multiple fields as indicated by the field configuration information to obtain the configuration file corresponding to the service type.
4. The index file generation method according to any one of claims 1 to 3, characterized in that the preprocessing of the data content according to the field preprocessing instruction to generate preprocessed data content comprises:
Judging whether a garbage field exists in the data content;
If a garbage field exists, deleting the garbage field from the data content and judging whether the data content after deletion needs to be rewritten; if rewriting is needed, rewriting the data content after deletion and using the rewritten data content as the preprocessed data content; if rewriting is not needed, using the data content after deletion as the preprocessed data content;
If no garbage field exists, judging whether the data content needs to be rewritten; if rewriting is needed, rewriting the data content and using the rewritten data content as the preprocessed service data; if rewriting is not needed, using the data content as the preprocessed data content.
5. The index file generation method according to any one of claims 1 to 3, characterized in that the preprocessing of the data content according to the field preprocessing instruction to generate preprocessed data content comprises:
Judging whether the data content needs to be rewritten;
If rewriting is needed, rewriting the data content and judging whether a garbage field exists in the rewritten data content; if a garbage field exists, deleting the garbage field from the rewritten data content and using the data content after deletion as the preprocessed data content; if no garbage field exists, using the rewritten data content as the preprocessed data content;
If rewriting is not needed, judging whether a garbage field exists in the data content; if a garbage field exists, deleting the garbage field from the data content and using the data content after deletion as the preprocessed data content; if no garbage field exists, using the data content as the preprocessed data content.
6. The index file generation method according to claim 3, characterized in that the configuring of the attribute values of the multiple fields as indicated by the field configuration information to obtain the configuration file corresponding to the service type comprises:
Configuring the attribute values of the attributes of the text field as indicated by the field configuration information to obtain the configured text field, the attributes of the text field including one or more of description, data length, primary key, importance and word segmentation mode;
Configuring the attribute values of the attributes of the numeric field as indicated by the field configuration information to obtain the configured numeric field, the attributes of the numeric field including one or more of description, data type, authority, importance and primary key;
Configuring the attributes of the category field as indicated by the field configuration information to obtain the configured category field, the attributes of the category field including the categories specified for retrieval;
Generating the configuration file corresponding to the service type from the configured text field, the configured numeric field and the configured category field.
7. The index file generation method according to any one of claims 1 to 3, characterized in that the step of performing word segmentation on the preprocessed data content according to the word segmentation instruction comprises:
Analyzing the preprocessed data content to determine attribute information of the data content;
Segmenting the preprocessed service data according to the word segmentation instruction and the attribute information to generate the data content after word segmentation.
8. The index file generation method according to claim 7, characterized in that, after the performing of forward-index processing on the data content after word segmentation to generate a forward index file in a unified data format, the method further comprises:
Converting the forward index file into an inverted index file, so that users can search according to the inverted index file.
9. The index file generation method according to claim 7, characterized in that the analyzing of the preprocessed data content to determine the attribute information of the data content comprises:
Obtaining a preset word segmentation template;
Analyzing the preprocessed data content according to the word segmentation template to determine the attribute information of the data content.
10. An index file generation device, characterized in that the device comprises:
A first obtaining module, configured to obtain service data, the service data including data content and a service type;
A second obtaining module, configured to obtain a corresponding configuration file according to the service type, the configuration file including a field preprocessing instruction and a word segmentation instruction;
A preprocessing module, configured to preprocess the data content according to the field preprocessing instruction to generate preprocessed data content;
A word segmentation module, configured to perform word segmentation on the preprocessed data content according to the word segmentation instruction;
An index generation module, configured to perform forward-index processing on the data content after word segmentation to generate an index file in a unified data format.
11. The index file generation device according to claim 10, characterized in that the device further comprises: a configuration file generation module, configured to generate, before the service data is obtained, configuration files corresponding to the different service types respectively.
12. The index file generation device according to claim 11, characterized in that the configuration file generation module comprises:
An obtaining unit, configured to obtain field configuration information corresponding to the service type, the field configuration information indicating attribute values of multiple preset fields, the fields including one or more of text fields, numeric fields and category fields;
A configuration unit, configured to configure the attribute values of the multiple fields as indicated by the field configuration information to obtain the configuration file corresponding to the service type.
13. The index file generation device according to any one of claims 10 to 12, characterized in that the preprocessing module comprises:
A first judging unit, configured to judge whether a garbage field exists in the data content;
A first processing unit, configured to, if a garbage field exists, delete the garbage field from the data content and judge whether the data content after deletion needs to be rewritten; if rewriting is needed, rewrite the data content after deletion and use the rewritten data content as the preprocessed data content; if rewriting is not needed, use the data content after deletion as the preprocessed data content;
A second processing unit, configured to, if no garbage field exists, judge whether the data content needs to be rewritten; if rewriting is needed, rewrite the data content and use the rewritten data content as the preprocessed service data; if rewriting is not needed, use the data content as the preprocessed data content.
14. The index file generation device according to any one of claims 10 to 12, characterized in that the preprocessing module comprises:
A second judging unit, configured to judge whether the data content needs to be rewritten;
A third processing unit, configured to, if rewriting is needed, rewrite the data content and judge whether a garbage field exists in the rewritten data content; if a garbage field exists, delete the garbage field from the rewritten data content and use the data content after deletion as the preprocessed data content; if no garbage field exists, use the rewritten data content as the preprocessed data content;
A fourth processing unit, configured to, if rewriting is not needed, judge whether a garbage field exists in the data content; if a garbage field exists, delete the garbage field from the data content and use the data content after deletion as the preprocessed data content; if no garbage field exists, use the data content as the preprocessed data content.
15. The index file generation device according to claim 12, characterized in that the configuration unit comprises:
A first configuration subunit, configured to configure the attribute values of the attributes of the text field as indicated by the field configuration information to obtain the configured text field, the attributes of the text field including one or more of description, data length, primary key, importance and word segmentation mode;
A second configuration subunit, configured to configure the attribute values of the attributes of the numeric field as indicated by the field configuration information to obtain the configured numeric field, the attributes of the numeric field including one or more of description, data type, authority, importance and primary key;
A third configuration subunit, configured to configure the attributes of the category field as indicated by the field configuration information to obtain the configured category field, the attributes of the category field including the categories specified for retrieval;
A generation subunit, configured to generate the configuration file corresponding to the service type from the configured text field, the configured numeric field and the configured category field.
16. The index file generation device according to any one of claims 10 to 12, characterized in that the word segmentation module comprises:
An attribute information determining unit, configured to analyze the preprocessed data content to determine attribute information of the data content;
A word segmentation unit, configured to segment the preprocessed service data according to the word segmentation instruction and the attribute information to generate the data content after word segmentation.
17. The index file generation device according to claim 16, characterized in that the device further comprises:
A conversion module, configured to convert the forward index file into an inverted index file, so that users can search according to the inverted index file.
18. The index file generation device according to claim 16, characterized in that the attribute information determining unit comprises:
An obtaining subunit, configured to obtain a preset word segmentation template;
A determining subunit, configured to analyze the preprocessed data content according to the word segmentation template to determine the attribute information of the data content.
CN201510039519.5A 2015-01-27 2015-01-27 Index file generation method and device Active CN105988996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510039519.5A CN105988996B (en) 2015-01-27 2015-01-27 Index file generation method and device

Publications (2)

Publication Number Publication Date
CN105988996A true CN105988996A (en) 2016-10-05
CN105988996B CN105988996B (en) 2020-04-10

Family

ID=57034424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510039519.5A Active CN105988996B (en) 2015-01-27 2015-01-27 Index file generation method and device

Country Status (1)

Country Link
CN (1) CN105988996B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140032703A1 (en) * 2008-05-30 2014-01-30 Matthew A. Wormley System and method for an expandable computer storage system
CN102567418A (en) * 2010-12-23 2012-07-11 北大方正集团有限公司 Methods and devices for integrating and searching data
CN103823799A (en) * 2012-11-16 2014-05-28 镇江诺尼基智能技术有限公司 New-generation industry knowledge full-text search method
CN104199977A (en) * 2014-09-24 2014-12-10 浪潮软件股份有限公司 Method for creating information search based on data in database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐树振 et al.: "Research on Retrieval of Enterprise Unstructured Data" (企业非结构化数据检索研究), 《信息技术》 (Information Technology) *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241713B (en) * 2016-12-27 2021-12-28 南京烽火星空通信发展有限公司 Inverted index retrieval method based on multi-element segmentation
CN108241713A (en) * 2016-12-27 2018-07-03 南京烽火软件科技有限公司 A kind of inverted index search method based on polynary cutting
CN107179953B (en) * 2017-03-31 2020-04-03 北京奇艺世纪科技有限公司 Index file generation method, device and system
CN107179953A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of index file generation method, apparatus and system
CN107256206A (en) * 2017-05-24 2017-10-17 北京京东尚科信息技术有限公司 The method and apparatus of character stream format conversion
CN107256206B (en) * 2017-05-24 2021-04-30 北京京东尚科信息技术有限公司 Method and device for converting character stream format
CN109327321B (en) * 2017-08-01 2021-10-15 中兴通讯股份有限公司 Network model service execution method and device, SDN controller and readable storage medium
CN109327321A (en) * 2017-08-01 2019-02-12 中兴通讯股份有限公司 Network model business executes method, apparatus, SDN controller and readable storage medium storing program for executing
CN108062297A (en) * 2017-11-22 2018-05-22 万兴科技股份有限公司 A kind of creation method, creating device and the terminal device of pdf document textview field
CN108062297B (en) * 2017-11-22 2021-06-15 深圳市亿图软件有限公司 PDF file text field creating method and device and terminal equipment
CN109241098A (en) * 2018-08-08 2019-01-18 南京中新赛克科技有限责任公司 A kind of enquiring and optimizing method of distributed data base
CN109241098B (en) * 2018-08-08 2022-02-18 南京中新赛克科技有限责任公司 Query optimization method for distributed database
CN109783444A (en) * 2018-12-26 2019-05-21 亚信科技(中国)有限公司 Multichannel file index method, device, computer equipment and storage medium
CN110427368A (en) * 2019-07-12 2019-11-08 深圳绿米联创科技有限公司 Data processing method, device, electronic equipment and storage medium
CN110427368B (en) * 2019-07-12 2022-07-12 深圳绿米联创科技有限公司 Data processing method and device, electronic equipment and storage medium
WO2021012553A1 (en) * 2019-07-25 2021-01-28 深圳壹账通智能科技有限公司 Data processing method and related device
CN110489417A (en) * 2019-07-25 2019-11-22 深圳壹账通智能科技有限公司 A kind of data processing method and relevant device
CN110489417B (en) * 2019-07-25 2023-03-28 深圳壹账通智能科技有限公司 Data processing method and related equipment
CN110990126A (en) * 2019-12-12 2020-04-10 北京明略软件系统有限公司 Method and device for realizing shortcut front-end service page based on js
CN113468393A (en) * 2021-06-09 2021-10-01 北京达佳互联信息技术有限公司 Index generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN105988996B (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN105988996A (en) Index file generation method and device
CN103353899B (en) Accurate search method for integrated information
US20110179061A1 (en) Extraction and Publication of Reusable Organizational Knowledge
CN101578617A (en) Method, apparatus and computer program product for making semantic annotations for easy file organization and search
CN104516902A (en) Semantic information acquisition method and corresponding keyword extension method and search method
CN102665231B (en) Method of automatically generating parameter configuration file for LTE (Long Term Evolution) system
CN106201890B (en) Performance optimization method and server for an application
CN107305527B (en) Code file processing method and device
CN111651453B (en) User history behavior query method and device, electronic equipment and storage medium
CN107391509A (en) Label recommendation method and device
CN101281430A (en) Apparatus with expression symbol associating input function and associating input method
CN110390569A (en) Content promotion method, device and storage medium
CN105677148A (en) Terminal application searching method and device
CN110928917A (en) Target user determination method and device, computing equipment and medium
CN104834759A (en) Realization method and device for electronic design
CN104412262A (en) Method and apparatus for providing task-based service recommendations
CN110069769A (en) Application label generation method, device and storage device
CN107423291A (en) Data translation method and client device
CN109003012B (en) Goods location recommendation link information acquisition method, goods location recommendation method, device and system
CN106201198B (en) Terminal application lookup method, device and mobile terminal
CN110489032B (en) Dictionary query method for electronic book and electronic equipment
US10289740B2 (en) Computer systems to outline search content and related methods therefor
CN104424300A (en) Personalized search suggestion method and device
CN105991312B (en) Rearrangement method and device for Internet resources
CN110399337B (en) File automation service method and system based on data driving

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant