CN105528411A

CN105528411A - Full-text retrieval device and method for interactive electronic technical manual of shipping equipment

Info

Publication number: CN105528411A
Application number: CN201510884252.XA
Authority: CN
Inventors: 马良荔; 覃基伟; 苏凯; 许国鹏
Original assignee: Naval University of Engineering PLA
Current assignee: Naval University of Engineering PLA
Priority date: 2015-12-03
Filing date: 2015-12-03
Publication date: 2016-04-27
Anticipated expiration: 2035-12-03
Also published as: CN105528411B

Abstract

The present invention discloses a full-text retrieval device for an interactive electronic technical manual of shipping equipment. The full-text retrieval device comprises a common source database, a specialized vocabulary extraction module, an abbreviation extraction module, a first segmentation module, a technical information term database, an equipment part name database, an abbreviation database, a general vocabulary database, a retrieval record database, a user retrieval command communication module, a retrieval module, a second segmentation module, an index database and an index module. The device combines the element tag features and the document content of data module documents, queries using specialized vocabulary, and increases the weight of specialized vocabulary in documents and retrieval keywords, so that the system can query at a certain semantic level and the returned results are closer to the user's retrieval intent, thereby ensuring a high recall rate and accuracy of the retrieval system.

Description

Full-text retrieval device and method for an interactive electronic technical manual of shipping equipment

Technical field

The present invention relates to the technical field of information retrieval, and in particular to a full-text retrieval device and method for an interactive electronic technical manual of shipping equipment.

Technical background

At present, most technical information for shipping equipment exists in paper form. As a result, managing it becomes increasingly burdensome, data duplication and redundancy grow, updates are difficult, and data interoperability, real-time transmission and sharing are hard to achieve. To address these problems, an interactive electronic technical manual (IETM, Interactive Electronic Technical Manual) is usually created to manage the technical information. An IETM is a technical publication that follows a standard digital authoring format, uses text, graphics, tables, audio and video, and presents content such as the basic principles of the equipment and its operation, use and support through human-computer interaction. Because an IETM system involves a large amount of varied information, users normally rely on information retrieval to locate the required content quickly, and full-text retrieval is one of the most commonly used methods. Past IETM full-text retrieval methods mostly adopted general-domain retrieval schemes that do not take full account of the characteristics of technical information in a specialized domain, so the retrieval results are unsatisfactory.

Full-text retrieval is a retrieval method that matches the retrieval keywords against the entire text of a document. In Chinese text there are no spaces or other obvious separators between words, so a Chinese character string must be segmented into individual words according to some specification before a computer can automatically recognize the meaning of a sentence and match the document text against the retrieval keywords. Chinese word segmentation is therefore the core technology of Chinese full-text retrieval. Among the segmentation methods in common use, string-matching segmentation is the most widely applied: the string to be segmented is matched against a dictionary according to some strategy to obtain the segmentation result. In a specialized domain, if the dictionary lacks specialized vocabulary, string-matching segmentation cannot achieve a satisfactory result, and the amount of specialized vocabulary in the dictionary directly affects segmentation accuracy.
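The dependence of string-matching segmentation on the dictionary can be illustrated with a minimal sketch. The code below uses forward maximum matching, which is one common string-matching strategy and is not stated to be the one used by the invention; the dictionaries are hypothetical, and spaceless English stands in for Chinese text.

```python
def forward_max_match(text, dictionary, max_len=20):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical dictionaries; spaceless English stands in for Chinese text.
general_only = {"radar", "device"}
with_specialized = general_only | {"radartestdevice"}

print(forward_max_match("radartestdevice", general_only))
# ['radar', 't', 'e', 's', 't', 'device']
print(forward_max_match("radartestdevice", with_specialized))
# ['radartestdevice']
```

Without the specialized entry, the part of the string that the dictionary does not cover degrades into single characters, which is the failure mode described above.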

In the field of shipping equipment IETMs there are two main classes of specialized vocabulary. One class is shipping equipment part names, such as "SMR-7200 marine radar" and "05106 current mode screw propeller anemoscope". The other class is technical information terms, such as "tactical and technical norms", "amplitude-comparison direction-finding principle" and "maintenance envelope diagram". Acquiring these two classes of specialized vocabulary is therefore the first problem that IETM full-text retrieval must solve: only when data module (DM, Data Module) documents are segmented and matched using specialized vocabulary and general vocabulary together can users quickly find the equipment technical information they need.

The full names of shipping equipment have complex structures and often contain several character types such as digits, symbols and letters, so users usually substitute an abbreviation for the full name. For example, for the equipment name "H1604A 'Ilyushin Coase dignity' bulk carrier", users usually write "H1604A bulk carrier" or "Ilyushin Coase dignity" instead. It is therefore not enough for the dictionary to contain only the full names of equipment; handling abbreviations is a problem that segmentation and matching in the shipping equipment IETM field cannot avoid. For equipment names, there are two main ways of forming an abbreviation from the full name: condensation and truncation. Condensation splits the full name into several parts and combines characters or words from each part that can represent the original meaning, as with "H1604A bulk carrier" in the example. Truncation takes one continuous substring of the full name as the abbreviation, as with "Ilyushin Coase dignity" in the example.
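The difference between the two abbreviation forms can be sketched at the character level. The check below is a simplification introduced here for illustration (the text does not prescribe such a test): truncation is treated as "contiguous substring of the full name" and condensation as "ordered, non-contiguous selection of characters from the full name". The ship name is an English rendering of the example in this paragraph.

```python
def is_subsequence(short, full):
    """True if the characters of `short` appear in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def abbreviation_form(short, full):
    """Classify how `short` could have been derived from `full` (illustrative only)."""
    if short in full:
        return "truncation"      # one continuous substring of the full name
    if is_subsequence(short, full):
        return "condensation"    # pieces picked from several parts of the full name
    return "unrelated"


full_name = "H1604A 'Ilyushin Coase dignity' bulk carrier"
print(abbreviation_form("H1604A bulk carrier", full_name))     # condensation
print(abbreviation_form("Ilyushin Coase dignity", full_name))  # truncation
```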

Even after the specialized vocabulary has been acquired, existing segmentation methods are not matched to the characteristics of that vocabulary, and their segmentation results are flawed. A segmentation method specific to this field therefore needs to be designed around the characteristics of the extracted vocabulary in order to obtain the best matching result.

Once the required information has been retrieved, how to rank multiple retrieval results is another key problem for a full-text retrieval device and method. Data module documents contain a large number of element types of differing importance, the importance of different documents also differs, and so does the importance of different query keywords. These three factors must therefore be taken into account in designing a reasonable ranking method that yields results satisfactory to the user.

As the above shows, specialized vocabulary acquisition, abbreviation acquisition, word segmentation and retrieval result ranking are the four major problems that a shipping equipment IETM full-text retrieval device and method currently need to solve.

Summary of the invention

The object of the present invention is to provide a full-text retrieval device and method for an interactive electronic technical manual of shipping equipment that enable users to find the required shipping equipment technical information quickly and accurately.

To achieve this object, the present invention provides a full-text retrieval device for an interactive electronic technical manual of shipping equipment, comprising databases and functional modules. The databases comprise a common source database, a technical information term database, an equipment part name database, an abbreviation database, a general vocabulary database, a retrieval record database and an index database. The functional modules comprise a specialized vocabulary extraction module, an abbreviation extraction module, a first segmentation module, a user retrieval command communication module, a retrieval module, a second segmentation module and an index module. The common source database provides the vocabulary retrieval source for the specialized vocabulary extraction module and the abbreviation extraction module, and provides the content to be segmented for the first segmentation module; the specialized vocabulary extraction module extracts vocabulary and stores it in the technical information term database and the equipment part name database; the abbreviation extraction module extracts vocabulary and stores it in the abbreviation database; the first segmentation module passes the segmented content to the index module for processing;

The index module builds the index and stores it in the index database; the index database receives the retrieval content segmented by the second segmentation module, performs matching lookup, and returns the matched result set to the retrieval module for ranking; the retrieval module sends the user's retrieval content to the second segmentation module for segmentation, and also receives retrieval commands from the user retrieval command communication module and returns the ranked result set to the user retrieval command communication module; the user retrieval command communication module sends the user's retrieval commands to the retrieval record database; the retrieval record database provides a vocabulary retrieval source for the abbreviation extraction module;

The technical information term database, the equipment part name database, the abbreviation database and the general vocabulary database each provide the matching word set used for segmentation by the first segmentation module and the second segmentation module.

A retrieval method using the above full-text retrieval device for an interactive electronic technical manual of shipping equipment comprises the following steps:

Step 1: data module documents authored according to the selected interactive electronic technical manual authoring standard (i.e. the S1000D standard) are imported into the common source database; according to the requirements of that standard, the specialized vocabulary extraction module extracts the two classes of specialized vocabulary, technical information terms and equipment part names, from the data module documents in the common source database, establishes mapping relations between this vocabulary and the data module code information of the corresponding data module documents, and stores the two classes of specialized vocabulary and the mapping relations in the corresponding technical information term database and equipment part name database;

Step 2: the abbreviation extraction module extracts, from the equipment part names in the common source database, the feature quantity of the corresponding abbreviation; this feature quantity is the numeric designation or the common-name part of the equipment part name;

Step 3: the abbreviation extraction module matches the above feature quantity against the data module documents in the common source database and the user retrieval records in the retrieval record database, and determines the specific position of the feature quantity in each element of the data module documents and in the user retrieval records;

Step 4: the abbreviation extraction module determines the leading and trailing strings of the abbreviation in which the feature quantity occurs and identifies the boundary fragments of the abbreviation corresponding to the feature quantity, so that the identified abbreviation is complete; this complete abbreviation is taken as a candidate abbreviation;

Step 5: the abbreviation extraction module calculates the weight of the candidate abbreviation by the following formula (1):

W_a = \frac{n_{mic}}{n_{all}} \times \lg \frac{D_{all}}{D_{mic}} \qquad (1)

In the formula, n_mic is the number of times the candidate abbreviation occurs in specific content, where the specific content comprises the content of the data module documents whose model identification code is the same as that of the equipment part name, together with the retrieval keywords in the retrieval records for those data module documents; n_all is the sum of the number of times the candidate abbreviation occurs in all data module documents and in all retrieval records of the retrieval record database; D_all is the total number of data module documents plus the total number of retrieval records; D_mic is the number of data module documents containing the candidate abbreviation plus the number of retrieval records containing it; W_a is the weight of the candidate abbreviation and measures its ability to represent the topic. The threshold for W_a is a given value: when the weight of a candidate abbreviation is greater than or equal to the threshold, the candidate is regarded as a formal abbreviation and is stored in the abbreviation database; when the weight is less than the threshold, the candidate is not processed;

Step 6: the first segmentation module and the second segmentation module segment, respectively, the data module documents and the user retrieval keywords supplied by the retrieval module. The segmentation procedure is as follows:

Let the string to be segmented be S₁ = w₁w₂w₃…w_i…w_n, where S₁ is the string of user retrieval keywords or an item of content in a data module document, w_i is a single character of S₁, n is the length of the string, n ≥ 1, and i is the character index, ranging from 1 to n;

The abbreviation database is used to scan the string S₁; whenever an abbreviation is hit, the matched substring of S₁ is restored to the corresponding full name, until the whole of S₁ has been scanned. This yields the string S₂ = u₁u₂…u_i…u_m, where u_i is a single character of S₂ and m is the length of the string;
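The restoration scan that turns S₁ into S₂ can be sketched as follows. This is a minimal illustration, assuming the abbreviation database is available as an in-memory mapping from abbreviation to full name; the function name, the longest-first tie-breaking and the sample full name are assumptions made here, not details given in the text. The function also records where each restored full name sits in S₂, since those positions are needed when the graph is built.

```python
def restore_abbreviations(s1, abbrev_to_full):
    """Scan s1 left to right and replace every abbreviation hit by its full name.

    Returns the restored string S2 and the (start, end) character spans of the
    restored full names inside S2 (0-based, end exclusive).
    """
    # Assumption: prefer the longest abbreviation when several start at the same position.
    keys = sorted(abbrev_to_full, key=len, reverse=True)
    pieces, spans = [], []
    i = 0      # read position in S1
    pos = 0    # write position in S2
    while i < len(s1):
        for key in keys:
            if s1.startswith(key, i):
                full = abbrev_to_full[key]
                spans.append((pos, pos + len(full)))
                pieces.append(full)
                pos += len(full)
                i += len(key)
                break
        else:
            # No abbreviation starts here: copy the character unchanged.
            pieces.append(s1[i])
            pos += 1
            i += 1
    return "".join(pieces), spans


# Hypothetical abbreviation entry; spaceless English stands in for Chinese text.
abbreviations = {"HMZ-360radar": "HMZ-360shipborneradar"}
s2, spans = restore_abbreviations("inspectHMZ-360radarantenna", abbreviations)
print(s2)     # inspectHMZ-360shipborneradarantenna
print(spans)  # [(7, 28)]
```

A span (7, 28) corresponds to a directed edge from node v₇ to node v₂₈ in the notation used for the graph.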

In the first segmentation module and the second segmentation module, the string S₂ is used to build a directed acyclic graph G with m+1 nodes, numbered v₀, v₁, v₂, …, v_m, where m is the length of the string. Between each pair of adjacent nodes v_k and v_{k+1} a directed edge <v_k, v_{k+1}> is created, whose corresponding word is u_{k+1} (k = 0, 1, 2, …, m−1). If a directed edge directly connects any two nodes of G, the distance between those two nodes is taken to be 1. If a substring h₁ = u_p u_{p+1}…u_q (1 ≤ p < q) of S₂ is a full name produced by abbreviation restoration, a directed edge <v_{p−1}, v_q> is created with v_{p−1} as its start node and v_q as its end node; the word corresponding to this edge is the substring h₁;

The technical information term database and the equipment part name database are each used to match the string S₂. If there is a matching substring of maximum word length h₂ = u_a u_{a+1}…u_b (1 ≤ a < b), no directed edge <v_{a−1}, v_b> yet exists between nodes v_{a−1} and v_b, and a ≥ p+1 or b ≤ q−1 holds, then a directed edge <v_{a−1}, v_b> is created with v_{a−1} as its start node and v_b as its end node; the word corresponding to this edge is the maximum-length substring h₂;

The general vocabulary database is used to match the string S₂. If there is a matching string h₃ = u_c u_{c+1}…u_d (1 ≤ c < d) and no directed edge <v_{c−1}, v_d> exists between nodes v_{c−1} and v_d, then a directed edge <v_{c−1}, v_d> is created with v_{c−1} as its start node and v_d as its end node, and the word corresponding to this edge is h₃. If a directed edge <v_{c−1}, v_d> already exists between these nodes and its string type is the maximum-length substring h₂, then h₂ also exists in the general vocabulary database, and the type of that edge is therefore changed from h₂ to h₄;

After the directed edges have been generated, the first N paths from node v₀ to node v_m in G are determined in order from shortest to longest, with N chosen as 3. The shortest path takes all directed edge types into account; the second and third shortest paths ignore all directed edges of string types h₁ and h₂ and consider only directed edges whose corresponding words are of types h₃ and h₄, i.e. the non-optimal paths consider only matches from the general dictionary. Duplicate directed edges across the three paths are removed, and the words corresponding to the remaining directed edges of each path are output; the resulting set is the final segmentation result;
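The path-selection rule can be sketched under one possible reading: the single shortest path is taken over the full graph, and the remaining N−1 paths are the shortest paths of the graph with all h₁ and h₂ edges removed (single-character edges are kept so that a path always exists). That reading, the edge tuple layout `(start, end, word, kind)` and the function names are assumptions made for illustration. Every edge has length 1, so path length is the number of edges.

```python
from collections import defaultdict


def k_shortest_paths(m, edges, k):
    """Return up to k shortest v0 -> vm paths (fewest edges) in a DAG whose edges go forward."""
    incoming = defaultdict(list)
    for edge in edges:
        incoming[edge[1]].append(edge)
    # best[v] holds up to k (edge_count, path) pairs for routes from v0 to v.
    best = [[] for _ in range(m + 1)]
    best[0] = [(0, ())]
    for v in range(1, m + 1):
        candidates = [(count + 1, path + (edge,))
                      for edge in incoming[v]
                      for count, path in best[edge[0]]]
        candidates.sort(key=lambda c: c[0])  # stable sort keeps ties in edge order
        best[v] = candidates[:k]
    return [path for _, path in best[m]]


def segment(m, edges, n=3):
    """Words of the shortest path plus the n-1 shortest general-dictionary paths, deduplicated."""
    shortest = k_shortest_paths(m, edges, 1)
    general_only = [e for e in edges if e[3] not in ("h1", "h2")]
    others = k_shortest_paths(m, general_only, n - 1)
    seen, words = set(), []
    for path in shortest + others:
        for edge in path:
            if edge not in seen:      # drop edges already contributed by another path
                seen.add(edge)
                words.append(edge[2])
    return words


# Toy graph for S2 = "ABCD" (m = 4): single characters, one specialized word, two general words.
toy_edges = [(0, 1, "A", "char"), (1, 2, "B", "char"), (2, 3, "C", "char"), (3, 4, "D", "char"),
             (0, 4, "ABCD", "h2"), (0, 2, "AB", "h3"), (2, 4, "CD", "h3")]
print(segment(4, toy_edges))  # ['ABCD', 'AB', 'CD', 'C', 'D']
```

The output keeps the compound word from the optimal path together with its general-dictionary components, so a query for either form can match.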

Step 7: the first segmentation module stores the final segmentation results obtained above in the respective fields of the index document in the index database and sets a weight value for each field; the fields of the index document comprise a title field, a path field, a link text field, a subtitle field and a body text field;

Step 8: the weight of each index document in the index database is set, and multiple index documents are combined into segments that finally form the index file. The index document weight consists of a standard numbering system (SNS) code weight and an information code weight, both set according to the coding characteristics of data module documents. The SNS code weight follows the rule that the lower the equipment hierarchy level indicated by the SNS code, the higher the corresponding weight factor; the information code weight follows the rule that a subcategory information code is given a higher weight than a main-category one. The SNS code weight and the information code weight are then multiplied to obtain the weight of the index document;
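The two weighting rules of step 8 can be sketched as follows. The numeric factors (base value, step per hierarchy level, main-category and subcategory weights) are hypothetical placeholders chosen for illustration; the text states only the ordering rules and the final multiplication.

```python
def sns_weight(hierarchy_level, base=1.0, step=0.2):
    """Weight factor for the numbering-system code.

    hierarchy_level: 1 for the top of the equipment hierarchy, larger for lower levels.
    Lower levels get a larger factor (base and step are hypothetical values).
    """
    return base + step * (hierarchy_level - 1)


def information_code_weight(is_subcategory, main=1.0, sub=1.5):
    """Subcategory information codes weigh more than main-category ones (values hypothetical)."""
    return sub if is_subcategory else main


def index_document_weight(hierarchy_level, is_subcategory):
    """Document weight is the product of the two code weights, as described in step 8."""
    return sns_weight(hierarchy_level) * information_code_weight(is_subcategory)


# A top-level, main-category document versus a lower-level, subcategory document.
print(index_document_weight(1, False))
print(index_document_weight(3, True))
```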

Step 9: the retrieval module provides full-text retrieval to the user. It receives the user's retrieval request and invokes the query procedure, which is as follows: the user's retrieval keywords are segmented as in step 6 and then matched against the segmented content of each field of the documents in the index database formed in step 7; all matching documents are collected as the result set.
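The matching in step 9 can be illustrated with a small in-memory sketch. The index layout (document identifier mapped to per-field token sets), the field weight values and the scoring formula (sum of hit field weights multiplied by the document weight) are all assumptions introduced here; the text specifies only that segmented keywords are matched against each field and that matching documents form the result set.

```python
# Hypothetical field weights; the text names the fields but not the values.
FIELD_WEIGHTS = {"title": 3.0, "path": 1.0, "link_text": 1.5, "subtitle": 2.0, "body": 1.0}


def search(query_tokens, index, doc_weights):
    """Return (doc_id, score) pairs for every document matching at least one query token."""
    results = []
    for doc_id, fields in index.items():
        # Add the field weight once for each query token found in that field.
        hit_score = sum(FIELD_WEIGHTS[name]
                        for name, tokens in fields.items()
                        for token in query_tokens
                        if token in tokens)
        if hit_score > 0:
            results.append((doc_id, hit_score * doc_weights.get(doc_id, 1.0)))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


# Toy index with made-up document identifiers and tokens.
index = {
    "DM-A": {"title": {"radar"}, "body": {"radar", "antenna"}},
    "DM-B": {"body": {"radar"}},
    "DM-C": {"body": {"pump"}},
}
doc_weights = {"DM-A": 1.0, "DM-B": 2.0}
print(search(["radar"], index, doc_weights))  # [('DM-A', 4.0), ('DM-B', 2.0)]
```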

The present invention addresses the problems that existing full-text retrieval devices and methods exhibit when used in the specialized domain of shipping equipment interactive electronic technical manuals: a lack of specialized vocabulary and its abbreviations, a lack of a suitable segmentation algorithm, and unoptimized ranking of retrieval results. By analysing the data module document structure and the specific element tag features of the selected interactive electronic technical manual authoring standard (i.e. the S1000D standard), together with the types and characteristics of the specialized vocabulary that occurs in shipping equipment technical information, the invention extracts the specialized vocabulary and its abbreviations; designs a targeted segmentation algorithm according to the characteristics of the several vocabulary classes; stores the segmented data module document content in the index so that information can be located quickly; and sets weight values for the various factors to solve the ranking problem, thereby completing the construction of the interactive electronic technical manual full-text retrieval device and method. The device and method combine the element tag features and the document content of data module documents, query using specialized vocabulary and increase the weight of specialized vocabulary in documents and retrieval keywords, so that the system can query at a certain semantic level and the returned results are closer to the user's retrieval intent, thereby ensuring a high recall rate and accuracy of the retrieval device.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the full-text retrieval device for an interactive electronic technical manual of shipping equipment according to the present invention.

In the figure: 1 - common source database, 2 - specialized vocabulary extraction module, 3 - abbreviation extraction module, 4 - first segmentation module, 5 - technical information term database, 6 - equipment part name database, 7 - abbreviation database, 8 - general vocabulary database, 9 - retrieval record database, 10 - user retrieval command communication module, 11 - retrieval module, 12 - second segmentation module, 13 - index database, 14 - index module.

Embodiment

The present invention is described in further detail below with reference to the drawing and specific embodiments:

As shown in Fig. 1, the full-text retrieval device for an interactive electronic technical manual of shipping equipment comprises databases and functional modules. The databases comprise a common source database 1, a technical information term database 5, an equipment part name database 6, an abbreviation database 7, a general vocabulary database 8, a retrieval record database 9 and an index database 13. The functional modules comprise a specialized vocabulary extraction module 2, an abbreviation extraction module 3, a first segmentation module 4, a user retrieval command communication module 10, a retrieval module 11, a second segmentation module 12 and an index module 14. The common source database 1 provides the vocabulary retrieval source for the specialized vocabulary extraction module 2 and the abbreviation extraction module 3, and provides the content to be segmented for the first segmentation module 4; the specialized vocabulary extraction module 2 extracts vocabulary and stores it in the technical information term database 5 and the equipment part name database 6; the abbreviation extraction module 3 extracts vocabulary and stores it in the abbreviation database 7; the first segmentation module 4 passes the segmented content to the index module 14 for processing;

The index module 14 builds the index and stores it in the index database 13; the index database 13 receives the retrieval content segmented by the second segmentation module 12, performs matching lookup, and returns the matched result set to the retrieval module 11 for ranking; the retrieval module 11 sends the user's retrieval content to the second segmentation module 12 for segmentation, and also receives retrieval commands from the user retrieval command communication module 10 and returns the ranked result set to the user retrieval command communication module 10 for viewing; the user retrieval command communication module 10 sends the user's retrieval commands to the retrieval record database 9; the retrieval record database 9 provides a vocabulary retrieval source for the abbreviation extraction module 3;

The technical information term database 5, the equipment part name database 6, the abbreviation database 7 and the general vocabulary database 8 each provide the matching word set used for segmentation by the first segmentation module 4 and the second segmentation module 12.

Step 1: data module documents authored according to the selected interactive electronic technical manual authoring standard (S1000D in this embodiment) are imported into the common source database 1; according to the requirements of that standard, the specialized vocabulary extraction module 2 extracts the two classes of specialized vocabulary, technical information terms and equipment part names, from the data module (DM, Data Module) documents in the common source database 1, establishes mapping relations between this vocabulary and the data module code information of the corresponding data module documents, and stores the two classes of specialized vocabulary and the mapping relations in the corresponding technical information term database 5 and equipment part name database 6;

Step 2: the abbreviation extraction module 3 extracts, from the equipment part names (full names) in the common source database 1, the feature quantity of the corresponding abbreviation; this feature quantity is the numeric designation or the common-name part of the equipment part name. (For example, for the full equipment name "H1604A 'Ilyushin Coase dignity' bulk carrier", an abbreviation must contain the numeric designation "1604", the common name "Ilyushin Coase dignity", or both. Such feature quantities can therefore be used to locate positions where an abbreviation may occur. The other strings of the full equipment name are then matched against the strings before and after the feature quantity to identify the boundary fragments of the abbreviation, so that the identified abbreviation is the longest word. The weight of the abbreviation is calculated and compared with the threshold, and the mapping between the full equipment name and the abbreviation is built and stored in the abbreviation dictionary, which completes abbreviation extraction.);

The abbreviation extraction module 3 extracts the abbreviation feature quantity from the equipment part names (full names) in the common source database 1 as follows. Because each class of shipping equipment has a fixed naming rule, the rule can be used to determine the name type and to split the equipment name into the components defined by the rule, which completes the extraction of the feature quantity. Let the full shipping equipment name be W₀ = w₁w₂…w_n, where w_i is the i-th character of the full name. First, a grammar tool such as JAPE (a Java Annotation Patterns Engine) is used to write regular expressions for the naming rules of each equipment class. These regular expressions are used to determine the name type of each W₀ in the equipment part name dictionary formed in step 1, and W₀ is split according to the rule it hits, giving the abbreviation feature quantity W₁ = w_p…w_q, 1 ≤ p < q ≤ n;
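Rule-based feature extraction of this kind can be sketched with ordinary regular expressions. The sketch uses Python's `re` module as a stand-in for JAPE, and the two naming rules below are hypothetical patterns written only to fit the example names quoted in this document; real naming rules for each equipment class would have to be supplied.

```python
import re

# Hypothetical naming rules, one regular expression per equipment name type.
NAMING_RULES = [
    # e.g. H1604A 'Ilyushin Coase dignity' bulk carrier
    ("ship", re.compile(r"^[A-Z](?P<number>\d+)[A-Z]?\s+'(?P<nickname>[^']+)'\s+.+$")),
    # e.g. SMR-7200 marine radar
    ("device", re.compile(r"^[A-Z]+-(?P<number>\d+)\s+.+$")),
]


def extract_features(full_name):
    """Return (name_type, feature_quantities) for the first naming rule the full name hits."""
    for name_type, pattern in NAMING_RULES:
        match = pattern.match(full_name)
        if match:
            # Named groups carry the numeric designation and, where present, the common name.
            return name_type, [value for value in match.groupdict().values() if value]
    return None, []


print(extract_features("H1604A 'Ilyushin Coase dignity' bulk carrier"))
# ('ship', ['1604', 'Ilyushin Coase dignity'])
print(extract_features("SMR-7200 marine radar"))
# ('device', ['7200'])
```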

Step 3: the abbreviation extraction module 3 matches the above feature quantity against the data module documents in the common source database 1 and the user retrieval records in the retrieval record database 9, and determines the specific position of the feature quantity in each element of the data module documents and in the user retrieval records. Specifically, let the hit string be W₂; then W₂ = W₁. To prevent strings unrelated to the IETM system from becoming abbreviation candidates, the model identification code (MIC) of the data module (DM) document in which W₂ occurs, or of the access link corresponding to the retrieval record, must be the same as the MIC mapped to the full name W₀ that corresponds to W₁;

Step 4: the abbreviation extraction module 3 determines the leading and trailing strings of the abbreviation in which the feature quantity occurs and identifies the boundary fragments of the abbreviation corresponding to the feature quantity, so that the identified abbreviation is complete; this complete abbreviation is taken as a candidate abbreviation (for example, in the phrase "HMZ-360 radar identification target", "360" is the feature quantity and "HMZ-360 radar" is the longest word of the abbreviation; recognizing only "HMZ-360" or "360 radar" would be incomplete);

Step 5: the abbreviation extraction module 3 calculates the weight of the candidate abbreviation by the following formula (1):

W_a = \frac{n_{mic}}{n_{all}} \times \lg \frac{D_{all}}{D_{mic}} \qquad (1)

In the formula, n_mic is the number of times the candidate abbreviation occurs in specific content, where the specific content comprises the content of the data module documents whose model identification code (MIC, Model Identification Code) is the same as that of the equipment part name, together with the retrieval keywords in the retrieval records for those data module documents; n_all is the sum of the number of times the candidate abbreviation occurs in all data module documents and in all retrieval records of the retrieval record database 9 (the quotient of the two measures the frequency of the candidate abbreviation; the higher the value, the more of the candidate's occurrences fall within the specific IETM system); D_all is the total number of data module documents plus the total number of retrieval records; D_mic is the number of data module documents containing the candidate abbreviation plus the number of retrieval records containing it (the logarithm measures how widespread the candidate abbreviation is; the higher the value, the more the candidate is concentrated in a small number of data module documents); W_a is the weight of the candidate abbreviation and measures its ability to represent the topic. The threshold for W_a is a given value, set to 2 here. When the weight of a candidate abbreviation is greater than or equal to the threshold (indicating a strong association with the topic of the IETM system of the specific equipment), the candidate is regarded as a formal abbreviation and is stored in the abbreviation database 7; when the weight is less than the threshold, the candidate is not processed;
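Formula (1) and the threshold test can be written out directly. In the sketch below, lg is taken as the base-10 logarithm, the function names are chosen here for illustration, and the counts in the usage lines are made-up numbers rather than data from the invention.

```python
import math


def abbreviation_weight(n_mic, n_all, d_all, d_mic):
    """Formula (1): W_a = (n_mic / n_all) * lg(D_all / D_mic)."""
    return (n_mic / n_all) * math.log10(d_all / d_mic)


def is_formal_abbreviation(weight, threshold=2.0):
    """A candidate becomes a formal abbreviation when its weight reaches the threshold."""
    return weight >= threshold


# Made-up counts: a candidate concentrated in one equipment model and in few documents...
w_focused = abbreviation_weight(n_mic=45, n_all=50, d_all=20000, d_mic=40)
# ...versus a candidate spread across many documents of many models.
w_diffuse = abbreviation_weight(n_mic=10, n_all=50, d_all=20000, d_mic=2000)

print(round(w_focused, 3), is_formal_abbreviation(w_focused))
print(round(w_diffuse, 3), is_formal_abbreviation(w_diffuse))
```

With these numbers the first candidate scores about 0.9 × lg 500 ≈ 2.43 and passes the threshold of 2, while the second scores 0.2 × lg 10 = 0.2 and is left unprocessed.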

Step 6: respectively word segmentation processing is carried out to the user search keyword that data module documentation and retrieval module 11 provide in first participle module 4 and the second word-dividing mode 12, extract in the multiclass vocabulary formed at specialized vocabulary extraction module 2 and abbreviation extraction module 3, there is the compound vocabulary combined by multiple simple words, by there are many correct path after dictionary cutting in these vocabulary, cutting can be continued for " radar/test/device " as equipped title " radar testing device ", if only adopt single cutting result for this kind of compound vocabulary, be rejected causing matching way correct in a large number, obtain word segmentation result cannot meet the demand of user search, the present invention adopts on the basis of original N-shortest path segmenting method, in conjunction with Words ' Characteristics in the multiclass specialized vocabulary dictionary generated and existing universal word dictionary, when carrying out participle, carry out 3 dictionary matching processs altogether, first the abbreviation dictionary utilizing step 2 to obtain mates, the abbreviation existed in scanning technique information, and be reduced to corresponding equipment part title primitive, secondly the technical information term dictionary obtained by step 1 and equipment part title dictionary mate miss content of text, then by general dictionary, all content of text after reduction primitive are mated, after coupling, export satisfactory N paths, the result set that mulitpath is formed is final word segmentation result, and the detailed process of word segmentation processing is:

The abbreviation database 7 is used to scan the string to be segmented, S₁. Whenever an abbreviation is hit, the matched substring of S₁ is restored to its corresponding full form, until S₁ has been scanned completely. This yields the string S₂ = u₁u₂…u_i…u_m, where u_i is a single character of S₂ and m is the length of the string;
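The restoration scan can be sketched as follows. The dictionary layout (abbreviation mapped to full form) and the longest-hit-first preference are assumptions made for this illustration; the description only states that hit substrings are replaced by their full forms.

```python
def restore_abbreviations(s1: str, abbreviations: dict[str, str]) -> str:
    """Scan s1 left to right and replace each abbreviation hit with its full form."""
    # Assumption: when several abbreviations match at one position, take the longest.
    keys = sorted((k for k in abbreviations if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(s1):
        for key in keys:
            if s1.startswith(key, i):
                out.append(abbreviations[key])
                i += len(key)
                break
        else:
            # No abbreviation starts here: copy the character unchanged.
            out.append(s1[i])
            i += 1
    return "".join(out)


# Hypothetical abbreviation entry, for illustration only.
s2 = restore_abbreviations("check RTD power", {"RTD": "radar testing device"})
print(s2)
```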

In the first segmentation module 4 and the second segmentation module 12, the string S₂ is used to build a directed acyclic graph G with m+1 nodes, numbered v₀, v₁, v₂, …, v_m, where m is the length of the string. A directed edge <v_k, v_{k+1}> is set up between every two adjacent nodes v_k and v_{k+1} (k = 0, 1, 2, …, m-1), and the word corresponding to this edge is u_{k+1}. If a directed edge directly connects two nodes of G, the distance between these two nodes is taken to be 1. If the substring h₁ = u_p u_{p+1}…u_q (1 ≤ p < q) of S₂ is a full form obtained by abbreviation restoration, a directed edge <v_{p-1}, v_q> is set up with v_{p-1} as the start node and v_q as the end node, and the word corresponding to this edge is the substring h₁;

The technical information term database 5 and the equipment part name database 6 are each matched against S₂. If there is a matching maximum-word-length substring h₂ = u_a u_{a+1}…u_b (1 ≤ a < b), no directed edge <v_{a-1}, v_b> exists between its nodes v_{a-1} and v_b, and a ≥ p+1 or b ≤ q-1 holds, then a directed edge <v_{a-1}, v_b> is set up with v_{a-1} as the start node and v_b as the end node, and the word corresponding to this edge is the maximum-word-length substring h₂;

The general vocabulary database 8 is matched against S₂. If there is a matching string h₃ = u_c u_{c+1}…u_d (1 ≤ c < d) and no directed edge <v_{c-1}, v_d> exists between its nodes v_{c-1} and v_d, a directed edge <v_{c-1}, v_d> is set up with v_{c-1} as the start node and v_d as the end node, and the word corresponding to this edge is h₃. If a directed edge <v_{c-1}, v_d> already exists between v_{c-1} and v_d and its string type is the maximum-word-length substring h₂, then h₂ also exists in the general vocabulary database 8, so its type is changed from h₂ to h₄ to simplify the subsequent output processing;

After the directed edges have been generated, the first N paths from node v₀ to node v_m in G are determined in order from shortest to longest, with N chosen as 3. The shortest path takes directed edges of all types into account, whereas the second and third shortest paths ignore every directed edge whose string type is h₁ or h₂ and consider only directed edges whose corresponding word is of type h₃ or h₄; that is, the non-optimal paths use only the matching results of the general dictionary (this prevents the case in which the three cuts of the N-shortest-path method still fail to meet the retrieval needs, and avoids having to choose an excessively large N to reach a good segmentation granularity). The directed edges repeated among the three paths are removed, the words corresponding to the remaining directed edges of each path are output, and the resulting set is the final segmentation result;
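The graph construction and path selection described above can be illustrated with the simplified Python sketch below. It is not the patented procedure itself: it labels any substring equal to a restored full form as h₁ (rather than tracking the restored positions), adds every dictionary match instead of only the maximum-word-length match, omits the position condition on a and b, and assumes that single-character edges remain usable in the second and third paths. All function names are hypothetical.

```python
import heapq
from typing import NamedTuple


class Edge(NamedTuple):
    start: int  # index of the start node v_start
    end: int    # index of the end node v_end
    word: str
    kind: str   # "char", "h1", "h2", "h3" or "h4"


ALL_KINDS = {"char", "h1", "h2", "h3", "h4"}
GENERAL_KINDS = {"char", "h3", "h4"}  # assumption: single characters stay usable


def build_dag(s2, full_forms, terms, general):
    """Return {(start, end): Edge} over nodes v_0 .. v_m for the string s2."""
    m = len(s2)
    edges = {(k, k + 1): Edge(k, k + 1, s2[k], "char") for k in range(m)}
    for start in range(m):
        for end in range(start + 2, m + 1):
            word = s2[start:end]
            if word in full_forms:
                kind = "h1"
            elif word in terms:
                # A specialized term also found in the general dictionary becomes h4.
                kind = "h4" if word in general else "h2"
            elif word in general:
                kind = "h3"
            else:
                continue
            edges[(start, end)] = Edge(start, end, word, kind)
    return edges


def n_shortest_paths(edges, m, n, allowed):
    """Up to n shortest v_0 -> v_m paths (unit edge length) over the allowed edge kinds."""
    incoming = {}
    for edge in edges.values():
        if edge.kind in allowed:
            incoming.setdefault(edge.end, []).append(edge)
    best = {0: [(0, ())]}  # node -> list of (length, path) kept in ascending length
    for node in range(1, m + 1):
        candidates = [
            (length + 1, path + (edge,))
            for edge in incoming.get(node, [])
            for length, path in best.get(edge.start, [])
        ]
        best[node] = heapq.nsmallest(n, candidates, key=lambda c: c[0])
    return [path for _, path in best.get(m, [])]


def segment(s2, full_forms, terms, general, n=3):
    """Merge the words of the shortest path and of the general-dictionary paths."""
    edges = build_dag(s2, full_forms, terms, general)
    m = len(s2)
    paths = n_shortest_paths(edges, m, 1, ALL_KINDS)
    for path in n_shortest_paths(edges, m, n, GENERAL_KINDS):
        if path not in paths and len(paths) < n:
            paths.append(path)
    seen, words = set(), []
    for path in paths:
        for edge in path:
            if edge not in seen:  # drop directed edges repeated across paths
                seen.add(edge)
                words.append(edge.word)
    return words


# Toy input: "abcd" is a specialized term, "ab" and "cd" are general words.
print(segment("abcd", set(), {"abcd"}, {"ab", "cd"}))
```

In the toy input, the shortest path keeps the compound term whole, while the general-dictionary paths contribute its component words, so a query on either granularity can match.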

Step 7: In the first segmentation module 4, the final segmentation results obtained above are stored in the respective fields of the index documents in the index database 13, and a weight is set for each field to provide parameters for ranking the final search results. Multiple documents form a segment, and the segments finally form the index file, which is stored on disk or in memory. The fields of an index document are the title field, the path field, the link text field, the subtitle field and the body field;

Step 8: The weight of each index document in the index database 13 is set, and multiple index documents form segments that finally form the index file, which is stored on disk or in memory. The index document weight is divided into a Standard Numbering System (SNS) code weight and an information code weight, which are set for the different SNS codes and information codes according to the coding characteristics of the data module documents. The SNS code weight follows the rule that the lower the equipment hierarchy level encoded by the SNS code, the higher the corresponding weight factor; the information code weight follows the rule that a subcategory information code receives a higher weight than a main category. The SNS code weight and the information code weight are then multiplied to obtain the weight of the index document;

Step 9: The retrieval module 11 provides full-text retrieval to the user. It receives the user's retrieval request and invokes the query procedure, which is as follows: the user's search keywords are segmented by invoking step 6, the result is matched against the segmented content of each field of the documents in the index formed in step 7, and all matching documents are collected as the result set.

In step 7 of the above technical solution, the fields of an index document and the basis for their weights are as follows:

The title field stores the segmentation result of the data module title <dmtitle>. Terms appearing in the title field reflect the topic of the whole data module document, so the weight of the title field is set to 10;

The path field identifies the access path of the document; it stores the data module code information to provide this path identification. The path field takes part in neither segmentation nor retrieval, so no weight needs to be set for it;

The link text field stores the segmentation result of the text restored from data module code links (as in a web page, data module content contains links; these links appear in the form of data module codes, and the user can reach other data modules by clicking them. Step 1 establishes a mapping between data module codes and vocabulary, and this mapping is used here to restore each code to its vocabulary content, which is then segmented). The field also enables retrieval over link anchor text: when a search keyword hits in the link text field, the data module document that the link points to may be the content the user is looking for. The weight of the link text field is set to 3;

The subtitle field stores the segmentation result of the <title> elements, which reflect local topic information (<title> is the tag for a local topic and holds the local topic content); the weight of the subtitle field is set to 5;

The body field stores the segmentation result of the other technical information in the data module document (that is, the body content other than subtitles and link information); the weight of the body field is set to 1.
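The field layout above can be pictured with a small in-memory inverted index. This is only a sketch under assumed names (FIELD_BOOSTS, add_to_index, the sample document identifier and path value are hypothetical); the invention itself builds on the Lucene index structure rather than on a Python dictionary.

```python
from collections import defaultdict

# Field weights as listed in the description.
FIELD_BOOSTS = {"title": 10, "subtitle": 5, "link_text": 3, "body": 1}
# The path field is stored to locate the document but is never searched.
STORED_ONLY_FIELDS = {"path"}


def add_to_index(index, doc_id, fields):
    """fields maps a field name to its list of segmented terms (or a stored value)."""
    for name, terms in fields.items():
        if name in STORED_ONLY_FIELDS:
            continue
        for term in terms:
            index[(name, term)].add(doc_id)


index = defaultdict(set)
add_to_index(index, "dm-001", {
    "title": ["radar", "testing"],
    "path": "hypothetical-dmc-path",
    "body": ["radar", "power"],
})
print(sorted(index))
```

Keeping the field name in the posting key lets a later scoring step look up the weight of the field in which a term was hit.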

Step 1 of the above technical solution specifically comprises the following steps:

Step 101: Specific text content is selected to extract the two classes of specialized vocabulary, equipment part names and technical information terms. The specific elements are the technical name <techname> and the information name <infoname>: in the data module title, <techname> describes the equipment part name and <infoname> describes the technical information term, so extracting the text of these two kinds of elements completes the extraction of the specialized vocabulary;

Step 102: The mapping between the specialized vocabulary and the corresponding data module code (DMC, Data Module Code) information is established. It consists of the mapping between Standard Numbering System (SNS) codes and equipment part names and the mapping between information codes <incode> and technical information terms. Link access information is an important resource in retrieval, but link references in data module documents do not provide anchor text; they are realized by referencing data module codes, so the data module code information must be restored to text before it can enter the retrieval scope. The SNS sub-element of the data module code describes the hierarchical position, within the whole equipment, of the component described by the current data module document; it can therefore be mapped to the equipment part name described by the technical name <techname>, so that SNS codes can be retrieved through equipment part names. Likewise, a mapping is established between the information code sub-element <incode> of the DMC and the information name <infoname>, so that information codes can be retrieved through technical information terms. Because the codes corresponding to the same technical information or equipment part name may differ between the IETM systems of different shipping equipment, the corresponding model identification code (MIC, Model Identification Code) is added to the information code and the SNS code to prevent inconsistent mappings. The MIC defines the name and model of the equipment and is a code, formulated by an authoritative body, that uniquely identifies the equipment;

Step 103: The extracted vocabulary and the corresponding code information are stored in the equipment part name dictionary and the technical information term dictionary, respectively. The equipment part name dictionary stores equipment names or part names together with the corresponding SNS code information, and the technical information term dictionary stores technical information terms together with the corresponding information code information.
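Steps 101 to 103 can be sketched in Python with the standard library XML parser. The sample XML fragment, the code values and the choice to pass the MIC, SNS and information code in as arguments are assumptions made for illustration; the description does not specify how the codes are read from the data module.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced data module fragment.
SAMPLE_XML = """<dmodule>
  <dmtitle>
    <techname>Radar testing device</techname>
    <infoname>Fault isolation</infoname>
  </dmtitle>
</dmodule>"""


def extract_vocabulary(xml_text, mic, sns, incode):
    """Return (part_names, info_terms), each mapping a word to its (MIC, code) pair."""
    root = ET.fromstring(xml_text)
    techname = root.findtext(".//techname")
    infoname = root.findtext(".//infoname")
    part_names, info_terms = {}, {}
    if techname:
        # Equipment part name -> MIC plus SNS code.
        part_names[techname.strip()] = (mic, sns)
    if infoname:
        # Technical information term -> MIC plus information code.
        info_terms[infoname.strip()] = (mic, incode)
    return part_names, info_terms


part_names, info_terms = extract_vocabulary(SAMPLE_XML, "XX", "01-20-00", "420")
print(part_names, info_terms)
```

Prefixing each code with the MIC keeps entries from different equipment models apart, which is the purpose the description gives for adding the MIC.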

In step 4 of the above technical solution, because abbreviations of shipping equipment arise in two ways, by condensation and by truncation, every character of an abbreviation must be a character of the full form (the full name to which the abbreviation corresponds), and the order of the characters in the abbreviation must match their relative order in the full form. A character to the left or right of W₂ is read in; call this candidate character w_c. It is then checked whether w_c occurs in W₀ and whether its order relative to W₂ is unchanged in W₀. If so, w_c is judged to be a boundary character of the candidate abbreviation and W₂ is set to w_c W₂ or W₂ w_c. If not, w_c is not a character of the abbreviation, the judgment in the current direction stops, and the boundary on that side is fixed. This process is repeated until the boundary judgment has stopped in both directions; W₂ is then the final candidate abbreviation.
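The boundary judgment amounts to growing the candidate while it remains an ordered subsequence of the full form. The sketch below reads W₀ as the full equipment part name and W₂ as the current candidate, as this paragraph suggests; it expands fully to the left before expanding to the right, which is an assumption, since the description does not fix the order. The function names and the sample strings are hypothetical.

```python
def is_subsequence(candidate: str, full_name: str) -> bool:
    """True if the characters of candidate occur in full_name in the same order."""
    it = iter(full_name)
    return all(ch in it for ch in candidate)


def expand_candidate(text: str, start: int, end: int, full_name: str) -> str:
    """Grow text[start:end] (W2) while it stays a subsequence of full_name (W0)."""
    # Left direction: stop at the first character that breaks the ordering.
    while start > 0 and is_subsequence(text[start - 1:end], full_name):
        start -= 1
    # Right direction: same test on the character following the candidate.
    while end < len(text) and is_subsequence(text[start:end + 1], full_name):
        end += 1
    return text[start:end]


# The feature quantity "3" sits at index 9 of the sample text.
print(expand_candidate("check RTD3 now", 9, 10, "RADAR TEST DEVICE 3"))
```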

In step 7 of the above technical solution, the index is used to locate the required text information quickly and thereby avoids a large number of read and write operations during retrieval; it relies on a specific data structure to locate terms quickly. Building on the general-purpose full-text search toolkit Lucene, the present invention designs an index structure suited to the IETM full-text retrieval device and method. The index structure in Lucene has five levels, from high to low: index, segment, document, field and term. The term is the basic unit of the index and stores each string produced by segmentation. A field holds one kind of separately indexed information within a single document, such as the title, body or links; fields are structures that the user can design freely in order to support retrieval over different types of documents. The document is the basic unit for building the index; in the present invention, one index document stores the processed information of one data module document. A segment consists of multiple documents and can be regarded as a small index; multiple segments finally form the index.

In step 8 of the above technical solution, the Standard Numbering System (SNS) code weight is determined by the equipment part level that the SNS code represents. The digits of the SNS code describe the equipment level of the equipment part covered by the current data module: the SNS codes 00-00-00, 0a-00-00, 0a-b0-00, 0a-bd-00 and 0a-bd-fg (a ≠ 0, b ≠ 0, d ≠ 0, f ≠ 0 or g ≠ 0) describe equipment parts at the equipment level, the system level, the subsystem level, the sub-subsystem level and the lower equipment division levels of the hierarchical structure, respectively. When a search keyword hits a document, a data module document with a higher SNS level may relate to the information the user needs only in part of its content, whereas in a data module document with a lower SNS level the needed information makes up a larger share of the document content. Therefore, the lower the equipment hierarchy level of the SNS code, the higher the weight factor of the corresponding document: the SNS code weights for the equipment level, system level, subsystem level, sub-subsystem level and lower equipment division levels are set to 1, 2, 3, 4 and 5, respectively;

The information code weight is determined by the size of the information category that the information code describes. The information codes a00 and abc (b ≠ 0, c ≠ 0) describe a major category and a subcategory of technical information, respectively. When a search keyword hits a document, an information code of finer granularity is more likely to be relevant to what the user needs, so a subcategory information code is given a higher weight than a major category: the present invention sets the major category weight to 1 and the subcategory weight to 2.
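The two weight rules and their product (step 8) can be written as a short Python sketch. It assumes SNS codes are given as three hyphen-separated two-character groups and information codes as three characters, and it treats any information code not ending in "00" as a subcategory; codes outside the patterns listed above are not addressed by the description, so their handling here is an assumption. The function names are hypothetical.

```python
def sns_weight(sns: str) -> int:
    """Map an SNS code such as '01-23-00' to a weight from 1 to 5."""
    system, subsystem, unit = sns.split("-")
    if unit != "00":
        return 5  # lower equipment division levels (0a-bd-fg)
    if subsystem[1] != "0":
        return 4  # sub-subsystem level (0a-bd-00)
    if subsystem[0] != "0":
        return 3  # subsystem level (0a-b0-00)
    if system != "00":
        return 2  # system level (0a-00-00)
    return 1      # equipment level (00-00-00)


def info_code_weight(incode: str) -> int:
    """Major category (a00) weighs 1, subcategory weighs 2."""
    return 1 if incode[1:] == "00" else 2


def document_weight(sns: str, incode: str) -> int:
    """Index document weight: SNS code weight times information code weight."""
    return sns_weight(sns) * info_code_weight(incode)


# Hypothetical codes: a sub-subsystem document with a subcategory information code.
print(document_weight("01-23-00", "420"))
```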

In step 9 of the above technical solution, the ranking of the result set is computed with the vector space model (VSM, Vector Space Model), using the following formula:

\begin{cases}
\mathrm{coord}(q, d) = Num_{dt} / Num_{qt} \\
S_{qd} = \mathrm{coord}(q, d) \cdot \mathrm{querytnorm}(q) \cdot S_{dt} \\
S_{dt} = \sum_{i=1}^{n} \left( \mathrm{tf}(t_i, d) \cdot \mathrm{idf}(t_i)^2 \cdot Boost_{ti} \cdot \mathrm{norm}(t_i, d) \right) \\
\mathrm{norm}(t, d) = Boost_d \cdot \prod Boost_f / \sqrt{Num_{term}}
\end{cases} \qquad (2)

Let d be a document in the index and q the user's search keyword, and let the segmentation of q be t₁/t₂/…/t_n (S_dt is the sum over i from 1 to n and thus includes t_n), where n ≥ 1 is the total number of terms after segmentation, t_i is a single keyword term, and i is an index between 1 and n. S_qd is the score of index document d for the search keyword q and serves as the ranking factor: the higher its value, the nearer the top of the result set the document is ranked. coord(q, d) measures how many distinct terms are hit in index document d; it is obtained as the quotient of Num_dt, the number of distinct terms occurring in index document d, and Num_qt, the number of distinct terms in the search keyword q. querytnorm(q) is an adjustment factor that does not affect the ranking; its value can be set to scale the scores as a whole. S_dt is the sum of the scores of all single keyword terms t_i hit in index document d. tf(t_i, d) is the term frequency score of t_i in index document d. idf(t_i) reflects in how many documents t_i occurs: the higher the value, the fewer documents contain t_i and the more strongly t_i is related to a particular topic. Boost_ti is the weight of t_i, determined by the dictionary that t_i matched during segmentation. norm(t, d) combines the weights of index document d with a length factor: Boost_d is the weight of index document d, whose value is determined by the weight settings of the index module for the index document described in step 7; Boost_f is the weight of the field of index document d in which t_i is hit, whose value is determined by the field weight settings of the index module described in step 7; Num_term is the total number of segmented terms in index document d, and the larger this value, the lower the norm(t, d) score;

The weight of a search keyword term is determined by the type of dictionary matched during segmentation, on the following basis:

(1) Terms hit in the abbreviation dictionary, the technical information term dictionary or the equipment part name dictionary reflect the user's search intention strongly, and their weight is set to 5.

(2) Terms matched in the general dictionary reflect the user's search intention only partially, and their weight is set to 2.

(3) Single characters produced during segmentation are too fine-grained and introduce too much noise during retrieval, so their weight is set to 1.
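Formula 2 and the term weights above can be combined in a short Python sketch. The tf and idf values are taken as inputs because the description does not define how they are computed; the data layout of a hit, the function names and the sample numbers are hypothetical.

```python
import math

# Term weights by matched dictionary type, as listed above.
TERM_BOOSTS = {"specialized": 5, "general": 2, "single_char": 1}


def coord(num_dt: int, num_qt: int) -> float:
    """coord(q, d) = Num_dt / Num_qt."""
    return num_dt / num_qt


def norm(doc_boost: float, field_boosts, num_doc_terms: int) -> float:
    """norm(t, d) = Boost_d * prod(Boost_f) / sqrt(Num_term)."""
    return doc_boost * math.prod(field_boosts) / math.sqrt(num_doc_terms)


def score(hits, num_query_terms, doc_boost, num_doc_terms, query_norm=1.0):
    """S_qd for one document; hits holds one dict per distinct query term hit in d."""
    s_dt = sum(
        h["tf"] * h["idf"] ** 2 * h["term_boost"]
        * norm(doc_boost, h["field_boosts"], num_doc_terms)
        for h in hits
    )
    return coord(len(hits), num_query_terms) * query_norm * s_dt


# One of two query terms hits the title field (weight 10) of a document
# whose weight is 8 and which contains 16 segmented terms.
hits = [{"tf": 1.0, "idf": 2.0,
         "term_boost": TERM_BOOSTS["specialized"], "field_boosts": [10]}]
print(score(hits, num_query_terms=2, doc_boost=8, num_doc_terms=16))
```

Because the document weight, field weight and term weight all enter as multiplicative factors, a specialized term hit in the title of a low-level data module dominates a general word hit in the body text.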

After ranking, the retrieval module outputs the ranked result set in a fixed format. Each returned results page contains ten search results; each result shows the text fragment in which the hit term occurs, with the hit term highlighted in red, and gives the title and data module code (DMC, Data Module Code) information of the hit document. The user opens the original data module document by clicking the hyperlink on the title.

Content not described in detail in this specification belongs to the prior art known to those skilled in the art.

Claims

1. A full-text retrieval device for an interactive electronic technical manual of shipping equipment, characterized in that it comprises databases and functional modules, wherein the databases comprise a common source database (1), a technical information term database (5), an equipment part name database (6), an abbreviation database (7), a general vocabulary database (8), a retrieval record database (9) and an index database (13), and the functional modules comprise a specialized vocabulary extraction module (2), an abbreviation extraction module (3), a first segmentation module (4), a user retrieval command communication module (10), a retrieval module (11), a second segmentation module (12) and an index module (14); the common source database (1) provides the vocabulary source for the specialized vocabulary extraction module (2) and the abbreviation extraction module (3) and provides the content to be segmented for the first segmentation module (4); the specialized vocabulary extraction module (2) extracts vocabulary and stores it in the technical information term database (5) and the equipment part name database (6); the abbreviation extraction module (3) extracts vocabulary and stores it in the abbreviation database (7); the first segmentation module (4) passes the segmented content to the index module (14) for processing;

The index module (14) builds the index and stores it in the index database (13); the index database (13) receives the search content segmented by the second segmentation module (12), performs matching, and returns the matched result set to the retrieval module (11) for ranking; the retrieval module (11) sends the user's search content to the second segmentation module (12) for segmentation, and also receives retrieval commands from the user retrieval command communication module (10) and returns the ranked result set to the user retrieval command communication module (10); the user retrieval command communication module (10) sends the user's retrieval commands to the retrieval record database (9); the retrieval record database (9) provides a vocabulary source for the abbreviation extraction module (3);

The technical information term database (5), the equipment part name database (6), the abbreviation database (7) and the general vocabulary database (8) each provide the matching word sets used by the first segmentation module (4) and the second segmentation module (12) during segmentation.

2. A retrieval method using the full-text retrieval device for an interactive electronic technical manual of shipping equipment according to claim 1, characterized in that it comprises the following steps:

Step 1: Data module documents written in conformance with the selected interactive electronic technical manual authoring standard are imported into the common source database (1); the specialized vocabulary extraction module (2), following the requirements of the selected standard, extracts the two classes of specialized vocabulary, technical information terms and equipment part names, from the data module documents in the common source database (1), establishes their mapping to the data module code information of the corresponding data module documents, and stores the two classes of specialized vocabulary and the mappings in the technical information term database (5) and the equipment part name database (6), respectively;

Step 2: The abbreviation extraction module (3) extracts the feature quantities of the corresponding abbreviations from the equipment part names in the common source database (1); a feature quantity is the numeric designation or the common-name portion of an equipment part name;

Step 3: The abbreviation extraction module (3) matches the feature quantities against the data module documents in the common source database (1) and the user retrieval records in the retrieval record database (9), and determines the specific position of each element of the feature quantities in the data module documents and user retrieval records;

Step 4: The abbreviation extraction module (3) determines the head and tail characters of the abbreviation in which a feature quantity occurs and identifies the boundary fragments of the abbreviation corresponding to the feature quantity, so that the identified abbreviation is a complete abbreviation; this complete abbreviation is taken as a candidate abbreviation;

Step 5: The abbreviation extraction module (3) calculates the weight of the candidate abbreviation by the following formula 1:

W_a = \frac{n_{mic}}{n_{all}} \cdot \lg \frac{D_{all}}{D_{mic}} \qquad (1)

In the formula, n_mic is the number of times the candidate abbreviation occurs in the specific content, where the specific content comprises the data module documents whose model identification code is the same as that of the equipment part name, together with the search keywords in the retrieval records of those data module documents; n_all is the total number of times the candidate abbreviation occurs in all data module documents and in all retrieval records of the retrieval record database (9); D_all is the sum of the total number of data module documents and the total number of retrieval records; D_mic is the sum of the number of data module documents containing the candidate abbreviation and the number of retrieval records containing it; W_a is the weight of the candidate abbreviation and measures its ability to characterize the topic; the threshold of W_a is a given value; when the weight of the candidate abbreviation is greater than or equal to the threshold of W_a, the candidate abbreviation is regarded as a formal abbreviation and stored in the abbreviation database (7); when the weight of the candidate abbreviation is below the threshold of W_a, the candidate abbreviation is not processed;

Step 6: The first segmentation module (4) and the second segmentation module (12) perform word segmentation on the data module documents and on the user search keywords supplied by the retrieval module (11), respectively; the detailed segmentation process is:

The abbreviation database (7) is used to scan the string to be segmented, S₁; whenever an abbreviation is hit, the matched substring of S₁ is restored to its corresponding full form, until S₁ has been scanned completely; this yields the string S₂ = u₁u₂…u_i…u_m, where u_i is a single character of S₂ and m is the length of the string;

In the first segmentation module (4) and the second segmentation module (12), the string S₂ is used to build a directed acyclic graph G with m+1 nodes, numbered v₀, v₁, v₂, …, v_m, where m is the length of the string; a directed edge <v_k, v_{k+1}> is set up between every two adjacent nodes v_k and v_{k+1} (k = 0, 1, 2, …, m-1), and the word corresponding to this edge is u_{k+1}; if a directed edge directly connects two nodes of G, the distance between these two nodes is taken to be 1; if the substring h₁ = u_p u_{p+1}…u_q (1 ≤ p < q) of S₂ is a full form obtained by abbreviation restoration, a directed edge <v_{p-1}, v_q> is set up with v_{p-1} as the start node and v_q as the end node, and the word corresponding to this edge is the substring h₁;

The technical information term database (5) and the equipment part name database (6) are each matched against S₂; if there is a matching maximum-word-length substring h₂ = u_a u_{a+1}…u_b (1 ≤ a < b), no directed edge <v_{a-1}, v_b> exists between its nodes v_{a-1} and v_b, and a ≥ p+1 or b ≤ q-1 holds, then a directed edge <v_{a-1}, v_b> is set up with v_{a-1} as the start node and v_b as the end node, and the word corresponding to this edge is the maximum-word-length substring h₂;

The general vocabulary database (8) is matched against S₂; if there is a matching string h₃ = u_c u_{c+1}…u_d (1 ≤ c < d) and no directed edge <v_{c-1}, v_d> exists between its nodes v_{c-1} and v_d, a directed edge <v_{c-1}, v_d> is set up with v_{c-1} as the start node and v_d as the end node, and the word corresponding to this edge is h₃; if a directed edge <v_{c-1}, v_d> already exists between v_{c-1} and v_d and its string type is the maximum-word-length substring h₂, then h₂ also exists in the general vocabulary database (8), so its type is changed from h₂ to h₄;

Step 7: In the first segmentation module (4), the final segmentation results obtained above are stored in the respective fields of the index documents in the index database (13), and a weight is set for each field; the fields of an index document are the title field, the path field, the link text field, the subtitle field and the body field;

Step 8: The weight of each index document in the index database (13) is set, and multiple index documents form segments that finally form the index file; the index document weight is divided into a Standard Numbering System code weight and an information code weight, which are set for the different Standard Numbering System codes and information codes according to the coding characteristics of the data module documents; the Standard Numbering System code weight follows the rule that the lower the equipment hierarchy level encoded by the code, the higher the corresponding weight factor; the information code weight follows the rule that a subcategory information code receives a higher weight than a main category; the Standard Numbering System code weight and the information code weight are then multiplied to obtain the weight of the index document;

Step 9: The retrieval module (11) provides full-text retrieval to the user; it receives the user's retrieval request and invokes the query procedure, which is as follows: the user's search keywords are segmented by invoking step 6, the result is matched against the segmented content of each field of the documents in the index formed in step 7, and all matching documents are collected as the result set.

3. The retrieval method according to claim 2, characterized in that: in step 7, the title field stores the segmentation result of the data module title; terms appearing in the title field reflect the topic of the whole data module document, and the weight of the title field is set to 10.

4. The retrieval method according to claim 2, characterized in that: in step 7, the path field identifies the access path of the document and stores the data module code information to provide this path identification; the path field takes part in neither segmentation nor retrieval, and no weight needs to be set for it.

5. The retrieval method according to claim 2, characterized in that: in step 7, the link text field stores the segmentation result of the text restored from data module code links and also enables retrieval over link anchor text; when a search keyword hits in the link text field, the data module document that the link points to may be the content the user is looking for; the weight of the link text field is set to 3.

6. The retrieval method according to claim 2, characterized in that: in step 7, the subtitle field stores the segmentation result of the elements that reflect local topic information, and the weight of the subtitle field is set to 5.

7. The retrieval method according to claim 2, characterized in that: in step 7, the body field stores the segmentation result of the other technical information in the data module document, and the weight of the body field is set to 1.