CN110569420A - Search method based on chemical industry - Google Patents


Info

Publication number
CN110569420A
Authority
CN
China
Prior art keywords
data
compound
dimension
database
structural formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910779876.3A
Other languages
Chinese (zh)
Inventor
曹磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Moku Data Technology Co Ltd
Original Assignee
Shanghai Moku Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Moku Data Technology Co Ltd filed Critical Shanghai Moku Data Technology Co Ltd
Priority to CN201910779876.3A priority Critical patent/CN110569420A/en
Publication of CN110569420A publication Critical patent/CN110569420A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a search method based on the chemical industry, comprising the following steps: data storage, in which each compound to be stored is stored in a database in multiple forms; receiving an input search description and parsing it into the structural formula data of a reference compound; and pre-searching, in which, if the database contains structural formula data of a compound that matches the structural formula data of the reference compound, the ID of the target compound corresponding to the matched structural formula data is determined and fed back; otherwise, one structural formula search mode is selected and executed, the modes being accurate search, substructure search and similarity search. Because the multiple stored forms of each compound take part in the search, the search is faster; because the structural formula search offers accurate, substructure and similarity modes, it meets users' search needs and improves search quality.

Description

Search method based on chemical industry
Technical Field
The invention relates to the technical field of data search, in particular to a search method based on the chemical industry.
Background
With the development of society, search technology has spread to every kind of website: dedicated search sites, e-commerce sites and even internal company systems all depend on it. Search speed directly affects the user experience. As services grow and data accumulates over time, the volume of back-end data keeps increasing; once the data set expands from the initial one to two million records to more than ten million, the structural formula search previously built on mysql, which supports 1 to 2 million (100W-200W) records, can no longer meet current search requirements.
Moreover, search in the chemical industry differs from ordinary data search. Compounds are written in their own notations, and attributes such as a compound's original structural formula or its molecular weight may take part in a search. Existing general-purpose search methods are mostly literal text searches, which differ greatly from chemical search; the prior art neither describes how to search chemical data nor offers techniques for improving the speed of such searches.
Therefore, it is necessary to provide a search method based on the chemical industry that implements chemical search and effectively improves both the search speed and the accuracy of the search results.
Disclosure of Invention
The invention aims to provide a searching method based on the chemical industry, which realizes the searching of the chemical industry and effectively improves the searching speed and the accuracy of the searching result.
In order to solve the problems in the prior art, the invention provides a searching method based on the chemical industry, which comprises the following steps:
data storage, wherein the compound to be stored is stored in a database in various forms;
receiving an input search description, and analyzing structural formula data of a reference compound corresponding to the search description;
pre-searching is carried out, if structural formula data of the compound matched with the structural formula data of the reference compound exist in a database in the pre-searching step, the ID of the target compound corresponding to the structural formula data of the successfully matched compound is determined, and the ID of the target compound is fed back;
and if the pre-search finds no matching structural formula data in the database, selecting one structural formula search mode and searching in that mode, wherein the structural formula search modes comprise accurate search, substructure search and similarity search.
Optionally, in the chemical industry-based search method, the database used by the chemical industry-based search method includes pgsql, and the data storage includes the following steps:
storing structural formula data of the compound to be stored;
encrypting the structural formula data of the compound to generate compound encrypted data, and storing the compound encrypted data;
Analyzing structural formula data of the compound to generate data of a first dimension, data of a second dimension, data of a third dimension and data of a fourth dimension, and storing the data of the four dimensions in a database;
the fixed fingerprint value for the structural formula search is stored.
Optionally, in the chemical industry-based search method,
the data of the first dimension comprises a mol_weight attribute, which is the molecular weight of the compound;
the data of the second dimension comprises a dfp01 attribute;
the data of the third dimension comprises the following attributes: n_atoms, n_bonds, n_rings, n_C2, n_C, n_b1, n_b2, n_bar, n_r6 and n_rar.
optionally, in the chemical industry-based search method, the pre-search includes the following steps:
verifying the structural formula data of the reference compound;
standardizing the structural formula data of the reference compound that passes verification;
encrypting the standardized structural formula data of the reference compound to generate reference encrypted data;
searching the database for compound encrypted data that matches the reference encrypted data;
if the database contains matching compound encrypted data, looking up the structural formula data of the compound corresponding to the matching compound encrypted data, determining the ID of the target compound, and feeding back the ID of the target compound;
and if the database contains no matching compound encrypted data, carrying out a structural formula search in the search mode selected by the user.
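The pre-search steps above amount to a keyed lookup on a digest of the standardized structure. The following is a minimal sketch, assuming the "encryption" is an MD5 digest (as the struc table description later in this document suggests). The `normalize` function is a hypothetical stand-in for the standardization step, and the in-memory `index` dictionary stands in for the database; neither is taken from the patent.

```python
import hashlib
from typing import Optional


def normalize(mol_data: str) -> str:
    """Hypothetical standardization: trim the text and strip trailing spaces per line."""
    lines = [line.rstrip() for line in mol_data.strip().splitlines()]
    return "\n".join(lines)


def encode(mol_data: str) -> str:
    """MD5 digest of the standardized structure text, used only as a lookup key."""
    return hashlib.md5(normalize(mol_data).encode("utf-8")).hexdigest()


def pre_search(index: dict, query_mol_data: str) -> Optional[int]:
    """Return the compound ID on a digest match, or None to signal that a
    structural formula search should be run instead."""
    if not query_mol_data.strip():  # minimal verification step (assumption)
        raise ValueError("empty structural formula data")
    return index.get(encode(query_mol_data))


# Toy index mapping digest -> compound ID; the structure strings are placeholders.
index = {encode("C1=CC=CC=C1"): 42}
hit = pre_search(index, "  C1=CC=CC=C1\n")
miss = pre_search(index, "CCO")
```

A `None` result corresponds to the last step above: the caller falls back to the search mode the user selected.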
optionally, in the chemical industry-based search method, the accurate search includes the following steps:
carrying out a first analysis on the standardized structural formula data of the reference compound to obtain the third-dimension data of the reference compound;
comparing, in ascending order of compound ID, the third-dimension data of each compound in the database with the third-dimension data of the reference compound;
screening out the third-dimension data of a preset number of compounds in the database that meet the comparison requirement;
and acquiring the ID and structural formula data of each compound whose third-dimension data was screened out.
Optionally, in the chemical industry-based search method, the third-dimension data of a compound in the database is judged to meet the comparison requirement as follows:
if each attribute value in a first part of the third-dimension data of the compound in the database is greater than or equal to the corresponding attribute value of the reference compound, the first part comprising n_atoms, n_bonds, n_rings and n_C;
and each attribute value in a second part of the third-dimension data of the compound in the database is equal to the corresponding attribute value of the reference compound, the second part comprising n_C2, n_b1, n_b2, n_bar, n_r6 and n_rar;
and all attribute values in the third-dimension data of the compound in the database are greater than 0;
then the comparison requirement is met;
if any attribute in the third-dimension data of the compound in the database fails these conditions, the comparison requirement is not met.
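The comparison requirement above can be sketched as a predicate over two attribute dictionaries. This is an illustration only: the dictionary representation, the function names and the in-memory scan are assumptions, and the default limit of 500 is the example value given later in the detailed description.

```python
# Attributes compared with >= and with ==, as listed in the text above.
GE_ATTRS = ("n_atoms", "n_bonds", "n_rings", "n_C")
EQ_ATTRS = ("n_C2", "n_b1", "n_b2", "n_bar", "n_r6", "n_rar")


def meets_comparison(candidate: dict, query: dict) -> bool:
    """True if the candidate's third-dimension data meets the comparison requirement."""
    ge_ok = all(candidate[a] >= query[a] for a in GE_ATTRS)
    eq_ok = all(candidate[a] == query[a] for a in EQ_ATTRS)
    positive = all(candidate[a] > 0 for a in GE_ATTRS + EQ_ATTRS)
    return ge_ok and eq_ok and positive


def screen(rows, query: dict, limit: int = 500) -> list:
    """Scan (compound_id, stat) pairs in ascending ID order and keep the
    first `limit` IDs that meet the comparison requirement."""
    hits = []
    for cid, stat in sorted(rows, key=lambda r: r[0]):
        if meets_comparison(stat, query):
            hits.append(cid)
            if len(hits) == limit:
                break
    return hits


# Query values borrowed from the stat example in the detailed description.
query = dict(n_atoms=9, n_bonds=9, n_rings=1, n_C2=6, n_C=9,
             n_b1=3, n_b2=1, n_bar=6, n_r6=1, n_rar=1)
rows = [
    (3, {**query, "n_b2": 2}),      # n_b2 differs -> rejected
    (1, {**query, "n_atoms": 10}),  # more atoms, equal-part unchanged -> kept
    (2, dict(query)),               # identical counts -> kept
]
kept = screen(rows, query)
```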
Optionally, in the chemical industry-based search method, the search for the substructures includes the following steps:
carrying out a second analysis on the standardized structural formula data of the reference compound to obtain the second-dimension data and the third-dimension data of the reference compound;
combining the fixed fingerprint value in the database with the standardized structural formula data of the reference compound to obtain a new fingerprint value, and carrying out numerical analysis on the new fingerprint value to obtain a characteristic fingerprint value of the structural formula data of the reference compound;
and if the characteristic fingerprint value is odd, performing a single substructure search, and if the characteristic fingerprint value is even, performing a complex substructure search.
Optionally, in the chemical industry-based search method, the single substructure search includes the following steps:
comparing, in ascending order of compound ID, the second-dimension data of each compound in the database with the second-dimension data of the reference compound;
screening out the second-dimension data of a preset number of compounds in the database that meet the comparison requirement, wherein the comparison requirement is met if the value of dfp01 in the second-dimension data of the compound in the database equals the characteristic fingerprint value minus 1, and is not met otherwise;
and acquiring the ID and structural formula data of each compound whose second-dimension data was screened out.
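As a hedged illustration of the odd/even dispatch and the single substructure screen described above, the sketch below works on (compound ID, dfp01) pairs held in memory. The function names and the sample values are assumptions; only the parity rule and the "dfp01 equals the characteristic fingerprint value minus 1" test come from the text.

```python
def substructure_mode(feature_fp: int) -> str:
    """Odd characteristic fingerprint -> single search; even -> complex search."""
    return "single" if feature_fp % 2 == 1 else "complex"


def single_substructure_screen(rows, feature_fp: int, limit: int = 500) -> list:
    """Scan (compound_id, dfp01) pairs in ascending ID order and keep the
    first `limit` IDs whose dfp01 equals feature_fp - 1."""
    target = feature_fp - 1
    hits = []
    for cid, dfp01 in sorted(rows):
        if dfp01 == target:
            hits.append(cid)
            if len(hits) == limit:
                break
    return hits


# Illustrative values: 8193 is odd, so the target dfp01 is 8192.
rows = [(2, 8192), (1, 4096), (3, 8192)]
mode = substructure_mode(8193)
matches = single_substructure_screen(rows, 8193)
```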
Optionally, in the chemical industry-based search method, the complex substructure search includes the following steps:
setting a first screening condition: comparing, in ascending order of compound ID, the third-dimension data of each compound in the database with the third-dimension data of the reference compound, and screening out the third-dimension data of a preset number of compounds that meet the comparison requirement, wherein the third-dimension data of a compound in the database meets the first screening condition if every attribute value in it is greater than or equal to the corresponding attribute value of the reference compound and every attribute value in it is greater than 0, and otherwise does not meet the first screening condition;
setting a second screening condition: operating, in ascending order of compound ID, on the second-dimension data of each compound in the database together with the second-dimension data of the reference compound, and screening out the second-dimension data of a preset number of compounds that meet the operation requirement, wherein each attribute value in the second-dimension data of the compound in the database is paired one-to-one with the corresponding attribute value of the reference compound and a logical AND operation is performed on each pair; if the logical AND results for all attribute values are 1, the second-dimension data of that compound meets the second screening condition, and otherwise it does not;
and if a compound in the database meets both the first screening condition and the second screening condition, screening out the ID and structural formula data of the compound; otherwise, not screening them out.
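The two screening conditions can be sketched as follows. The text does not spell out how "the logical AND result is 1" is evaluated for multi-bit fingerprint values; this sketch assumes the common fingerprint-screen reading, in which every bit set in the reference value must also be set in the candidate value, that is `(c & q) == q`. That reading, the function names and the single-pass combination of both conditions are assumptions, not the patent's stated procedure.

```python
STAT_ATTRS = ("n_atoms", "n_bonds", "n_rings", "n_C2", "n_C",
              "n_b1", "n_b2", "n_bar", "n_r6", "n_rar")


def meets_first_condition(stat: dict, query_stat: dict) -> bool:
    """Every count is >= the reference count and strictly positive."""
    return all(stat[a] >= query_stat[a] and stat[a] > 0 for a in STAT_ATTRS)


def meets_second_condition(fp_words, query_fp_words) -> bool:
    """Assumed reading of the AND test: the candidate contains all reference bits."""
    if len(fp_words) != len(query_fp_words):
        raise ValueError("fingerprint lengths differ")
    return all((c & q) == q for c, q in zip(fp_words, query_fp_words))


def complex_substructure_screen(rows, query_stat, query_fp, limit=500):
    """rows: (compound_id, stat_dict, fingerprint_words), scanned in ascending ID order."""
    hits = []
    for cid, stat, fp in sorted(rows, key=lambda r: r[0]):
        if meets_first_condition(stat, query_stat) and meets_second_condition(fp, query_fp):
            hits.append(cid)
            if len(hits) == limit:
                break
    return hits


query_stat = dict(n_atoms=9, n_bonds=9, n_rings=1, n_C2=6, n_C=9,
                  n_b1=3, n_b2=1, n_bar=6, n_r6=1, n_rar=1)
bigger = {k: v + 1 for k, v in query_stat.items()}
rows = [
    (1, bigger, [0b1101, 0b0110]),      # superset of the reference bits -> kept
    (2, bigger, [0b0001, 0b0010]),      # missing a reference bit -> dropped
    (3, query_stat, [0b0101, 0b0010]),  # identical to the reference -> kept
]
result = complex_substructure_screen(rows, query_stat, [0b0101, 0b0010])
```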
optionally, in the chemical industry-based search method, the accurate search and the substructure search further include the following steps:
filtering the IDs and structural formula data of all the compounds obtained by screening, so that the IDs and structural formula data of the target compounds that remain better match the search description;
and sorting the target compounds, and feeding back an ID list of the sorted target compounds.
Optionally, in the chemical industry-based search method, the target compounds are sorted according to the following rules:
the target compounds are sorted in ascending order of their number of structural rings, and target compounds with the same number of structural rings are sorted in ascending order of mol_weight.
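The sorting rule is a two-key ascending sort. A minimal sketch, assuming each target compound is a dictionary carrying the num_rings and mol_weight fields of the struc table; the sample records are made up for illustration.

```python
def sort_targets(compounds: list) -> list:
    """Ascending by ring count, then ascending by molecular weight."""
    return sorted(compounds, key=lambda c: (c["num_rings"], c["mol_weight"]))


# Made-up records for illustration only.
targets = [
    {"id": 7, "num_rings": 2, "mol_weight": 180},
    {"id": 3, "num_rings": 1, "mol_weight": 150},
    {"id": 5, "num_rings": 1, "mol_weight": 120},
]
sorted_ids = [c["id"] for c in sort_targets(targets)]
```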
optionally, in the chemical industry-based search method, the similarity search includes the following steps:
carrying out a third analysis on the standardized structural formula data of the reference compound to obtain the second-dimension data, third-dimension data and fourth-dimension data of the reference compound;
performing a first screening: operating, in ascending order of compound ID, on the third-dimension data of each compound in the database together with the third-dimension data of the reference compound, screening out the third-dimension data of a preset number of compounds that meet the operation requirement, and acquiring the ID and structural formula data of each compound whose third-dimension data was screened out;
performing a second screening: filtering the IDs and structural formula data of the compounds obtained by the first screening;
sorting the IDs of the target compounds obtained by the second screening in descending order of filtering score, and, for compounds with the same filtering score, in ascending order of mol_weight;
feeding back the sorted list of IDs of the target compounds.
optionally, in the chemical industry-based search method, the first screening further includes the following steps:
selecting all compounds in the database whose attribute values in the third-dimension data are all greater than 0;
calculating a score value for the third-dimension data of each selected compound according to the formula $(A_1 - B_1)^2 + (A_2 - B_2)^2 + \dots + (A_n - B_n)^2$, where $A_1, A_2, \dots, A_n$ are the attributes in the third-dimension data of the selected compound and $B_1, B_2, \dots, B_n$ are the corresponding attributes in the third-dimension data of the reference compound;
screening out the third-dimension data of the preset number of compounds with the lowest score values, and acquiring the IDs and structural formula data of the corresponding compounds, sorted in ascending order of score value.
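The score above is a sum of squared differences between the attribute counts of a candidate and those of the reference compound. The sketch below is illustrative: it takes the attribute names from the keys of the reference dictionary so that a short two-attribute example stays readable, and the tie-break on compound ID is an assumption.

```python
def squared_distance(stat: dict, query_stat: dict) -> int:
    """Sum over attributes of (A_i - B_i)^2."""
    return sum((stat[a] - query_stat[a]) ** 2 for a in query_stat)


def first_screening(rows, query_stat: dict, limit: int) -> list:
    """Keep compounds whose attributes are all > 0, score them, and return
    the `limit` lowest-scoring (compound_id, score) pairs in ascending score order."""
    eligible = [(cid, stat) for cid, stat in rows
                if all(v > 0 for v in stat.values())]
    scored = sorted((squared_distance(stat, query_stat), cid)
                    for cid, stat in eligible)
    return [(cid, score) for score, cid in scored[:limit]]


# Two attributes only, to keep the arithmetic easy to follow.
query_stat = {"n_atoms": 9, "n_rings": 1}
rows = [
    (1, {"n_atoms": 9, "n_rings": 1}),   # score 0
    (2, {"n_atoms": 12, "n_rings": 2}),  # score 9 + 1 = 10
    (3, {"n_atoms": 10, "n_rings": 0}),  # excluded: an attribute is 0
    (4, {"n_atoms": 10, "n_rings": 1}),  # score 1
]
top = first_screening(rows, query_stat, limit=2)
```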
Optionally, in the chemical industry-based search method, filtering the IDs and structural formula data of the compounds obtained by the first screening includes the following steps:
calculating the filtering score of each compound obtained by the first screening by the formula $\mathrm{tanimoto} = (1 - \mathrm{fsim}) \cdot \mathrm{tanimoto\_s} + \mathrm{fsim} \cdot \mathrm{tanimoto\_f}$, where tanimoto represents the filtering score, fsim represents the similarity percentage input by the user, tanimoto_s represents an intermediate value calculated from the second-dimension data of the screened compound, and tanimoto_f represents an intermediate value calculated from the fourth-dimension data of the screened compound;
and filtering out the compounds whose tanimoto is less than fsim, the compounds corresponding to the remaining tanimoto values being the target compounds, and obtaining an ID list of the target compounds.
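The filtering score is a weighted blend of the two intermediate values, followed by a threshold at fsim and the ordering described for the similarity search (descending score, then ascending mol_weight). In this sketch tanimoto_s and tanimoto_f are taken as given inputs because the text does not say how they are computed, and fsim is assumed to be a fraction between 0 and 1; the sample values are chosen to be exact in binary floating point.

```python
def filter_score(fsim: float, tanimoto_s: float, tanimoto_f: float) -> float:
    """tanimoto = (1 - fsim) * tanimoto_s + fsim * tanimoto_f"""
    return (1 - fsim) * tanimoto_s + fsim * tanimoto_f


def second_screening(candidates: list, fsim: float) -> list:
    """Drop candidates scoring below fsim, then order the rest by descending
    score and, for equal scores, ascending mol_weight. Returns compound IDs."""
    scored = [(filter_score(fsim, c["tanimoto_s"], c["tanimoto_f"]), c)
              for c in candidates]
    kept = [(t, c) for t, c in scored if t >= fsim]
    kept.sort(key=lambda tc: (-tc[0], tc[1]["mol_weight"]))
    return [c["id"] for _, c in kept]


# Made-up candidates for illustration only.
candidates = [
    {"id": "A", "mol_weight": 120, "tanimoto_s": 1.0, "tanimoto_f": 0.5},    # 0.75
    {"id": "B", "mol_weight": 90, "tanimoto_s": 0.25, "tanimoto_f": 0.5},    # 0.375, dropped
    {"id": "C", "mol_weight": 100, "tanimoto_s": 0.75, "tanimoto_f": 0.75},  # 0.75
    {"id": "D", "mol_weight": 200, "tanimoto_s": 1.0, "tanimoto_f": 1.0},    # 1.0
]
ranked = second_screening(candidates, fsim=0.5)
```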
In the chemical industry-based search method provided by the invention, each compound to be stored is stored in the database in multiple forms, so that these forms all take part in the search and the search is faster. The structural formula search offers three modes (accurate search, substructure search and similarity search), so the user obtains accurate results, substructure results or similarity results as required, which meets the user's search needs and improves search quality. The pre-search checks and adjusts the format of the search description so that it can take part in the search, and when the pre-search succeeds the search is faster still.
Drawings
Fig. 1 is a flowchart of a chemical industry-based search method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a chemical industry-based pre-search provided by an embodiment of the present invention;
Fig. 3 is a flowchart of a chemical industry-based accurate search provided by an embodiment of the present invention;
Fig. 4 is a flowchart of a chemical industry-based substructure search provided by an embodiment of the present invention;
Fig. 5 is a flowchart of a chemical industry-based similarity search provided by an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in more detail with reference to the schematic drawings. The advantages and features of the present invention will become more apparent from the following description. Note that the drawings are in a very simplified form and are not drawn to precise scale; they serve only to illustrate the embodiments of the present invention conveniently and clearly.
Hereinafter, if the method described herein comprises a series of steps, the order of such steps presented herein is not necessarily the only order in which such steps may be performed, and some of the described steps may be omitted and/or some other steps not described herein may be added to the method.
At present, the prior art neither describes how to search chemical data nor offers techniques for improving the speed of such searches. It is therefore necessary to provide a chemical industry-based search method. As shown in fig. 1, which is a flowchart of a chemical industry-based search method provided in an embodiment of the present invention, the method includes the following steps:
S1: data storage, wherein the compound to be stored is stored in a database in various forms;
S2: receiving an input search description, and parsing out the structural formula data of a reference compound corresponding to the search description;
S3: pre-searching, wherein, if the database contains structural formula data of a compound that matches the structural formula data of the reference compound, the ID of the target compound corresponding to the matched structural formula data is determined, and the ID of the target compound is fed back;
S4: if the pre-search finds no matching structural formula data in the database, selecting one structural formula search mode and searching in that mode, wherein the structural formula search modes comprise accurate search, substructure search and similarity search.
The method has the following advantages. Each compound to be stored is stored in the database in multiple forms, so that these forms all take part in the search and the search is faster. The structural formula search offers three modes (accurate search, substructure search and similarity search), so the user obtains accurate results, substructure results or similarity results as required, which meets the user's search needs and improves search quality. The pre-search checks and adjusts the format of the search description so that it can take part in the search, and when the pre-search succeeds the search is faster still.
Furthermore, the database adopted by the invention can be any of various databases such as pgsql or mysql; because pgsql can handle data volumes on the order of ten million records (1000W), pgsql is preferably adopted to improve the search speed.
preferably, the data storage in the present invention includes the following steps:
storing structural formula data of the compound to be stored, wherein the structural formula data can be defined as molData of the compound in a database;
encrypting the structural formula data of the compound to generate compound encrypted data, and storing the compound encrypted data, wherein the compound encrypted data can be defined as molEnCode of the compound in a database;
storing a fixed fingerprint value for the structural formula search, which may be a specific character string;
analyzing the structural formula data of the compound to generate several tables in the database: the data of the first dimension, for which the table name may be struc; the data of the second dimension, for which the table name may be cfp; the data of the third dimension, for which the table name may be stat; and the data of the fourth dimension, for which the table name may be fgb.
Further, for example, the data tables of the above four dimensions are as follows:
struc table:
Field       Type     Description
id          int(11)  PK
molData     text     Raw structural formula data
molEncode   text     Raw structural formula data after encryption (MD5)
smiles      text
formula     text
inchi       text
mol_weight  int(11)  Molecular weight of the compound
num_rings   int(11)
cfp table:
Field  Type        Description
id     int(11)     PK
dfp01  bigint(20)
hfp01  bigint(20)
hfp02  bigint(20)
hfp03  bigint(20)
hfp04  bigint(20)
hfp05  bigint(20)
hfp06  bigint(20)
hfp07  bigint(20)
hfp08  bigint(20)
hfp09  bigint(20)
hfp10  bigint(20)
hfp11  bigint(20)
hfp12  bigint(20)
hfp13  bigint(20)
hfp14  bigint(20)
hfp15  bigint(20)
hfp16  bigint(20)
stat table:
fgb table:
Field  Type        Description
id     int(11)     PK
fg01   bigint(20)
fg02   bigint(20)
fg03   bigint(20)
fg04   bigint(20)
fg05   bigint(20)
fg06   bigint(20)
fg07   bigint(20)
fg08   bigint(20)
Furthermore, the structural formula data of the compound is analyzed to generate a plurality of tables in the database, and the analysis method can be as follows:
The execution command checkMol -aXbHs is run on the molData of a compound in the database, generating three groups of data: data0, data1 and data2. The execution command matchMol is run on the character string formed by combining the molData of the compound with the fixed fingerprint value, generating data3. Then stat = data0, fgb = data1 and cfp = data2 + data3. The struc data is generated by running the execution command obprop on the molData of the compound.
In one embodiment, molData of compounds in the database is parsed by checkMol into array data like the following:
stat structure: n_atoms: 9; n_bonds: 9; n_rings: 1; n_C2: 6; n_C: 9; n_b1: 3; n_b2: 1; n_bar: 6; n_r6: 1; n_rar: 1;
fgb structure: 0, 0, 0, 0, 0, 0, 256, 0;
During fgb parsing, the value of a_sum_fg, which later takes part in the search, may also be parsed.
The execution command matchMol is run on the character string formed by combining the molData of the compound in the database with the fixed fingerprint value to generate data3; cfp = data2 + data3, and the resulting cfp structure may be array data such as the following:
dfp01: 8192; hfp01~hfp16: 0, 0, 64, 8388608, 0, 2048, 0, 256, 0, 64, 8, 4194304, 4096, 0, 2151677952, 0.
After the compound to be stored has been stored in the database in its various forms, the following step may also be performed: the substructure search data and the similarity search data of each compound are calculated and stored in a data structure for that compound. The data structure for a compound may include: ID. In the substructure search and the similarity search, if the searched compound is already stored in the database, its substructure search data or similarity search data can be retrieved directly, which saves search time and improves search speed.
Specifically, as shown in fig. 2, which is a flowchart of a chemical industry-based pre-search provided in an embodiment of the present invention, the pre-search of the chemical industry-based search method includes the following steps:
Step one: verifying the structural formula data of the reference compound (which can be defined as the molData of the reference compound);
Step two: standardizing the structural formula data of the reference compound that passes verification;
Step three: encrypting the standardized structural formula data of the reference compound to generate reference encrypted data (which can be defined as the molEnCode of the reference compound);
Step four: searching the database for compound encrypted data that matches the reference encrypted data; depending on the user's requirement, the pre-search can serve the accurate search, the substructure search or the similarity search:
if the user selects accurate search and the database contains matching compound encrypted data, the target compound corresponding to the matching compound encrypted data is looked up, the ID of the target compound is determined, and the ID of the target compound is fed back;
if the user selects substructure search and the database contains matching compound encrypted data, the IDs in the substruct of the compound data structure corresponding to the matching compound encrypted data are looked up, and all the IDs in substruct are fed back;
if the user selects similarity search and the database contains matching compound encrypted data, the IDs in the similar of the compound data structure corresponding to the matching compound encrypted data are looked up, and all the IDs in similar are fed back;
and if the database contains no matching compound encrypted data, a structural formula search is carried out in the search mode selected by the user, wherein the structural formula search modes comprise accurate search, substructure search and similarity search.
preferably, as shown in fig. 3, fig. 3 is a flowchart of an accurate search based on the chemical industry provided by the embodiment of the present invention, where the accurate search includes the following steps:
Step one: perform a first analysis (checkmol-axs) on the normalized structural formula data of the input compound to obtain the third-dimension data of the input compound;
Step two: compare the third-dimension data of every compound in the database with the third-dimension data of the input compound, in ascending order of the compound IDs in the database;
Step three: screen out the third-dimension data of a preset number of compounds in the database that meet the comparison requirement. The preset number may be 500, that is, the third-dimension data of the first 500 compounds, in ascending order of compound ID, that meet the comparison requirement;
A compound in the database meets the comparison requirement when all of the following conditions hold:
each attribute value in the first group of its third-dimension data is greater than or equal to the corresponding attribute value of the input compound, where the first group comprises n_atoms, n_bonds, n_rings and n_C;
each attribute value in the second group of its third-dimension data is equal to the corresponding attribute value of the input compound, where the second group comprises n_C2, n_b1, n_b2, n_bar, n_r6 and n_rar;
and every attribute value in its third-dimension data is greater than 0.
If any attribute in the third-dimension data of the compound fails its condition, the comparison requirement is not met.
Step four: acquire the IDs and structural formula data of the compounds corresponding to all the screened third-dimension data;
Step five: filter the IDs and structural formula data of all the screened compounds, so that the IDs and structural formula data of the remaining target compounds match the search description more closely;
Step six: sort the target compounds by ascending number of structural rings and, where the number of structural rings is equal, by ascending mol_weight; finally, feed back the sorted ID list of the target compounds.
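The comparison requirement and the final ordering of the exact search can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the row layout `(compound_id, attrs, mol_weight)`, the function names, and the use of `n_rings` as the "number of structural rings" are assumptions, and the filtering of step five is omitted because the text does not specify it.

```python
# Attribute groups named in the description of the comparison requirement.
GE_ATTRS = ("n_atoms", "n_bonds", "n_rings", "n_C")                # database value >= input value
EQ_ATTRS = ("n_C2", "n_b1", "n_b2", "n_bar", "n_r6", "n_rar")      # database value == input value


def meets_exact_requirement(candidate, query):
    """Return True if a database compound meets the comparison requirement.

    `candidate` and `query` are dicts of third-dimension attributes
    (hypothetical layout chosen for this sketch).
    """
    ge_ok = all(candidate[a] >= query[a] for a in GE_ATTRS)
    eq_ok = all(candidate[a] == query[a] for a in EQ_ATTRS)
    # The description requires every attribute value to be greater than 0.
    positive = all(candidate[a] > 0 for a in GE_ATTRS + EQ_ATTRS)
    return ge_ok and eq_ok and positive


def exact_search(rows, query, limit=500):
    """rows: iterable of (compound_id, attrs, mol_weight) tuples (assumed layout)."""
    hits = []
    for cid, attrs, mol_weight in sorted(rows, key=lambda r: r[0]):  # ascending ID
        if meets_exact_requirement(attrs, query):
            hits.append((cid, attrs, mol_weight))
            if len(hits) == limit:  # stop after the preset number of hits
                break
    # Step six: ascending ring count, ties broken by ascending mol_weight.
    # Assumption: n_rings is the "number of structural rings".
    hits.sort(key=lambda h: (h[1]["n_rings"], h[2]))
    return [h[0] for h in hits]
```

Because the scan stops at the first `limit` matches in ID order, the result depends on ID order as well as on the sort applied afterwards.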
Preferably, as shown in FIG. 4, which is a flowchart of the chemical-industry-based substructure search provided in the embodiment of the present invention, the substructure search comprises the following steps:
Step one: perform a second analysis (checkMol-axHs) on the normalized structural formula data of the input compound to obtain the second-dimension data and the third-dimension data of the input compound;
Step two: combine the fixed fingerprint value stored in the database with the normalized structural formula data of the input compound to obtain a new fingerprint value, and perform numerical analysis on the new fingerprint value to obtain the characteristic fingerprint value of the structural formula data of the input compound;
Step three: if the characteristic fingerprint value is odd, perform a single substructure search; if it is even, perform a complex substructure search;
Step four: carry out the selected search as follows.
The single substructure search comprises the following steps:
compare the second-dimension data of every compound in the database with the second-dimension data of the input compound, in ascending order of the compound IDs in the database;
screen out the second-dimension data of a preset number of compounds in the database that meet the comparison requirement. The preset number may be 500, that is, the second-dimension data of the first 500 compounds, in ascending order of compound ID, that meet the comparison requirement. A compound meets the comparison requirement if the value of dfp01 in its second-dimension data equals the characteristic fingerprint value minus 1; otherwise it does not;
acquire the IDs and structural formula data of the compounds corresponding to all the screened second-dimension data.
The complex substructure search comprises the following steps:
set a first screening condition: in ascending order of the compound IDs in the database, compare the third-dimension data of every compound in the database with the third-dimension data of the input compound, and screen out the third-dimension data of a preset number of compounds that meet the comparison requirement, where the preset number may be greater than or equal to 500. A compound's third-dimension data meet the first screening condition if every attribute value is greater than or equal to the corresponding attribute value of the input compound and every attribute value is greater than 0; otherwise they do not;
set a second screening condition: in ascending order of the compound IDs in the database, combine the second-dimension data of each compound in the database with the second-dimension data of the input compound in an operation, and screen out the second-dimension data of the compounds that meet the operation requirement, where the preset number may be greater than or equal to 500. Each attribute value in the second-dimension data of the database compound is paired with the corresponding attribute value of the input compound and the pair is combined by a logical AND; for example, hfp01 of the database compound is ANDed with hfp01 of the input compound. If the logical AND results for all attributes, from dfp01 through hfp16, are all 1, the compound's second-dimension data meet the second screening condition; otherwise they do not;
the complex substructure search screens out, in ascending order of the compound IDs in the database, 500 compounds that meet the first screening condition and the second screening condition simultaneously, and acquires the IDs and structural formula data of those 500 compounds; a compound that does not meet both conditions simultaneously is not screened out.
Step five: filter the IDs and structural formula data of all the compounds obtained by the single substructure search or the complex substructure search, so that the IDs and structural formula data of the remaining target compounds match the search description more closely;
Step six: sort the target compounds by ascending number of structural rings and, where the number of structural rings is equal, by ascending mol_weight; finally, feed back the sorted ID list of the target compounds.
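The branching between the single and complex substructure searches can be illustrated with a short Python sketch. It is a hedged illustration, not the patented implementation: the row layout `(compound_id, stats, fingerprint)` and all function names are hypothetical. In particular, the text only says the logical AND results must be "all 1"; the sketch interprets this as the conventional fingerprint containment test `(candidate & query) == query`, which is an assumption.

```python
# Second-dimension attribute names: dfp01 plus hfp01 .. hfp16, as named in the text.
FP_ATTRS = ["dfp01"] + [f"hfp{i:02d}" for i in range(1, 17)]


def make_fp(**bits):
    """Build a fingerprint dict with every attribute defaulting to 0 (helper for this sketch)."""
    fp = dict.fromkeys(FP_ATTRS, 0)
    fp.update(bits)
    return fp


def choose_mode(feature_fp):
    """Step three: odd characteristic fingerprint -> single search, even -> complex search."""
    return "single" if feature_fp % 2 == 1 else "complex"


def single_match(cand_fp, feature_fp):
    """Single substructure search: dfp01 must equal the characteristic fingerprint minus 1."""
    return cand_fp["dfp01"] == feature_fp - 1


def first_condition(cand_stats, query_stats):
    """Every third-dimension value is >= the input value and greater than 0."""
    return all(cand_stats[a] >= query_stats[a] and cand_stats[a] > 0 for a in query_stats)


def second_condition(cand_fp, query_fp):
    """ASSUMPTION: 'AND results are all 1' is read as bitwise containment of the query bits."""
    return all((cand_fp[a] & query_fp[a]) == query_fp[a] for a in FP_ATTRS)


def substructure_search(rows, query_stats, query_fp, feature_fp, limit=500):
    """rows: iterable of (compound_id, stats, fingerprint) tuples (assumed layout)."""
    mode = choose_mode(feature_fp)
    hits = []
    for cid, stats, fp in sorted(rows, key=lambda r: r[0]):  # ascending ID
        if mode == "single":
            ok = single_match(fp, feature_fp)
        else:
            ok = first_condition(stats, query_stats) and second_condition(fp, query_fp)
        if ok:
            hits.append(cid)
            if len(hits) == limit:
                break
    return hits
```

The filtering and sorting of steps five and six would then be applied to the returned IDs.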
Preferably, as shown in FIG. 5, which is a flowchart of the chemical-industry-based similarity search provided in the embodiment of the present invention, the similarity search comprises the following steps:
Step one: perform a third analysis (checkmol-axxHbs) on the normalized structural formula data of the input compound to obtain the second-dimension data, the third-dimension data and the fourth-dimension data of the input compound, together with the value of fg00;
Step two: perform a first screening: in ascending order of the compound IDs in the database, combine the third-dimension data of each compound in the database with the third-dimension data of the input compound in an operation, and screen out the third-dimension data of a preset number of compounds that meet the operation requirement. The preset number may be 500, that is, the third-dimension data of the first 500 compounds that meet the operation requirement;
Specifically, select all compounds in the database whose third-dimension attribute values are all greater than 0, and calculate the score of the third-dimension data of each selected compound with the formula $(A_1-B_1)^2+(A_2-B_2)^2+\dots+(A_n-B_n)^2$, where $A_1, A_2, \dots, A_n$ are all the attributes in the third-dimension data of the selected compound and $B_1, B_2, \dots, B_n$ are all the attributes in the third-dimension data of the input compound. Screen out the third-dimension data of the 500 compounds with the lowest scores, and acquire the IDs and structural formula data of the corresponding compounds, sorted in ascending order of score.
Step three: obtain the IDs and structural formula data of the sorted compounds;
Step four: perform a second screening, which filters the IDs and structural formula data of the compounds obtained by the first screening;
Specifically, the filtering comprises the following steps: calculate the filtering score of each compound obtained by the first screening with the formula tanimoto = (1 - fsim) * tanimoto_s + fsim * tanimoto_f, where tanimoto is the filtering score, fsim is the similarity percentage input by the user, tanimoto_s is an intermediate value calculated from the second-dimension data of the screened compound, and tanimoto_f is an intermediate value calculated from the fourth-dimension data of the screened compound; then discard the compounds whose tanimoto is less than fsim, take the compounds corresponding to the remaining tanimoto values as the target compounds, and obtain the IDs and structural formula data of the target compounds.
tanimoto_s is an intermediate value derived from the second-dimension data (the cfp table) and can be calculated as follows:
when a_sum + b_sum - c_sum > 0,
tanimoto_s = c_sum / (a_sum + b_sum - c_sum);
when a_sum + b_sum - c_sum is less than or equal to 0, tanimoto_s = 0.
Here,
a_sum = Long.bitCount(dfp01_result) + Long.bitCount(hfp01_result) + ... + Long.bitCount(hfp16_result);
b_sum = Long.bitCount(cfp.dfp01) + Long.bitCount(cfp.hfp01) + ... + Long.bitCount(cfp.hfp16);
c_sum = Long.bitCount(cfp.dfp01 & dfp01_result) + Long.bitCount(cfp.hfp01 & hfp01_result) + ... + Long.bitCount(cfp.hfp16 & hfp16_result).
Furthermore, cfp.dfp01 and cfp.hfp01 ... cfp.hfp16 are the attribute values in the cfp table of the compound in the database; dfp01_result is the characteristic fingerprint value of the structural formula data of the input compound, obtained by combining the fixed fingerprint value in the database with the normalized structural formula data of the input compound into a new fingerprint value and performing numerical analysis (matchmol-Fs) on that value; hfp01_result ... hfp16_result are the attribute values in the cfp table for the input compound.
tanimoto_f is an intermediate value derived from the fourth-dimension data (the fgb table) and can be calculated as follows:
when a_sum_fg + b_sum_fg - c_sum_fg > 0,
tanimoto_f = c_sum_fg / (a_sum_fg + b_sum_fg - c_sum_fg);
when a_sum_fg + b_sum_fg - c_sum_fg is less than or equal to 0, tanimoto_f = 0.
Here, a_sum_fg is the value of fgb.fg00 obtained by analyzing the structural formula data of the input compound;
b_sum_fg += Long.bitCount(fgb.fg01) + ... + Long.bitCount(fgb.fg08);
c_sum_fg += Long.bitCount(fgb.fg01 & fg01_result) + ... + Long.bitCount(fgb.fg08 & fg08_result).
Furthermore, fgb.fg01 ... fgb.fg08 are the attribute values in the fgb table of the compound in the database, and fg01_result ... fg08_result are the attribute values in the fgb table for the input compound.
Step five: sort the IDs of the target compounds obtained by the second screening in descending order of filtering score and, for compounds with the same filtering score, in ascending order of mol_weight;
Step six: feed back the sorted ID list of the target compounds.
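The two screening stages of the similarity search can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the applicant's code: the dict layouts and function names are hypothetical, fsim is assumed to be a fraction between 0 and 1, and `popcount` imitates Java's `Long.bitCount` by masking to 64 bits.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def popcount(x):
    """Count set bits in the low 64 bits, mirroring Java's Long.bitCount."""
    return bin(x & MASK64).count("1")


def squared_distance(cand, query):
    """First-screening score: sum of squared differences over the third-dimension attributes."""
    return sum((cand[a] - query[a]) ** 2 for a in query)


def first_screening(rows, query, limit=500):
    """rows: iterable of (compound_id, stats). Returns IDs in ascending order of score."""
    scored = [
        (squared_distance(stats, query), cid)
        for cid, stats in rows
        if all(v > 0 for v in stats.values())  # only compounds whose attributes are all > 0
    ]
    scored.sort()
    return [cid for _, cid in scored[:limit]]


def tanimoto_ratio(a_sum, b_sum, c_sum):
    """c / (a + b - c), or 0 when the denominator is not positive."""
    denom = a_sum + b_sum - c_sum
    return c_sum / denom if denom > 0 else 0.0


def tanimoto_s(cand_cfp, query_cfp):
    """Intermediate value from the second-dimension (cfp) fingerprints."""
    a_sum = sum(popcount(query_cfp[k]) for k in query_cfp)
    b_sum = sum(popcount(cand_cfp[k]) for k in query_cfp)
    c_sum = sum(popcount(cand_cfp[k] & query_cfp[k]) for k in query_cfp)
    return tanimoto_ratio(a_sum, b_sum, c_sum)


def tanimoto_f(cand_fgb, query_fgb, query_fg00):
    """Intermediate value from the fourth-dimension (fgb) data; a_sum_fg is the input fg00."""
    b_sum = sum(popcount(cand_fgb[k]) for k in query_fgb)
    c_sum = sum(popcount(cand_fgb[k] & query_fgb[k]) for k in query_fgb)
    return tanimoto_ratio(query_fg00, b_sum, c_sum)


def filter_score(fsim, t_s, t_f):
    """tanimoto = (1 - fsim) * tanimoto_s + fsim * tanimoto_f."""
    return (1 - fsim) * t_s + fsim * t_f


def second_screening(scored, fsim):
    """scored: list of (compound_id, tanimoto, mol_weight). Drop scores below fsim, then sort."""
    kept = [s for s in scored if s[1] >= fsim]
    kept.sort(key=lambda s: (-s[1], s[2]))  # descending score, then ascending mol_weight
    return [s[0] for s in kept]
```

Note that fsim plays two roles in the described method: it weights the functional-group term in the filtering score and it also serves as the cut-off threshold.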
For the target compound ID fed back by the pre-search, or the list of target compound IDs fed back by the structural formula search, a user who wants the commodities or other data corresponding to those IDs performs one more search using the returned compound IDs, which retrieves all the commodities or other data corresponding to each target compound.
Preferably, in the present invention, when a search returns no result, the "no result" case is recorded and analyzed; if many similar "no result" cases occur and they all stem from the same cause, for example the search logic and/or the data, the search logic and/or the data are adjusted accordingly.
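One way such a record of "no result" cases might be analyzed is sketched below in Python. The log format, the cause labels and the threshold are all hypothetical choices made for illustration; the text does not specify how the causes are determined or how many similar cases count as "many".

```python
from collections import Counter


def causes_needing_adjustment(no_result_log, threshold=10):
    """Return the causes that recur at least `threshold` times.

    no_result_log: iterable of (search_description, cause) pairs, where `cause`
    is a label such as "search logic" or "data" (hypothetical labels).
    """
    counts = Counter(cause for _description, cause in no_result_log)
    return sorted(cause for cause, n in counts.items() if n >= threshold)
```

A cause returned by this function would be a candidate for adjusting the search logic and/or the data.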
In summary, in the chemical-industry-based search method provided by the invention, each compound to be stored is stored in the database in multiple forms, so that the multiple forms of the compound can take part in the search and the search is accelerated. The structural formula search offers three modes, namely exact search, substructure search and similarity search, so that the user can obtain exact, substructure or similarity results as required, which improves search quality. The pre-search checks and adjusts the format of the search description so that it can take part in the search, and further accelerates the search whenever the pre-search succeeds.
The above description is only a preferred embodiment of the present invention, and does not limit the present invention in any way. It will be understood by those skilled in the art that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (14)

1. A search method based on the chemical industry, characterized by comprising the following steps:
data storage, wherein each compound to be stored is stored in a database in multiple forms;
receiving an input search description, and parsing out the structural formula data of the input compound corresponding to the search description;
performing a pre-search, wherein if the database contains structural formula data of a compound that matches the structural formula data of the input compound, the ID of the target compound corresponding to the matched structural formula data is determined and fed back;
and if the pre-search finds no matching structural formula data in the database, selecting one search mode of a structural formula search for searching, wherein the search modes of the structural formula search comprise exact search, substructure search and similarity search.
2. The chemical industry based search method of claim 1, wherein the database used by the method comprises pgsql, and the data storage comprises the following steps:
storing the structural formula data of the compound to be stored;
encrypting the structural formula data of the compound to generate compound encrypted data, and storing the compound encrypted data;
analyzing the structural formula data of the compound to generate data of a first dimension, data of a second dimension, data of a third dimension and data of a fourth dimension, and storing the data of the four dimensions in the database;
storing the fixed fingerprint value used for the structural formula search.
3. The chemical industry based search method of claim 2, wherein
the data of the first dimension comprises a mol_weight attribute, which is the molecular weight of the compound;
the data of the second dimension comprises a dfp01 attribute;
the data of the third dimension comprises the following attributes: n_atoms, n_bonds, n_rings, n_C2, n_C, n_b1, n_b2, n_bar, n_r6 and n_rar.
4. The chemical industry based search method of claim 3, wherein the pre-search comprises the following steps:
verifying the structural formula data of the input compound;
normalizing the structural formula data of the input compound that passes verification;
encrypting the normalized structural formula data of the input compound to generate input encrypted data;
searching the database for compound encrypted data that matches the input encrypted data;
if the database contains matching compound encrypted data, looking up the structural formula data of the compound corresponding to the matched compound encrypted data, determining the ID of the target compound, and feeding back the ID of the target compound;
and if the database contains no matching compound encrypted data, performing the structural formula search in the search mode selected by the user.
5. The chemical industry based search method of claim 4, wherein the exact search comprises the following steps:
performing a first analysis on the normalized structural formula data of the input compound to obtain the third-dimension data of the input compound;
comparing the third-dimension data of every compound in the database with the third-dimension data of the input compound, in ascending order of the compound IDs in the database;
screening out the third-dimension data of a preset number of compounds in the database that meet the comparison requirement;
and acquiring the IDs and structural formula data of the compounds corresponding to all the screened third-dimension data.
6. The chemical industry based search method of claim 5, wherein the third-dimension data of a compound in the database is judged to meet the comparison requirement as follows:
if each attribute value in the first group of the third-dimension data of the compound in the database is greater than or equal to the corresponding attribute value in the third-dimension data of the input compound, wherein the first group comprises n_atoms, n_bonds, n_rings and n_C;
and each attribute value in the second group of the third-dimension data of the compound in the database is equal to the corresponding attribute value in the third-dimension data of the input compound, wherein the second group comprises n_C2, n_b1, n_b2, n_bar, n_r6 and n_rar;
and all attribute values in the third-dimension data of the compound in the database are greater than 0;
then the comparison requirement is met;
if any attribute in the third-dimension data of the compound in the database fails its condition, the comparison requirement is not met.
7. The chemical industry based search method of claim 4, wherein the substructure search comprises the following steps:
performing a second analysis on the normalized structural formula data of the input compound to obtain the second-dimension data and the third-dimension data of the input compound;
combining the fixed fingerprint value in the database with the normalized structural formula data of the input compound to obtain a new fingerprint value, and performing numerical analysis on the new fingerprint value to obtain the characteristic fingerprint value of the structural formula data of the input compound;
and if the characteristic fingerprint value is odd, performing a single substructure search, and if the characteristic fingerprint value is even, performing a complex substructure search.
8. The chemical industry based search method of claim 7, wherein the single substructure search comprises the following steps:
comparing the second-dimension data of every compound in the database with the second-dimension data of the input compound, in ascending order of the compound IDs in the database;
screening out the second-dimension data of a preset number of compounds in the database that meet the comparison requirement, wherein the comparison requirement is met if the value of dfp01 in the second-dimension data of the compound in the database equals the characteristic fingerprint value minus 1, and is otherwise not met;
and acquiring the IDs and structural formula data of the compounds corresponding to all the screened second-dimension data.
9. The chemical industry based search method of claim 7, wherein the complex substructure search comprises the following steps:
setting a first screening condition, comparing the third-dimension data of every compound in the database with the third-dimension data of the input compound in ascending order of the compound IDs in the database, and screening out the third-dimension data of a preset number of compounds that meet the comparison requirement, wherein the third-dimension data of a compound in the database meets the first screening condition if all of its attribute values are greater than or equal to the corresponding attribute values in the third-dimension data of the input compound and all of its attribute values are greater than 0, and otherwise does not meet the first screening condition;
setting a second screening condition, combining the second-dimension data of each compound in the database with the second-dimension data of the input compound in an operation in ascending order of the compound IDs in the database, and screening out the second-dimension data of a preset number of compounds that meet the operation requirement, wherein each attribute value in the second-dimension data of the compound in the database is paired with the corresponding attribute value in the second-dimension data of the input compound and combined by a logical AND, and the second-dimension data of the compound meets the second screening condition if the logical AND results for all attribute values are 1, and otherwise does not meet the second screening condition;
and if a compound in the database meets the first screening condition and the second screening condition simultaneously, screening out the ID and structural formula data of the compound, and otherwise not screening out the ID and structural formula data of the compound.
10. The chemical industry based search method of any one of claims 6, 8 or 9, wherein the exact search and the substructure search further comprise the following steps:
filtering the IDs and structural formula data of all the screened compounds, so that the IDs and structural formula data of the remaining target compounds match the search description more closely;
and sorting the target compounds, and feeding back the sorted ID list of the target compounds.
11. The chemical industry based search method of claim 10, wherein the target compounds are sorted according to the following rule:
the target compounds are sorted by ascending number of structural rings and, where the number of structural rings is equal, by ascending mol_weight.
12. The chemical industry based search method of claim 4, wherein the similarity search comprises the following steps:
performing a third analysis on the normalized structural formula data of the input compound to obtain the second-dimension data, the third-dimension data and the fourth-dimension data of the input compound;
performing a first screening, combining the third-dimension data of each compound in the database with the third-dimension data of the input compound in an operation in ascending order of the compound IDs in the database, screening out the third-dimension data of a preset number of compounds that meet the operation requirement, and acquiring the IDs and structural formula data of the compounds corresponding to all the screened third-dimension data;
performing a second screening, which filters the IDs and structural formula data of the compounds obtained by the first screening;
sorting the IDs of the target compounds obtained by the second screening in descending order of filtering score and, for compounds with the same filtering score, in ascending order of mol_weight;
and feeding back the sorted ID list of the target compounds.
13. The chemical industry based search method of claim 12, wherein the first screening further comprises the following steps:
selecting all compounds in the database whose attribute values in the third-dimension data are all greater than 0;
calculating the score of the third-dimension data of each selected compound according to the formula $(A_1-B_1)^2+(A_2-B_2)^2+\dots+(A_n-B_n)^2$, wherein $A_1, A_2, \dots, A_n$ are all the attributes in the third-dimension data of the selected compound, and $B_1, B_2, \dots, B_n$ are all the attributes in the third-dimension data of the input compound;
and screening out the third-dimension data of the preset number of compounds with the lowest scores, and acquiring the IDs and structural formula data of the compounds corresponding to all the screened third-dimension data, wherein the IDs and structural formula data of the compounds are sorted in ascending order of score.
14. The chemical industry based search method of claim 13, wherein the IDs and structural formula data of the compounds obtained by the first screening are filtered, and the filtering comprises the following steps:
calculating the filtering score of each compound obtained by the first screening according to the formula tanimoto = (1 - fsim) * tanimoto_s + fsim * tanimoto_f, wherein tanimoto is the filtering score, fsim is the similarity percentage input by the user, tanimoto_s is an intermediate value calculated from the second-dimension data of the screened compound, and tanimoto_f is an intermediate value calculated from the fourth-dimension data of the screened compound;
and discarding the compounds whose tanimoto is less than fsim, taking the compounds corresponding to the remaining tanimoto values as the target compounds, and obtaining the ID list of the target compounds.
CN201910779876.3A 2019-08-22 2019-08-22 Search method based on chemical industry Pending CN110569420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910779876.3A CN110569420A (en) 2019-08-22 2019-08-22 Search method based on chemical industry

Publications (1)

Publication Number Publication Date
CN110569420A true CN110569420A (en) 2019-12-13

Family

ID=68774112

Country Status (1)

Country Link
CN (1) CN110569420A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113903410A (en) * 2021-12-08 2022-01-07 成都健数科技有限公司 Compound retrieval method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040006559A1 (en) * 2002-05-29 2004-01-08 Gange David M. System, apparatus, and method for user tunable and selectable searching of a database using a weigthted quantized feature vector
CN102929907A (en) * 2012-08-17 2013-02-13 上海泰坦科技有限公司 Hand-drawn type chemical molecular structural formula searching method
WO2013038698A1 (en) * 2011-09-14 2013-03-21 独立行政法人産業技術総合研究所 Search system, search method, and program
US20140201194A1 (en) * 2013-01-17 2014-07-17 Vidyasagar REDDY Systems and methods for searching data structures of a database
CN104750761A (en) * 2013-12-31 2015-07-01 上海致化化学科技有限公司 Method for creating molecular structure databases and method for searching same
CN106168982A (en) * 2016-08-03 2016-11-30 成都四象联创科技有限公司 Data retrieval method for particular topic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
侯博议等: "无监督的中文商品属性结构化方法", 《软件学报》 *
尹文科等: "基于Skyline的搜索结果排序方法", 《计算机应用》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191213