CN110619067A - Industry classification-based retrieval method and retrieval device and readable storage medium - Google Patents

Industry classification-based retrieval method and retrieval device and readable storage medium

Info

Publication number
CN110619067A
Authority
CN
China
Prior art keywords
industry classification
industry
occurrence
preset
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910806758.7A
Other languages
Chinese (zh)
Inventor
许赵云
许明峰
胡新平
陈明忠
毛瑞彬
赵剑
宋娜
李爱文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN STOCK EXCHANGE
Original Assignee
SHENZHEN STOCK EXCHANGE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN STOCK EXCHANGE
Priority to CN201910806758.7A
Publication of CN110619067A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an industry classification-based retrieval method, a retrieval device and a readable storage medium. The retrieval method comprises the following steps: acquiring industry classification files related to industry classification; performing co-occurrence processing on the contents of the industry classification files to obtain co-occurrence keywords; updating preset industry classification words according to the co-occurrence keywords; and reclassifying the associated company information according to the updated preset industry classification words, so that, after a retrieval request is received, the corresponding company information is acquired according to the preset industry classification words corresponding to the retrieval keywords. The invention thereby solves the prior-art problem that retrieval data for emerging industries is incomplete and inaccurate.

Description

Industry classification-based retrieval method and retrieval device and readable storage medium
Technical Field
The invention relates to the technical field of information retrieval, in particular to a retrieval method, a retrieval device and a readable storage medium based on industry classification.
Background
Information retrieval means retrieving content relevant to a requirement from an existing information database. In retrieval based on industry classification, similar companies or products are retrieved by reference to the industry characteristics of companies and the characteristics of the products they operate, so that a user can conveniently compare and analyze different companies or different products.
In the prior art, as emerging industries appear and existing companies operate across industries, the rules of the existing industry classification cannot accurately identify emerging industries when companies are classified. As a result, when content related to an emerging industry is searched, the related information cannot be accurately acquired, and the search data is incomplete and inaccurate.
Disclosure of Invention
The invention provides a retrieval method, a retrieval device and a readable storage medium based on industry classification, and aims to solve the problem that information retrieval data of emerging industries is incomplete and inaccurate in the prior art.
In order to achieve the above object, the present invention provides a search method based on industry classification, which comprises:
acquiring industry classification files related to industry classification;
performing co-occurrence processing on the contents of the industry classification files to obtain co-occurrence keywords;
updating preset industry classification words according to the co-occurrence keywords;
and reclassifying the associated company information according to the updated preset industry classification words, wherein, after a retrieval request is received, the corresponding company information is acquired according to the preset industry classification words corresponding to the retrieval keywords.
Optionally, the step of performing co-occurrence processing on the contents of the industry classification files to obtain co-occurrence keywords includes:
clustering the industry classification files to obtain at least one file group;
and performing co-occurrence processing on the contents of the industry classification files in the file group to obtain co-occurrence keywords corresponding to the file group.
Optionally, the step of clustering the industry classification files to obtain at least one file group includes:
acquiring file vectors of the industry classification files, and acquiring the distances between the file vectors;
and clustering the industry classification files whose file vectors are separated by less than a preset distance into one file group.
Optionally, the step of performing co-occurrence processing on the contents of the industry classification files in the file group to obtain co-occurrence keywords corresponding to the file group includes:
extracting keywords from the content of each industry classification file;
acquiring the number of times each keyword repeatedly occurs;
and taking the keywords whose number of occurrences is greater than a preset number as the co-occurrence keywords.
Optionally, the step of updating the preset industry classification word according to the co-occurrence keyword includes:
acquiring a word vector of the co-occurrence keyword;
and updating the preset industry classification word according to the word vector and a preset word vector.
Optionally, the step of updating the preset industry classification word according to the word vector and a preset word vector includes:
acquiring the similarity between the word vector and the preset word vector;
and when the similarity is greater than or equal to a preset similarity, updating the preset industry classification word by adopting the co-occurrence keywords associated with the word vector.
Optionally, before the step of updating the preset industry classification words according to the co-occurrence keywords, the method further includes:
screening the co-occurrence keywords;
and the step of updating the preset industry classification words according to the co-occurrence keywords includes:
updating the preset industry classification words by using the screened co-occurrence keywords.
Optionally, the step of screening the co-occurrence keywords includes:
retrieving industry data associated with the co-occurrence keywords;
deleting the co-occurrence keyword when the industry data associated with the co-occurrence keyword is not retrieved.
In order to achieve the above object, the present application provides an industry classification-based search device, which includes a memory, a processor, and an industry classification-based search program stored in the memory and executable on the processor, where the processor implements the method according to any one of the above embodiments when executing the industry classification-based search program.
To achieve the above object, the present application proposes a readable storage medium, on which an industry classification-based retrieval program is stored, which when executed by a processor implements the method according to any one of the above embodiments.
According to the technical scheme, industry classification files related to industry classification are acquired; co-occurrence processing is performed on their contents to obtain co-occurrence keywords; the preset industry classification words are updated according to the co-occurrence keywords; and the associated company information is reclassified according to the updated preset industry classification words, which adjusts the industry category of each company. When a retrieval request is received, company information that matches the request can therefore be obtained. Because the industry classification words are supplemented and updated with co-occurrence keywords extracted from the industry files, they become more accurate, which solves the prior-art problem that retrieval information for emerging industries is incomplete and inaccurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
FIG. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the retrieval method based on industry classification according to the present invention;
FIG. 3 is a schematic flow chart of a second embodiment of the retrieval method based on industry classification according to the present invention;
FIG. 4 is a schematic flow chart of a third embodiment of the retrieval method based on industry classification according to the present invention;
FIG. 5 is a schematic flow chart of a fourth embodiment of the retrieval method based on industry classification according to the present invention;
FIG. 6 is a schematic flow chart of a fifth embodiment of the retrieval method based on industry classification according to the present invention;
FIG. 7 is a schematic flow chart of a sixth embodiment of the retrieval method based on industry classification according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that all the directional indicators (such as up, down, left, right, front, rear, and so on) in the embodiment of the present invention are only used to explain the relative position relationship between the components, the movement situation, etc. in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indicator is changed accordingly.
In addition, the descriptions related to "first", "second", etc. in the present invention are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "connected," "secured," and the like are to be construed broadly, and for example, "secured" may be a fixed connection, a removable connection, or an integral part; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, provided that the combination can be realized by those skilled in the art. When a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist, and it is not within the protection scope of the present invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The device of the embodiment of the invention can comprise a control device of a computer and other devices, such as a server, a mobile terminal device, a centralized controller and the like.
As shown in fig. 1, the apparatus may include: a controller 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the controller 1001 described above.
Those skilled in the art will appreciate that the configuration of the device shown in fig. 1 is not intended to be limiting of the device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The retrieval device may include: a processor 1001, such as a CPU, a memory 1005, a communication bus 1002, and a network interface 1004. The communication bus 1002 is used for implementing connection and communication between the components in the device. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001. As shown in fig. 1, the memory 1005, which is a readable storage medium, may include therein an operating system, a network communication module, and an industry classification-based retrieval program.
As shown in fig. 1, the memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and an industry classification-based retrieval program.
In the server shown in fig. 1, the display screen of the user interface 1003 is mainly used to display the contents of the device configuration, and the controller 1001 may be used to call the industry classification-based retrieval program stored in the memory 1005 and perform the following operations:
acquiring industry classification files related to industry classification;
performing co-occurrence processing on the contents of the industry classification files to obtain co-occurrence keywords;
updating preset industry classification words according to the co-occurrence keywords;
and reclassifying the associated company information according to the updated preset industry classification words, wherein, after a retrieval request is received, the corresponding company information is acquired according to the preset industry classification words corresponding to the retrieval keywords.
Further, the controller 1001 may call the industry classification-based retrieval program stored in the memory 1005, and also perform the following operations:
clustering the industry classification files to obtain at least one file group;
and performing co-occurrence processing on the contents of the industry classification files in the file group to obtain co-occurrence keywords corresponding to the file group.
Further, the controller 1001 may call the industry classification-based retrieval program stored in the memory 1005, and also perform the following operations:
acquiring file vectors of the industry classification files, and acquiring the distances between the file vectors;
and clustering the industry classification files whose file vectors are separated by less than a preset distance into one file group.
Further, the controller 1001 may call the industry classification-based retrieval program stored in the memory 1005, and also perform the following operations:
extracting keywords from the content of each industry classification file;
acquiring the number of times each keyword repeatedly occurs;
and taking the keywords whose number of occurrences is greater than a preset number as the co-occurrence keywords.
Further, the controller 1001 may call the industry classification-based retrieval program stored in the memory 1005, and also perform the following operations:
acquiring a word vector of the co-occurrence keyword;
and updating the preset industry classification words according to the word vector and a preset word vector.
Further, the controller 1001 may call the industry classification-based retrieval program stored in the memory 1005, and also perform the following operations:
acquiring the similarity between the word vector and the preset word vector;
and when the similarity is greater than or equal to a preset similarity, updating the preset industry classification words by using the co-occurrence keyword associated with the word vector.
Further, the controller 1001 may call the industry classification-based retrieval program stored in the memory 1005, and also perform the following operations:
screening the co-occurrence keywords;
and updating the preset industry classification words by using the screened co-occurrence keywords.
Further, the controller 1001 may call the industry classification-based retrieval program stored in the memory 1005, and also perform the following operations:
retrieving industry data associated with the co-occurrence keywords;
and deleting a co-occurrence keyword when no industry data associated with that co-occurrence keyword is retrieved.
First embodiment
Referring to fig. 2, the searching method based on industry classification provided in this embodiment includes:
S100, acquiring industry classification files related to industry classification;
An industry classification file is text data related to an industry. Its content includes enterprise bulletins, research reports and related news, and may also be derived from data on various carriers such as magazines, network articles or newspapers. Specifically, the industry classification files may be acquired by having a retrieval device search the network for industry-related content, or the industry-related data may be added manually by a user.
S200, performing co-occurrence processing on the contents of the industry classification files to obtain co-occurrence keywords;
Co-occurrence keywords are words that appear repeatedly in the industry classification files. In particular, when the industry classification files contain content related to a certain industry, words related to that industry appear many times. Co-occurrence processing means counting how many times each word repeatedly occurs in the industry classification files and determining the co-occurrence keywords from those counts. Specifically, when a word appears in the industry classification files more times than a preset number, that word is determined to be a co-occurrence keyword.
S300, updating preset industry classification words according to the co-occurrence keywords;
The preset industry classification words are words that have been published or used in the prior art for industry classification. A user can classify existing companies or existing products according to the preset industry classification words so as to determine the categories and attributes of the companies and products. Specifically, after the co-occurrence keywords are obtained, they are added to the preset industry classification words according to the relationship between the co-occurrence keywords and the preset industry classification words, so that the industries related to the co-occurrence keywords can be accurately classified through the preset industry classification words.
S400, reclassifying the associated company information according to the updated preset industry classification words, wherein, after a retrieval request is received, the corresponding company information is acquired according to the preset industry classification words corresponding to the retrieval keywords.
After the preset industry classification words are determined, different companies are classified according to them. Specifically, when a reference company has a single main product or operates mainly in a single industry, the related industry classification is selected from the preset industry classification words as the industry classification information of that company. When the reference company has multiple main products or operates mainly in multiple industries, the multiple preset industry classification words associated with the company are selected, and a weight value is determined for each industry according to the proportion of that industry in the company's business, so as to determine the industry classification information of the reference company.
After the industry classification information of the reference company is determined, when the retrieval device detects a retrieval request from a user, it retrieves companies, or the products they operate, according to the request. Specifically, when the request is to retrieve similar companies, the industry of the company to be retrieved is determined first, and a reference company that has the same industry classification word as the company to be retrieved is determined to be similar to it. When the request is to retrieve a product, the product to be retrieved and the industry to which it belongs are determined first, the reference companies are then searched for products in the same or a similar industry, and any product found in this way is judged to be the same as or similar to the product to be retrieved.
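The weighted classification and the similar-company lookup described for step S400 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Company` record, the use of revenue share as the "proportion" of each industry, and the overlap score used for ranking are all hypothetical choices that the text does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Company:
    name: str
    # Preset industry classification word -> weight (share of the business).
    industries: dict = field(default_factory=dict)


def classify(name, revenue_by_industry):
    """Weight each industry word by its share of revenue (assumed proxy for proportion)."""
    total = sum(revenue_by_industry.values())
    if total <= 0:
        raise ValueError("revenue must be positive")
    weights = {word: amount / total for word, amount in revenue_by_industry.items()}
    return Company(name, weights)


def find_similar(target, references):
    """Return (name, score) for reference companies sharing an industry word with target.

    The score (sum of the smaller weight on each shared word) is an
    illustrative ranking choice, not something the text prescribes.
    """
    results = []
    for ref in references:
        shared = target.industries.keys() & ref.industries.keys()
        if shared:
            score = sum(min(target.industries[w], ref.industries[w]) for w in shared)
            results.append((ref.name, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical reference companies and a company to be retrieved.
references = [
    classify("RefA", {"metallurgy": 80.0, "new energy": 20.0}),
    classify("RefB", {"textiles": 100.0}),
    classify("RefC", {"new energy": 100.0}),
]
target = classify("Target", {"new energy": 70.0, "metallurgy": 30.0})
similar = find_similar(target, references)
```

Here `RefB` shares no industry classification word with the target and is therefore not returned, while `RefC` ranks ahead of `RefA` because more of its business lies in the target's dominant industry.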
In the technical scheme provided by this embodiment, industry classification files related to industry classification are acquired; co-occurrence processing is performed on their contents to obtain co-occurrence keywords; the preset industry classification words are updated according to the co-occurrence keywords; and the associated company information is reclassified according to the updated preset industry classification words, which adjusts the industry classification of each company. When a retrieval request is received, company information that matches the request can therefore be obtained. Because the industry classification words are supplemented and updated with co-occurrence keywords extracted from the industry files, they become more accurate, which solves the prior-art problem that retrieval information for emerging industries is incomplete and inaccurate.
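The update in step S300 can be illustrated with the word-vector similarity test named in the disclosure (update when the similarity is greater than or equal to a preset similarity). The sketch below makes several assumptions that the text leaves open: cosine similarity is used as the similarity measure, the vectors are toy two-dimensional values rather than the output of a trained model, and "updating" is taken to mean attaching the keyword to the most similar preset word.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_classification_words(preset, preset_vectors, candidates, min_similarity):
    """Attach each co-occurrence keyword to its most similar preset word.

    preset:          preset industry word -> set of associated words
    preset_vectors:  preset industry word -> preset word vector
    candidates:      co-occurrence keyword -> word vector
    A keyword is added only when its best similarity >= min_similarity.
    """
    updated = {word: set(related) for word, related in preset.items()}
    for keyword, vector in candidates.items():
        best_word, best_sim = None, -1.0
        for word, ref_vector in preset_vectors.items():
            sim = cosine_similarity(vector, ref_vector)
            if sim > best_sim:
                best_word, best_sim = word, sim
        if best_word is not None and best_sim >= min_similarity:
            updated.setdefault(best_word, set()).add(keyword)
    return updated


# Toy data; all words, vectors and the 0.9 threshold are hypothetical.
preset = {"communications": {"wireless network"}, "metallurgy": {"steel"}}
preset_vectors = {"communications": [1.0, 0.0], "metallurgy": [0.0, 1.0]}
candidates = {"5G": [0.9, 0.1], "blockchain": [0.5, 0.5]}
updated = update_classification_words(preset, preset_vectors, candidates, 0.9)
```

In this toy run "5G" is close enough to "communications" to be added, while "blockchain" falls below the threshold for every preset word and is left out.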
Second embodiment
Referring to fig. 3, on the basis of the first embodiment, step S200 includes:
S210, clustering the industry classification files to obtain at least one file group;
The clustering processing determines the relationship between different industry classification files by calculating the similarity between them. Specifically, when the contents of two industry classification files both relate to the metallurgical industry, their similarity is high and they can be placed in the same category. When one industry classification file relates to the metallurgical industry and the other relates to the textile industry, the two files both relate to manufacturing, but the difference between them is large and their similarity is low, so they cannot be placed in the same category.
S220, performing co-occurrence processing on the contents of the industry classification files in the file group to obtain co-occurrence keywords corresponding to the file group.
The co-occurrence keywords refer to the industry nouns which appear in the existing data but do not exist in the existing preset industry nouns. The co-occurrence keywords can thus be determined by analyzing the existing material.
Specifically, vocabulary related to an industry usually appears many times in the industry classification files, and the co-occurrence relation of that vocabulary is determined by counting the word combinations that repeatedly appear in the files. When a specific industry is introduced in an industry classification file, the industry expressions related to it are often mentioned many times; counting these expressions gives their number of occurrences, from which new industry nouns that may exist in the industry classification file are identified.
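Counting repeated word combinations within a file group can be sketched as below. This is only one possible reading: it treats a "word combination" as a pair of adjacent words, assumes the files are already tokenized (word segmentation is outside the scope of the sketch), and uses a hypothetical threshold.

```python
from collections import Counter


def repeated_word_pairs(file_tokens, min_count):
    """Count adjacent word pairs over all files in a group.

    file_tokens: list of token lists, one per industry classification file.
    Returns the pairs that occur more than min_count times.
    """
    counts = Counter()
    for tokens in file_tokens:
        counts.update(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in counts.items() if n > min_count}


# Hypothetical, already tokenized file group.
group = [
    ["shared", "bicycle", "operators", "expand", "shared", "bicycle", "fleets"],
    ["shared", "bicycle", "demand", "grows"],
    ["steel", "output", "grows"],
]
pairs = repeated_word_pairs(group, min_count=2)
```

Only the combination ("shared", "bicycle") occurs more than twice in this group, so it is the single candidate industry expression returned.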
In the second embodiment, before step S300, the method further includes:
screening the co-occurrence keywords.
Accordingly, step S300 includes:
updating the preset industry classification words by using the screened co-occurrence keywords.
After the co-occurrence keywords are determined, they inevitably include redundant words. Some of these are emerging industry nouns for which no definite main-business company or related product exists yet, so that nothing other than the noun itself can be retrieved with them. The co-occurrence keywords therefore need to be screened. Specifically, the screening may analyze the industry classification files to determine how many times a co-occurrence keyword appears in them and thus how important it is. In another embodiment, the companies operating mainly in the industry associated with the co-occurrence keyword are determined; when no such company exists, the keyword attracts little attention and has no business potential for the time being, and it is therefore removed.
In the second embodiment, the step of screening the co-occurrence keywords includes:
retrieving industry data associated with the co-occurrence keywords;
deleting the co-occurrence keyword when the industry data associated with the co-occurrence keyword is not retrieved.
The industry data refers to industry information of industries related to the co-occurrence keywords, and the industry data can be information of a main business company or product information.
Specifically, when an industry has investment value, there are companies whose main business is in that industry. When no company, or only a small number of companies, operate mainly in the industry corresponding to a co-occurrence keyword, investment work cannot be carried out with that industry as a target. Even if the co-occurrence keyword is related to industry content, it therefore needs to be merged with a similar industry name, because the industry does not yet meet the scale requirement.
It is understood that, in another embodiment, when an industry receives serious attention from relevant practitioners, industry analysis data related to that industry may exist, including but not limited to analyses of its market size, competitive landscape and development status. When no such industry analysis data can be found, the industry cannot exist independently or is not receiving attention for the time being, so the co-occurrence keyword needs to be merged with a similar industry name.
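The screening step (retrieve industry data for each co-occurrence keyword and delete the keyword when nothing is retrieved) can be sketched as follows. The lookup function and its in-memory table are hypothetical stand-ins for whatever company or analysis database is actually queried, and the sketch follows the "delete" variant of the step rather than the merge with a similar industry name mentioned above.

```python
# Hypothetical store: keyword -> industry data (main-business companies or reports).
INDUSTRY_DATA = {
    "5G": ["Company X annual report", "5G market size analysis"],
    "quantum tourism": [],
}


def retrieve_industry_data(keyword):
    """Stand-in for a real retrieval call; returns a list that may be empty."""
    return INDUSTRY_DATA.get(keyword, [])


def screen_keywords(keywords, retrieve=retrieve_industry_data):
    """Keep only the co-occurrence keywords for which industry data is retrieved."""
    return [keyword for keyword in keywords if retrieve(keyword)]


kept = screen_keywords(["5G", "quantum tourism", "unknown term"])
```

Keywords with an empty result, or with no entry at all, are dropped, so only "5G" survives the screening in this example.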
Third embodiment
Referring to fig. 4, on the basis of the second embodiment, step S210 includes:
S211, obtaining file vectors of the industry classification files, and obtaining the distances between the file vectors;
S212, clustering the industry classification files whose file vectors are separated by less than a preset distance into one file group.
The industry classification files are processed by the mathematical model according to their content and converted into text vectors (the file vectors). The distance between vectors may be the Euclidean distance, the Manhattan distance, the cosine of the included angle, or any other quantity that can be used to evaluate the relationship between vectors. Whether different industry classification files belong to the same category is judged from the distance between their file vectors: when the distance is greater than the preset distance, the files do not belong to the same category; when the distance is less than or equal to the preset distance, the files belong to the same category. In other words, when the text vector converted from one industry classification file is close to the text vector converted from another industry classification file, the industry content of the two files is related.
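Steps S211 and S212 can be illustrated with a short sketch. Several things here are assumptions made for illustration only: the file vectors are taken as already computed, the Euclidean distance is used, and files are grouped transitively with a small union-find structure, since the description does not name a specific clustering algorithm. The names `cluster_files` and `preset_distance` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_files(
    vectors: Dict[str, Sequence[float]], preset_distance: float
) -> List[List[str]]:
    """Group files whose vectors lie closer together than preset_distance."""
    names = list(vectors)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # S212 uses a strict "smaller than" comparison.
            if euclidean(vectors[first], vectors[second]) < preset_distance:
                parent[find(second)] = find(first)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


# Toy two-dimensional file vectors, invented for the example.
file_vectors = {
    "report_a": [0.0, 0.0],
    "report_b": [0.5, 0.0],
    "report_c": [5.0, 5.0],
}
file_groups = cluster_files(file_vectors, preset_distance=1.0)
```

With these toy vectors, `report_a` and `report_b` fall into one file group and `report_c` forms its own group. Swapping `euclidean` for a Manhattan or cosine-based distance only requires changing the distance function.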
Fourth embodiment
Referring to fig. 5, in the second embodiment, the step S220 includes:
S221, extracting keywords from the content of each industry classification file;
Specifically, in the communication industry, the keywords include but are not limited to links, reference signals, and wireless networks; through the keywords, a user or the retrieval device can determine the industries related to the industry classification files.
S222, obtaining the repeated occurrence frequency of each keyword;
S223, taking the keywords whose number of occurrences is greater than a preset number as the co-occurrence keywords.
When an industry classification file involves several related industries, the main industry of the file is judged from the keywords of the different industries. In a specific implementation, when the industry classification file relates to both the computer industry and the communication industry, it may include related vocabulary such as routing, connection, and hot spots. After the keywords of the different industries in the file are counted, if the computer industry vocabulary occurs more often than the communication industry keywords, the file is judged to be mainly an industry classification file of the computer industry.
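The judgment of the main industry can be sketched as a per-industry keyword count. The vocabulary mapping below is hypothetical and invented for the example (the description does not assign specific words to specific industries), and plain substring counting is a simplification of real keyword extraction.

```python
from typing import Dict, Sequence


def dominant_industry(text: str, industry_vocab: Dict[str, Sequence[str]]) -> str:
    """Return the industry whose vocabulary occurs most often in the text."""
    lowered = text.lower()
    totals = {
        industry: sum(lowered.count(word.lower()) for word in words)
        for industry, words in industry_vocab.items()
    }
    # On a tie, max() returns the industry listed first in the mapping.
    return max(totals, key=totals.get)


# Hypothetical vocabularies for two industries.
vocab = {
    "computer": ["routing", "server"],
    "communication": ["hot spot", "link"],
}
sample_text = "Routing tables on the server decide routing; one hot spot is served."
main_industry = dominant_industry(sample_text, vocab)
```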
When the number of occurrences of a keyword in the industry classification files is greater than the preset number, the keyword is judged to be related to the main content of the industry classification files and is taken as a co-occurrence keyword of those files.
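Steps S221 to S223 can be sketched as counting and thresholding. As an assumption for illustration, keyword extraction is reduced to matching a hypothetical list of candidate keywords against the text of each file in a file group; `preset_times` stands for the preset number of occurrences.

```python
from collections import Counter
from typing import Iterable, List


def co_occurrence_keywords(
    documents: Iterable[str],
    candidate_keywords: Iterable[str],
    preset_times: int,
) -> List[str]:
    """Return the candidate keywords that occur more than preset_times in total."""
    candidates = list(candidate_keywords)
    counts: Counter = Counter()
    for text in documents:
        lowered = text.lower()
        for keyword in candidates:
            # Crude substring count; a real system would tokenize the text first.
            counts[keyword] += lowered.count(keyword.lower())
    return [keyword for keyword in candidates if counts[keyword] > preset_times]


# Toy file group, invented for the example.
docs = [
    "The wireless network uses a reference signal on each link.",
    "A reference signal is sent over the wireless network.",
    "Each link carries data.",
]
frequent = co_occurrence_keywords(
    docs,
    ["wireless network", "reference signal", "link", "hot spot"],
    preset_times=1,
)
```

In this toy group, each of the first three candidates occurs twice and is kept, while "hot spot" never occurs and is dropped.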
Fifth embodiment
Referring to fig. 6, in the first to fourth embodiments, step S300 includes:
S310, obtaining word vectors of the co-occurrence keywords;
S320, updating the preset industry classification words according to the word vector and a preset word vector.
To obtain the preset word vector, the preset industry classification words are added into a mathematical model and the mathematical model is trained; industry feature words are then obtained from the industry classification files by means of the trained mathematical model, and the industry feature words are fed into the mathematical model to obtain the preset word vector. The preset industry classification words are then updated according to the word vector and the preset word vector.
Specifically, training the mathematical model mainly means adjusting its parameters so as to improve the accuracy with which the model screens industry classification words. The preset industry classification words are words already determined to be used for industry classification; because they are accurate industry classification words, the parameters of the mathematical model can be adjusted after they are fed into the model, which completes the training process.
After the mathematical model is trained, the industry classification files are fed into it and analyzed by the trained model, so as to determine the industry classification words that the model screens out of the industry classification files.
The word vector is the vector of a co-occurrence keyword as evaluated by the mathematical model, and the preset word vector is the vector of an industry feature word as evaluated by the mathematical model. The co-occurrence keywords are obtained by co-occurrence processing of the industry classification files, so they may include words unrelated to industry classification. The industry feature words are obtained by feeding the industry classification files into the trained mathematical model, so they are all industry classification words; however, they are subject to the limitations of the mathematical model and may therefore be incomplete. By comparing the word vector with the preset word vector, the degree of similarity between them is determined, and the validity of the word vector is judged accordingly.
Sixth embodiment
Referring to fig. 7, in the above embodiment, step S320 includes:
S321, obtaining the similarity between the word vector and the preset word vector;
S322, when the similarity is greater than or equal to a preset similarity, adopting the co-occurrence keyword associated with the word vector, and updating the preset industry classification words with that co-occurrence keyword.
The similarity represents how similar the word vector and the preset word vector are in terms of industry classification. Specifically, the similarity may be represented by the distance between the word vector and the preset word vector. When the distance is large, the similarity is smaller than the preset similarity, which means that the co-occurrence keyword corresponding to the word vector differs greatly from the industry names, so that co-occurrence keyword is deleted. When the distance is small and the similarity is greater than or equal to the preset similarity, the co-occurrence keyword corresponding to the word vector is strongly correlated with the industry names, so that co-occurrence keyword is retained. After the co-occurrence keywords have been deduplicated, each retained co-occurrence keyword is added to the preset industry classification words under the industry classification of the closest preset word vector, which completes the update of the preset industry classification words.
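Steps S321 and S322 can be sketched as below. The sketch makes several assumptions that are not stated in the description: cosine similarity is used as the similarity measure, there is one preset word vector per industry classification, and the vectors are toy values rather than the output of a trained model. The names `update_classification_words` and `preset_similarity` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_classification_words(
    keyword_vectors: Dict[str, Sequence[float]],
    preset_vectors: Dict[str, Sequence[float]],
    classification_words: Dict[str, List[str]],
    preset_similarity: float,
) -> Dict[str, List[str]]:
    """Add each sufficiently similar keyword to its closest industry classification."""
    for keyword, vector in keyword_vectors.items():
        best_industry, best_similarity = None, -1.0
        for industry, preset_vector in preset_vectors.items():
            similarity = cosine_similarity(vector, preset_vector)
            if similarity > best_similarity:
                best_industry, best_similarity = industry, similarity
        # Keywords below the preset similarity are dropped rather than added.
        if best_industry is not None and best_similarity >= preset_similarity:
            words = classification_words.setdefault(best_industry, [])
            if keyword not in words:
                words.append(keyword)
    return classification_words


# Toy vectors and word lists, invented for the example.
preset_vectors = {"communication": [1.0, 0.0], "computer": [0.0, 1.0]}
keyword_vectors = {"5G base station": [0.9, 0.1], "weather": [0.7, 0.7]}
classification_words = {"communication": ["wireless network"], "computer": ["router"]}
updated = update_classification_words(
    keyword_vectors, preset_vectors, classification_words, preset_similarity=0.8
)
```

Here "5G base station" is close to the communication vector and is added under that classification, while "weather" is roughly equidistant from both preset vectors, falls below the preset similarity, and is discarded.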
In order to achieve the above object, the present application provides an industry classification-based search device, which includes a memory, a processor, and an industry classification-based search program stored in the memory and executable on the processor, and when the processor executes the industry classification-based search program, the processor implements the method according to any one of the above embodiments.
To achieve the above object, the present application proposes a readable storage medium having an industry classification-based search program stored thereon, wherein the industry classification-based search program is executed by a processor to implement the method according to any one of the above embodiments.
In some alternative embodiments, the processor may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may be an internal storage unit of the device, such as a hard disk or internal memory of the device. The memory may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the device. Further, the memory may include both internal and external storage units of the device. The memory is used for storing the computer program and other programs and data required by the device. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. The industry classification-based retrieval method is characterized by comprising the following steps of:
acquiring an industry classification file related to industry classification;
co-occurrence processing is carried out on the contents of the industry classification files to obtain co-occurrence keywords;
updating preset industry classification words according to the co-occurrence keywords;
and reclassifying the associated company information according to the updated preset industry classification words, wherein after receiving a retrieval request, acquiring the corresponding company information according to the preset industry classification words corresponding to the retrieval keywords.
2. The industry classification-based retrieval method of claim 1, wherein the step of co-occurrence processing the contents of the industry classification file to obtain co-occurrence keywords comprises:
clustering the industry classified files to obtain at least one file group;
and co-occurrence processing is carried out on the contents of the industry classification files in the file group to obtain co-occurrence keywords corresponding to the file group.
3. The industry classification-based retrieval method of claim 2, wherein the step of clustering the industry classification documents to obtain at least one document group comprises:
acquiring file vectors of the industry classified files, and acquiring distances among the file vectors;
and clustering the industry classification files corresponding to the file vectors with the distance smaller than the preset distance into one file group.
4. The industry classification-based retrieval method of claim 2, wherein the step of co-occurrence processing the contents of the industry classification documents in the document group to obtain co-occurrence keywords corresponding to the document group comprises:
extracting key words in the content of each industry classification file;
acquiring the repeated occurrence frequency of each keyword;
and taking the keywords with the times larger than the preset times as the co-occurrence keywords.
5. The industry classification-based retrieval method of claim 1, wherein the step of updating the preset industry classification word according to the co-occurrence keyword comprises:
acquiring a word vector of the co-occurrence keyword;
and updating the preset industry classification word according to the word vector and a preset word vector.
6. The industry classification-based retrieval method of claim 5, wherein the step of updating the preset industry classification word according to the word vector and a preset word vector comprises:
acquiring the similarity between the word vector and the preset word vector;
and when the similarity is greater than or equal to a preset similarity, updating the preset industry classification word by adopting the co-occurrence keywords associated with the word vector.
7. The industry classification-based retrieval method of claim 1, wherein before the step of updating the preset industry classification word according to the co-occurrence keyword, the method further comprises:
screening the co-occurrence keywords;
the step of updating the preset industry classification word according to the co-occurrence key word comprises the following steps:
and updating the preset industry classification word by adopting the co-occurrence keywords after screening.
8. The industry classification-based search method of claim 7, wherein the step of screening the co-occurrence keywords comprises:
retrieving industry data associated with the co-occurrence keywords;
deleting the co-occurrence keyword when the industry data associated with the co-occurrence keyword is not retrieved.
9. An industry classification-based retrieval device, comprising a memory, a processor and an industry classification-based retrieval program stored in the memory and executable on the processor, wherein the industry classification-based retrieval program, when executed by the processor, implements the industry classification-based retrieval method according to any one of claims 1 to 8.
10. A readable storage medium on which an industry classification-based search program is stored, the industry classification-based search program, when executed by a processor, implementing the industry classification-based search method according to any one of claims 1 to 8.
CN201910806758.7A 2019-08-27 2019-08-27 Industry classification-based retrieval method and retrieval device and readable storage medium Pending CN110619067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910806758.7A CN110619067A (en) 2019-08-27 2019-08-27 Industry classification-based retrieval method and retrieval device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910806758.7A CN110619067A (en) 2019-08-27 2019-08-27 Industry classification-based retrieval method and retrieval device and readable storage medium

Publications (1)

Publication Number Publication Date
CN110619067A true CN110619067A (en) 2019-12-27

Family

ID=68922611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910806758.7A Pending CN110619067A (en) 2019-08-27 2019-08-27 Industry classification-based retrieval method and retrieval device and readable storage medium

Country Status (1)

Country Link
CN (1) CN110619067A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347318A (en) * 2020-10-26 2021-02-09 杭州数智政通科技有限公司 Method, device and medium for dividing industry classes of enterprises
CN113468414A (en) * 2021-06-07 2021-10-01 广州华多网络科技有限公司 Commodity searching method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169495A (en) * 2011-04-11 2011-08-31 趣拿开曼群岛有限公司 Industry dictionary generating method and device
CN104516903A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Keyword extension method and system and classification corpus labeling method and system
US20160103920A1 (en) * 2014-10-10 2016-04-14 Workdigital Limited System for, and method of, searching data records
CN106372117A (en) * 2016-08-23 2017-02-01 电子科技大学 Word co-occurrence-based text classification method and apparatus
CN107193915A (en) * 2017-05-15 2017-09-22 北京因果树网络科技有限公司 A kind of company information sorting technique and device
CN108197098A (en) * 2017-11-22 2018-06-22 阿里巴巴集团控股有限公司 A kind of generation of keyword combined strategy and keyword expansion method, apparatus and equipment
CN110069610A (en) * 2019-03-16 2019-07-30 平安科技(深圳)有限公司 Search method, device, equipment and storage medium based on Solr

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
葛诗利等: "《面向大学英语教学的通用计算机作文评分和反馈方法研究》", 30 September 2015, 《上海:上海外语教育出版社》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347318A (en) * 2020-10-26 2021-02-09 杭州数智政通科技有限公司 Method, device and medium for dividing industry classes of enterprises
CN112347318B (en) * 2020-10-26 2022-08-02 杭州数智政通科技有限公司 Method, device and medium for dividing industry classes of enterprises
CN113468414A (en) * 2021-06-07 2021-10-01 广州华多网络科技有限公司 Commodity searching method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109634698B (en) Menu display method and device, computer equipment and storage medium
CN110619067A (en) Industry classification-based retrieval method and retrieval device and readable storage medium
CN115858785A (en) Sensitive data identification method and system based on big data
CN108287850B (en) Text classification model optimization method and device
CN108470065B (en) Method and device for determining abnormal comment text
TWI769665B (en) Target data updating method, electronic equipment and computer readable storage medium
CN114117038A (en) Document classification method, device and system and electronic equipment
CN113761185A (en) Main key extraction method, equipment and storage medium
CN105512270B (en) Method and device for determining related objects
WO2022257455A1 (en) Determination metod and apparatus for similar text, and terminal device and storage medium
CN114021716A (en) Model training method and system and electronic equipment
JP3602084B2 (en) Database management device
CN114022086A (en) Purchasing method, device, equipment and storage medium based on BOM identification
JP4128033B2 (en) Profile data retrieval apparatus and program
CN114003665A (en) Data table field relation identification method and device, electronic equipment and storage medium
CN113656575A (en) Training data generation method and device, electronic equipment and readable medium
CN114154480A (en) Information extraction method, device, equipment and storage medium
CN110941719A (en) Data classification method, test method, device and storage medium
CN113836904B (en) Commodity information verification method
JP4306223B2 (en) Evaluation system for document filtering system
CN114638233A (en) News manuscript first-sending identification method, device and equipment
CN117931997A (en) News event combing method and system
CN114003666A (en) Data table field map generation method and device, electronic equipment and storage medium
CN114328976A (en) Evaluation classification method and device, electronic equipment and storage medium
CN116894073A (en) Sensitive data identification method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191227