CN113239257B - Information processing method, information processing device, electronic equipment and storage medium - Google Patents

Information processing method, information processing device, electronic equipment and storage medium

Info

Publication number
CN113239257B
Authority
CN
China
Prior art keywords
entity information
training
target
processed
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110633649.7A
Other languages
Chinese (zh)
Other versions
CN113239257A (en)
Inventor
魏一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202110633649.7A
Publication of CN113239257A
Application granted
Publication of CN113239257B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure provide an information processing method, an information processing device, electronic equipment and a storage medium, wherein the method includes: determining at least one associated search vocabulary associated with at least one search vocabulary to be processed; for each associated search vocabulary, processing the current associated search vocabulary based on a pre-trained target distributed representation model to obtain a word vector to be associated of the current associated search vocabulary; for each word vector to be associated, determining entity information to be processed according to the current word vector to be associated and at least one feature vector, stored in a database in advance, corresponding to entity information to be matched; and determining target entity information according to the entity information to be processed. With the technical scheme of the embodiments of the disclosure, the related target entity information can be determined from the search vocabulary to be processed, thereby improving the user's search efficiency and experience.

Description

Information processing method, information processing device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to an information processing method, an information processing device, electronic equipment and a storage medium.
Background
Currently, searching for the entity information corresponding to user-edited information is usually implemented with a search server, which searches for the corresponding entity information according to simple character-matching rules.
Because user-edited information is random, variable, and irregular, a search server searching on such information may fail to find the corresponding entity information, and/or the entity information found may not match the entity information the user needs. That is, the prior art is inconvenient to use, has low search efficiency, and gives a poor user experience.
Disclosure of Invention
The embodiments of the disclosure provide an information processing method, an information processing device, electronic equipment and a storage medium, so as to achieve the technical effect of improving the user's search efficiency and experience.
In a first aspect, an embodiment of the present disclosure provides an information processing method, including:
determining at least one associated search term associated with the at least one search term to be processed;
for each associated search vocabulary, processing the current associated search vocabulary based on a pre-trained target distributed representation model to obtain a word vector to be associated of the current associated search vocabulary;
for each word vector to be associated, determining entity information to be processed according to the current word vector to be associated and at least one feature vector, stored in a database in advance, corresponding to entity information to be matched;
and determining target entity information according to the entity information to be processed.
In a second aspect, an embodiment of the present disclosure further provides an information processing apparatus, including:
an associated search vocabulary determination module for determining at least one associated search vocabulary associated with the at least one search vocabulary to be processed;
a to-be-associated word vector determining module for processing, for each associated search vocabulary, the current associated search vocabulary according to the pre-trained target distributed representation model to obtain the word vector to be associated of the current associated search vocabulary;
a to-be-processed entity information determining module for determining, for each word vector to be associated, entity information to be processed according to the current word vector to be associated and at least one feature vector, stored in a database in advance, corresponding to entity information to be matched;
and the target entity information determining module is used for determining target entity information according to the entity information to be processed.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
one or more processors;
Storage means for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the information processing method as described in any of the embodiments of the present disclosure.
In a fourth aspect, the disclosed embodiments also provide a storage medium containing computer-executable instructions for performing the information processing method according to any of the disclosed embodiments when executed by a computer processor.
According to the technical scheme of the embodiments of the disclosure, the search vocabulary to be processed can be processed to obtain the associated search vocabulary, and each current associated vocabulary is then processed based on the pre-trained target distributed representation model to obtain the corresponding word vector to be associated, which makes it convenient for a computer to perform the vector matching operation. The entity information to be processed can then be determined according to the current word vector to be associated and the feature vectors, stored in the database in advance, corresponding to the entity information to be matched, and the target entity information is determined from the entity information to be processed, thereby improving the user's search efficiency and experience.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a flow chart of an information processing method according to a first embodiment of the disclosure;
Fig. 2 is a flow chart of an information processing method according to a second embodiment of the disclosure;
Fig. 3 is a flow chart of an information processing method according to a third embodiment of the disclosure;
fig. 4 is a flow chart of an information processing method according to a fourth embodiment of the disclosure;
fig. 5 is a block diagram of an information processing apparatus according to a fifth embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
Example 1
Fig. 1 is a flowchart of an information processing method according to an embodiment of the present disclosure, where the method may be implemented by an information processing apparatus, and the apparatus may be implemented in software and/or hardware, and the hardware may be an electronic device, such as a mobile terminal, a PC, or a server.
In some scenarios, when a user searches for a related university in city A, "university of city A (of the science class)" may be edited in the search edit box. Because the editing format and the edited content are fairly arbitrary, after editing is complete a search based only on field matching of this information may fail to return the content the user wants.
The above schematically illustrates a scenario in which a user searches for college content. In practical application, the method can be applied to any scenario; it is only necessary that the entity information corresponding to that scenario be stored in the database. It will be appreciated that the above scenario is merely illustrative: those skilled in the art will appreciate that, without contradicting the embodiments of the present disclosure, the method may be applied to scenarios in which a user searches for many categories of content, such as enterprise information or medical information.
As shown in fig. 1, the method of the present embodiment includes:
S110, determining at least one associated search vocabulary associated with the at least one search vocabulary to be processed.
It should be noted that a corresponding application program may be developed based on the technical scheme of the embodiments of the present disclosure, or a corresponding page may be integrated based on the technical scheme of the present disclosure. In the case of an integrated page, the page can include a corresponding search control and a content editing control. The corresponding search content may be edited in the content editing control, and after editing of the search content is complete, the search control may be triggered to retrieve the data corresponding to the search content from the database.
The search vocabulary to be processed may be the vocabulary edited in the search edit box; there may be one or more such vocabularies, and the number of vocabularies to be processed equals the number of vocabularies in the edit box. For example, if the edit box contains only one vocabulary, there is one piece of content to be searched. If the edit box contains several vocabularies or special symbols, the vocabularies before and after each special symbol can each be taken as a vocabulary to be searched, so there may be several; alternatively, vocabularies edited by the user at different times may be used. Of course, when the user edits vocabulary, the displayed characters can also be split according to how long the cursor stays between each pair of adjacent characters, yielding several vocabularies to be searched. For example, when one word "A" is edited in the search edit box, the text information "A" received by the system can be taken as the search vocabulary to be processed, and the corresponding associated search vocabulary may be the search vocabulary to be processed itself, "A". When two words "A B" are edited in the search edit box, "A B" may be taken as the vocabulary to be processed, and the associated vocabularies may be the two-word phrase "A B", or the words "A" and "B" obtained by splitting the vocabulary to be processed.
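As a minimal sketch of this splitting step (the separator set is an assumption, since the patent leaves the concrete special symbols open):

```python
import re

# Hypothetical separator set: common whitespace and punctuation are assumed.
SEPARATORS = r"[\s,;:/|、，；]+"

def split_search_input(raw: str) -> list[str]:
    """Split edit-box content on special symbols into search vocabulary
    to be processed, dropping empty fragments."""
    return [term for term in re.split(SEPARATORS, raw) if term]

print(split_search_input("A"))    # ['A']      -> one vocabulary to be processed
print(split_search_input("A B"))  # ['A', 'B'] -> two vocabularies to be processed
```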
And S120, processing the current associated search vocabulary based on a pre-trained target distributed representation model aiming at each associated search vocabulary to obtain a word vector to be associated of the current associated search vocabulary.
The target distributed representation model may be a pre-stored neural network model used to process edited vocabulary and obtain the corresponding word vectors. That is, an associated search vocabulary serves as the input of the model, and after the model processes it, the corresponding word vector to be associated is output; at this point the distributed vector corresponding to the associated search vocabulary has been obtained. The benefit of converting associated search vocabulary into vectors is that determining cosine similarity is more efficient than character matching, so a computer can perform the matching operation more efficiently. In practice there are usually several associated search vocabularies, so the target distributed representation model repeats this step for each associated search vocabulary to obtain the corresponding word vector to be associated.
Illustratively, the associated search vocabularies include A1 and A2. Processing the associated search vocabulary A1 with the pre-trained target distributed representation model yields the corresponding word vector to be associated B1, and processing the associated search vocabulary A2 yields the corresponding word vector to be associated B2. The number of associated search vocabularies is the same as the number of word vectors to be associated.
S130, determining the entity information to be processed according to at least one feature vector corresponding to the entity information to be matched and the current word vector to be associated, which are stored in a database in advance, aiming at each word vector to be associated.
The database may pre-store all the current entity information and the feature vectors thereof, or pre-store the entity information and the feature vectors thereof corresponding to each scene according to the scene. Optionally, entity information is pre-stored according to scene categories, for example, in a page or an APP used for searching colleges and universities, the entity information of each college is pre-stored in an associated database, and in a page or an APP used for searching enterprises, the entity information of each enterprise is pre-stored in an associated database. After the entity information and the feature vectors thereof are stored in the database in advance, each entity information can be used as the entity information to be matched.
It should be noted that, the entity information stored in the database may be set according to actual requirements, and specific contents thereof are not described herein in detail.
In this embodiment, entity information is information characterizing one entity object; it may be associated with at least one feature vector and stored in the database in the form of a mapping table, and the entity information in the database can be regarded as entity information to be matched. For example, the feature vectors corresponding to the entity information "university of science and technology in city A" may be feature vectors representing "China", "city A", "science and technology class", and "colleges", and the feature vectors corresponding to the entity information "university in city B" may be feature vectors representing "China", "city B", "comprehensive class", and "colleges".
In this embodiment, the feature vector with the highest similarity to the current word vector to be associated is determined based on a cosine-similarity algorithm; one or more feature vectors may ultimately be determined. Based on the determined feature vectors, the entity information to be matched corresponding to each feature vector can be determined in the database by table lookup, and the corresponding entity information to be matched is taken as the entity information to be processed.
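The following sketch illustrates this lookup under stated assumptions: the entity_table structure, the scoring of each entity by its best-matching feature vector, and the top_k cut are illustrative choices, since the embodiment only fixes the use of cosine similarity and the table lookup from feature vector back to entity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_entities(word_vector: np.ndarray,
                   entity_table: dict[str, list[np.ndarray]],
                   top_k: int = 3) -> list[tuple[float, str]]:
    """Score every entity to be matched by its best-matching feature
    vector and return the top_k as entity information to be processed."""
    scored = []
    for entity_name, feature_vectors in entity_table.items():
        best = max(cosine_similarity(word_vector, f) for f in feature_vectors)
        scored.append((best, entity_name))
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]
```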
S140, determining target entity information according to the entity information to be processed.
In this embodiment, all of the entity information to be processed may be taken as target entity information; alternatively, entity information to be processed whose similarity is smaller than a preset similarity threshold may be removed, so as to obtain the entity information closest to the search vocabulary to be processed.
For example, for the search vocabulary to be processed "university of science and technology in city A", two pieces of entity information to be processed may be determined, such as "university of electronic science and technology in city A" and "medical university in city A". Both may then be taken as target entity information, or only "university of electronic science and technology in city A", whose vector similarity is greater than the preset threshold, may be taken as target entity information. It should be noted that when the system is set to reject entity information to be processed below the preset similarity threshold, and all of the determined entity information to be processed falls below that threshold, no entity information is displayed.
In practical application, compared with existing recall methods, the target entity information obtained through the above steps can improve the recall rate by up to 20% while the processing speed remains essentially the same.
According to the technical scheme of the embodiments of the disclosure, the search vocabulary to be processed can be processed to obtain the associated search vocabulary, and each current associated vocabulary is then processed based on the pre-trained target distributed representation model to obtain the corresponding word vector to be associated, which makes it convenient for a computer to perform the vector matching operation. The entity information to be processed can then be determined according to the current word vector to be associated and the feature vectors, stored in the database in advance, corresponding to the entity information to be matched, and the target entity information is determined from the entity information to be processed, thereby improving the user's search efficiency and experience.
Example two
Fig. 2 is a flowchart of an information processing method according to a second embodiment of the disclosure. On the basis of the foregoing embodiment, the search vocabulary to be used is extracted from the search vocabulary to be processed according to a preset vocabulary extraction rule, so that invalid content is removed and the computer's processing efficiency is improved. The associated search vocabulary is then determined from the search vocabulary to be used, which strengthens the capability and efficiency of subsequent fuzzy matching, enlarges the matching range during vector matching, and makes efficient use of big data. Further, one or more pieces of target entity information are displayed in a differentiated manner, so that the content closest to the search vocabulary to be processed is shown to the user intuitively, while more content related to the search vocabulary to be processed can be shown according to the user's needs. For the specific implementation, see the technical scheme of this embodiment. Technical terms identical or corresponding to those of the above embodiments are not repeated here.
As shown in fig. 2, the method specifically includes the following steps:
S210, extracting at least one search word to be processed according to a preset word extraction rule to obtain at least one search word to be used.
In scenarios where a search is performed on the content of the search edit box, the edited information is fairly random: the search vocabulary to be processed may contain non-literal content such as spaces and punctuation marks, as well as content such as spoken words that hinders the system in subsequent processing tasks. A vocabulary extraction rule is therefore needed to simplify the search vocabulary to be processed. The extraction rule may be: preset a character library containing space placeholders, various punctuation marks, spoken words, and the like; when the system detects that the content in the search edit box includes content from the preset character library, it rejects the corresponding content.
For example, when the search vocabulary to be processed is "university of city A (of the science class)", three search vocabularies to be used, "city A", "science", and "university", can be determined according to the preset vocabulary extraction rule, thereby removing the space placeholders, the brackets, and the spoken word "of" from the search vocabulary to be processed.
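A minimal sketch of this character-library rejection, assuming a tiny illustrative library (the patent does not fix its contents):

```python
# Hypothetical character library; the patent only says it contains space
# placeholders, punctuation marks, spoken words, and the like.
CHAR_LIBRARY = [" ", "(", ")", "（", "）", "的"]

def extract_terms(to_be_processed: str) -> list[str]:
    """Replace every character-library entry with a blank and split,
    leaving only the search vocabulary to be used."""
    for entry in CHAR_LIBRARY:
        to_be_processed = to_be_processed.replace(entry, " ")
    terms = [t for t in to_be_processed.split() if t]
    if not terms:
        # The edit box consisted entirely of rejectable content.
        raise ValueError("editing error")
    return terms
```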
When the content in the search edit box is detected to consist entirely of content from the character library, the search edit box is emptied, a prompt message of "editing error" is fed back to the front end, and the information processing flow ends.
S220, determining at least one associated search vocabulary according to each search vocabulary to be used.
When a single search vocabulary to be used is determined, it can directly serve as the associated search vocabulary. When several search vocabularies to be used are determined, each of them can serve as an associated search vocabulary, and they can also be freely combined to obtain more associated search vocabularies, as in the sketch after the following example. It should be noted that the system can screen the vocabularies obtained by free combination with a machine-learning-based algorithm to eliminate meaningless or erroneous combination results.
Continuing with the above example, from the search vocabularies to be used "city A", "science", and "university", five associated search vocabularies may be determined: "city A", "science", "university", "city A university", and "university of science".
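A sketch of the free-combination step; joining ordered permutations of the terms is an assumption about how the combinations are formed, and the machine-learning screen that removes meaningless results is not shown:

```python
from itertools import permutations

def build_associated_terms(terms: list[str]) -> list[str]:
    """Each search vocabulary to be used, plus every ordered
    concatenation of two or more of them; a machine-learning screen
    (not shown) would then discard meaningless or erroneous combinations."""
    associated = list(terms)
    for r in range(2, len(terms) + 1):
        for combo in permutations(terms, r):
            associated.append("".join(combo))
    return associated
```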
Based on the preprocessing process of the text information, the capability and the efficiency of the subsequent fuzzy matching are further enhanced, the matching range in the vector matching process is enlarged, and the high-efficiency utilization of big data is realized.
S230, processing the current associated search vocabulary based on a pre-trained target distributed representation model aiming at each associated search vocabulary to obtain a word vector to be associated of the current associated search vocabulary.
S240, for each word vector to be associated, determining a similarity value between the current word vector to be associated and at least one feature vector corresponding to each piece of entity information to be matched.
The similarity value characterizes the degree of similarity between the current word vector to be associated and the feature vectors corresponding to each piece of entity information to be matched: the higher the similarity value, the more similar the entity information corresponding to the feature vector is to the associated search vocabulary. The similarity between the current word vector to be associated and a feature vector can be determined with the cosine-similarity method, and the matching formula can be
Cosine(S1, S2) = (S1 · S2) / (||S1|| ||S2||)
where S1 is the current word vector to be associated and S2 is a feature vector corresponding to a piece of entity information to be matched.
To describe the process of determining the similarity value clearly, take one word vector to be associated and one piece of entity information in the database as an example. Suppose the current word vector to be associated is A1 and the entity information corresponds to three feature vectors B1, B2, and B3; compute the cosine similarity between A1 and B1, between A1 and B2, and between A1 and B3. For example, the resulting cosine similarities are C1, C2, and C3, where C1 > C2 > C3. Repeating this step yields the similarity values between each word vector to be associated and the feature vectors corresponding to each entity.
S250, determining the entity information to be processed from at least one entity information to be matched according to the similarity value between each feature vector and the preset condition.
The preset condition may be: set a similarity threshold, select the feature vectors whose similarity is greater than the threshold, and take the entity information to be matched corresponding to those feature vectors as the entity information to be processed. The preset condition may also be: when several determined feature vectors correspond to one piece of entity information, compute the mean or variance of their similarities, and if the result is greater than the corresponding preset threshold, determine that entity information as entity information to be processed. The preset condition may also be a preset number: sort the similarity values of the entity information and take the preset number of entities with the highest similarity values as the entity information to be processed. Those skilled in the art should understand that, in practical application, various conditions for determining the entity information to be processed may be set as required; the embodiments of the disclosure are not specifically limited here. A minimal sketch of two of these conditions follows.
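The sketch below implements the similarity threshold and the preset-number conditions; the parameter values are illustrative:

```python
from typing import Optional

def select_to_be_processed(candidates: list[tuple[float, str]],
                           threshold: Optional[float] = None,
                           top_n: Optional[int] = None) -> list[tuple[float, str]]:
    """candidates: (similarity, entity) pairs. Apply a similarity
    threshold and/or keep only a preset number of top-ranked entities."""
    ranked = sorted(candidates, reverse=True)
    if threshold is not None:
        ranked = [(s, e) for s, e in ranked if s > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked
```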
S260, determining target entity information according to the entity information to be processed.
Optionally, determining the target entity information includes at least one of:
1) Taking the information of each entity to be processed as target entity information; that is, all the entity information to be processed is taken as target entity information.
2) Taking repeated entity information in each piece of entity information to be processed as target entity information;
3) If the entity information to be processed comprises repeated entity information to be processed, reserving one of the repeated entity information to be processed, and taking the finally reserved entity information to be processed as target entity information;
4) Determining target entity information according to the similarity value corresponding to each entity information to be processed and the number of associated search words;
5) And taking the entity information to be processed corresponding to each associated search word and having the highest similarity value as target entity information.
For example, when the entity information to be processed is determined to be "university of electronic science and technology in city A", "medical university in city A", and "university in city A": according to the first mode, all three pieces of information may be taken as target entity information. Because there may be several associated search vocabularies, when "university of electronic science and technology in city A" is determined two or more times from different associated search vocabularies, according to the second mode only it may be taken as target entity information. On the basis of the second mode, if the user is to be given as many search results as possible, the third mode keeps the multiply-determined "university of electronic science and technology in city A" while also taking the once-determined "medical university in city A" and "university in city A" as target entity information. According to the fourth mode, when "university of electronic science and technology in city A" corresponds to three associated search vocabularies and its similarity to each of them is greater than the preset threshold, it may be taken as target entity information; that is, this mode uses both the similarity of the entity information to be processed and the number of corresponding associated search vocabularies as the basis for judgment. On the basis of the fourth mode, if the recall result closest to the search vocabulary to be processed is to be provided to the user, the fifth mode directly takes the entity information to be processed with the highest similarity, namely "university of electronic science and technology in city A", as the target entity information.
And S270, displaying the target entity information in a target display area.
The target display area may be an area corresponding to a display control in a page or an APP, or may be an area only used for displaying information, where only one target entity information may be displayed, or a plurality of target entity information may be displayed, so as to display recall results to a user.
Optionally, when target entity information is displayed in an area used only for displaying information, one or more pieces may be shown, up to a preset number. When target entity information is displayed on a display control, the piece with the highest similarity can be displayed at the position corresponding to the target control, so the user can easily identify the target entity information that best matches the search vocabulary to be processed; meanwhile, the remaining target entity information is hidden in a drop-down menu corresponding to the target control and is displayed when the drop-down menu is detected to be triggered.
A button for displaying the drop-down menu upon a trigger instruction is provided in the target control. It can be understood that when several pieces of target entity information exist, in order to let the user see the recall result most relevant to the search vocabulary to be processed at first glance, only the target entity information with the highest similarity may be displayed in the target control, for example "Chengdu University of Electronic Science and Technology", the entity most similar to "Chengdu university (of the science class)". If the user wants to see other recall results, the drop-down-menu button can be clicked with an editing device such as a mouse; when the system receives the trigger operation, it displays the remaining target entity information, such as "Sichuan University" and "Chengdu University". In practical application, the entity information in the drop-down menu can serve as candidate entity information for the target entity information with the highest similarity.
Optionally, the remaining target entity information is hidden in the menu corresponding to the target control, ordered by the corresponding similarity values, so that when the menu is detected to be triggered, the pieces of target entity information are displayed in order in a list.
The ranking may be performed from high to low according to the similarity values of the target entity information; in practical application, the target entity information may also be scored with a machine-learning correlation algorithm and ranked from high to low by score. When an instruction triggering the drop-down menu is detected, each piece of target entity information is displayed in the menu according to the ranking result, which improves the user's experience during the search.
Displaying one or more pieces of target entity information in a differentiated manner shows the user the content closest to the search vocabulary to be processed in an intuitive way, while more content related to the search vocabulary to be processed can be displayed according to the user's needs.
According to the technical scheme of this embodiment, the search vocabulary to be used is extracted from the search vocabulary to be processed according to the preset vocabulary extraction rule, removing content that hinders system processing. The associated search vocabulary is then determined from the search vocabulary to be used, which strengthens the capability and efficiency of subsequent fuzzy matching, enlarges the matching range during vector matching, and makes efficient use of big data. Further, one or more pieces of target entity information are displayed in a differentiated manner, so that the content closest to the search vocabulary to be processed is shown to the user intuitively, while more related content can be shown according to the user's needs.
Example III
Fig. 3 is a flow chart of an information processing method according to a third embodiment of the present disclosure, where, based on the foregoing embodiment, a target distributed representation model may be obtained through pre-training, so that input data may be processed based on the target distributed representation model to obtain corresponding feedback data. The specific implementation manner can be seen in the technical scheme of the embodiment. Wherein, the technical terms identical to or corresponding to the above embodiments are not repeated herein.
As shown in fig. 3, the method specifically includes the following steps:
S310, a first training sample set and a second training sample set are obtained.
The first training sample set is used to pre-train the distributed representation model to be trained. To improve the accuracy of the trained distributed representation model, the first training sample set should be as large and as rich as possible. The first training sample set includes several pieces of first positive-sample training data and first negative-sample training data: each piece of first positive-sample training data includes pre-training entity information and at least one feature associated with that entity information, and each piece of first negative-sample training data includes pre-training entity information and at least one feature not associated with it. It should be noted that all positive samples in the first training sample set are called first positive-sample training data, and all negative samples are called first negative-sample training data. The second training sample set likewise includes several pieces of training sample data and is used to retrain the pre-trained model into a target distributed representation model capable of converting input vocabulary into word vectors.
To understand the positive and negative samples in the first training sample set clearly, a specific example follows. Since the technical solution of the present disclosure mainly recalls corresponding entity information, it can be described by a concrete entity and its features. For example, if the entity information is "Chengdu University of Electronic Science and Technology" and the feature field is the country, the feature associated with the entity information is "China" and a feature not associated with it may be "United States". Based on this, "Chengdu University of Electronic Science and Technology - China" may serve as first positive-sample training data, and "Chengdu University of Electronic Science and Technology - United States" as first negative-sample training data.
S320, training the to-be-trained distributed representation model based on the first training sample set to obtain a pre-trained distributed representation model.
The distributed representation model to be trained is a model with default model parameters, namely, the model parameters in the model are initial values. The pre-trained distributed representation model is trained based on first positive sample training data and first negative sample training data in a first training sample set. Model parameters in the pre-trained distributed representation model are adjusted parameters, different from the initial values.
It can be understood that, since the content edited in the content edit box is random and varied, in order to improve the accuracy of determining target entity information from the searched content, the large amount of data in the entity database and the corresponding entity features can be fully used for pre-training. Optionally, for each entity with characteristic fields (e.g., the entity's location and industry), a pair of training data is generated from the entity name and a characteristic field. The entity together with one of its own characteristic fields serves as a positive sample; the entity together with a characteristic field obtained by random negative sampling, i.e., a non-matching characteristic field, serves as a negative sample. The set output value for positive samples is a first preset value and for negative samples a second preset value; for example, first positive-sample training data outputs 1 and first negative-sample training data outputs 0. These samples serve as the input and output of the distributed representation model to be trained, and the model parameters in the model are adjusted until all training sample data in the first training sample set have participated in training. For example, with [SEP] as the separator between entity name and feature, the input "[CLS] Chengdu University of Electronic Science and Technology [SEP] China [SEP]" has output [CLS] = 1, and the input "[CLS] Chengdu University of Electronic Science and Technology [SEP] United States [SEP]" has output [CLS] = 0.
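A sketch of how such pre-training pairs might be assembled; build_pretraining_pairs and its inputs are hypothetical names, and only the [CLS]/[SEP] serialization and the 1/0 labels come from the embodiment:

```python
import random

def build_pretraining_pairs(entities: dict[str, set[str]],
                            all_features: list[str]) -> list[tuple[str, int]]:
    """For every (entity, associated feature) pair, emit one positive
    sample labelled 1 and one randomly negative-sampled pair labelled 0,
    serialized in the [CLS] entity [SEP] feature [SEP] form."""
    samples = []
    for name, features in entities.items():
        for feature in features:
            samples.append((f"[CLS] {name} [SEP] {feature} [SEP]", 1))
            negative = random.choice(
                [f for f in all_features if f not in features])
            samples.append((f"[CLS] {name} [SEP] {negative} [SEP]", 0))
    return samples

pairs = build_pretraining_pairs(
    {"Chengdu University of Electronic Science and Technology": {"China"}},
    ["China", "United States"],
)
# ('[CLS] Chengdu University of Electronic Science and Technology [SEP] China [SEP]', 1)
# ('[CLS] Chengdu University of Electronic Science and Technology [SEP] United States [SEP]', 0)
```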
To determine whether the pre-trained distributed representation model is usable, its accuracy may be evaluated. Optionally, the pre-trained distributed representation model is verified with a test sample set corresponding to the first training sample set; when the accuracy of the pre-trained distributed representation model is detected to be within the preset range, the target distributed representation model can be trained on the basis of the pre-trained distributed representation model.
The test sample set likewise includes several pieces of test sample data, whose content has essentially the same form as the first training sample set but differs in the specific content. The data in the test sample set are also of the form entity - associated feature - expected output and entity - unassociated feature - expected output. Each piece of test sample data can be fed into the pre-trained distributed representation model to obtain a corresponding actual output; repeating this step yields several actual outputs. From the actual output and the expected output of each test sample, the accuracy of the pre-trained distributed model can be obtained. For example, if there are 1000 pieces of test sample data and 800 of the model's actual outputs match the set outputs in the test sample data, the accuracy of the pre-trained distributed representation model is determined to be 0.8.
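A minimal accuracy check along these lines, assuming the model is callable on a test input:

```python
def model_accuracy(model, test_samples) -> float:
    """test_samples: (input_text, expected_output) pairs. Returns the
    fraction of samples whose actual output matches the expected one,
    e.g. 800 matches out of 1000 samples gives 0.8."""
    hits = sum(1 for text, expected in test_samples if model(text) == expected)
    return hits / len(test_samples)
```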
If the preset accuracy threshold is 0.95 or above, a pre-trained distributed representation model with an accuracy of 0.8 does not meet it; in that case the first training sample set can be re-acquired, the model retrained, and the retrained model verified again until a usable pre-trained distributed representation model is obtained. If instead the preset accuracy threshold is 0.75, the determined accuracy of the pre-trained distributed representation model exceeds the preset threshold, and the pre-trained distributed representation model can then be trained on the basis of the second training sample set to obtain the target distributed representation model.
S330, inputting the data in the second training sample set into the pre-training distributed representation model to obtain a second training vector corresponding to the data in the second training sample set.
In general, to improve the accuracy of the model, the training data in the second training sample set should also be as plentiful and rich as possible. The second training sample set is used to train the target distributed representation model. It should be noted that the aim is a model that can use semantic information to determine the similarity between an input and an entity name, so that after a text is input, word vectors similar to those of the entities associated with that text can be obtained. The training sample data may be based on a user's actual input, the actual entity information corresponding to that input, and a negative-sampled input corresponding to the actual entity; that is, each piece of second training sample data in the second training sample set includes an actual input, the actual entity information corresponding to the actual input, and a negative-sampled input corresponding to the actual entity. The three texts in a piece of training sample data can therefore each be fed into the pre-trained distributed representation model to obtain three word vectors. Correspondingly, the three word vectors obtained can be taken as a second training vector; that is, the second training vector comprises three word vectors, each corresponding to one piece of information in the training sample data.
In order for the resulting model to convert input vocabulary into the vectors that best match it, the three word vectors obtained can be processed further, and the accuracy and generality of the pre-trained distributed representation model can be adjusted based on the processing result.
To describe the technical scheme of this embodiment clearly, take the processing of one piece of second training sample data as an example. Second training sample data 1 in the second training sample set may be: an actual input vocabulary A, the actual entity information B corresponding to A, and a negative-sampled (irrelevant) input G for B. Feeding the actual input vocabulary A into the pre-trained distributed representation model yields an output vector A'; feeding in the actual entity information B yields an output vector B'; and feeding in the negative-sampled (irrelevant) input G yields an output vector G'. A', B', and G' may be taken as the vectors in the second training vector.
And S340, correcting the preset loss function in the pre-training distributed representation model based on the back propagation algorithm and the second training vector.
Specifically, a loss function is preset, and the model parameters in the model can be corrected based on it. The loss function can take the form
Loss = max(margin + Distance(query, matched_entity) - Distance(query, negative_entity), 0)
where Loss is the model loss, margin is a preset interval, Distance(query, matched_entity) is the distance between the actually input word and the actual entity information, and Distance(query, negative_entity) is the distance between the negatively sampled input word and the actual entity information. Distance(s1, s2) denotes the distance between two vectors, defined as Distance(s1, s2) = 1 - Cosine(s1, s2), where Cosine(s1, s2) is the cosine similarity between the two vectors.
Building on the above example, the distance between A' and B' and the distance between B' and G' may be computed, the specific formula being Distance(s1, s2) = 1 - Cosine(s1, s2). For example, the distance between A' and B' is obtained by computing their cosine similarity and subtracting it from 1, giving a distance l1; the distance l2 between B' and G' is computed in the same way; substituting into the loss function gives Loss = max(margin + Distance(A', B') - Distance(B', G'), 0) = max(preset interval + l1 - l2, 0). If preset interval + l1 - l2 is greater than 0, the loss function value Loss = preset interval + l1 - l2; otherwise it is 0.
The training may be performed in the above manner for each second training sample data.
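A sketch of this loss for one (A', B', G') triple, following the distances used in the worked example; the default margin value is an assumption:

```python
import numpy as np

def distance(s1: np.ndarray, s2: np.ndarray) -> float:
    # Distance(s1, s2) = 1 - Cosine(s1, s2), as defined above
    cosine = float(np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2)))
    return 1.0 - cosine

def margin_loss(a_vec: np.ndarray, b_vec: np.ndarray, g_vec: np.ndarray,
                margin: float = 0.2) -> float:
    """Loss for one (A', B', G') triple: l1 = Distance(A', B'),
    l2 = Distance(B', G'), Loss = max(margin + l1 - l2, 0)."""
    l1 = distance(a_vec, b_vec)   # actual input vs. actual entity
    l2 = distance(b_vec, g_vec)   # actual entity vs. negative-sampled input
    return max(margin + l1 - l2, 0.0)
```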
In this embodiment, the target distributed representation model trained in the above manner can accurately process the input text, thereby obtaining the word vector associated with the entity of the input text.
S350, training the pre-training distributed representation model by taking convergence of a preset loss function as a training target so as to obtain the target distributed representation model.
Specifically, the training error of the loss function, that is, the loss parameter, may be used as the condition for detecting whether the loss function has converged, for example whether the training error is smaller than a preset error, whether the error trend has stabilized, or whether the current iteration count equals a preset count. If the convergence condition is detected, for example the training error of the loss function falls below the preset error or the error change stabilizes, training of the pre-trained distributed representation model is complete and iterative training can stop. If convergence is not detected, further data from the second training sample set can be obtained to continue training the pre-trained distributed representation model until the training error of the loss function is within the preset range. When the training error of the loss function converges, the pre-trained distributed representation model can be taken as the target distributed representation model; that is, each associated search vocabulary can now be processed with this model to obtain the corresponding word vector to be associated. A sketch of this stopping rule follows.
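A minimal sketch, assuming a hypothetical model.step method that performs one update and returns the batch loss:

```python
def train_until_converged(model, sample_batches,
                          eps: float = 1e-4, max_iters: int = 10000):
    """Iterate until the training error is below a preset error or its
    change stabilizes, or until the preset iteration count is reached."""
    prev_loss = float("inf")
    for _, batch in zip(range(max_iters), sample_batches):
        loss = model.step(batch)              # one update; returns batch loss
        if loss < eps or abs(prev_loss - loss) < eps:
            break                             # convergence condition reached
        prev_loss = loss
    return model
```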
S360, determining at least one associated search vocabulary associated with the at least one search vocabulary to be processed.
And S370, processing the current associated search vocabulary based on a pre-trained target distributed representation model aiming at each associated search vocabulary to obtain a word vector to be associated of the current associated search vocabulary.
S380, determining the entity information to be processed according to at least one feature vector corresponding to the entity information to be matched and the current word vector to be associated, which are stored in the database in advance, aiming at each word vector to be associated.
S390, determining target entity information according to the entity information to be processed.
According to the technical scheme of this embodiment, the pre-trained distributed representation model is obtained from the first training sample set, so that the feature information corresponding to each entity is fused into the model; the pre-trained distributed representation model is then trained on the second training sample set to obtain the target distributed representation model. Input information can be processed based on this model to obtain the association vector that best matches the corresponding entity information, and the target entity information that best matches the association vector is then retrieved from the database, improving both the convenience and the accuracy of determining target entity information.
Example IV
Fig. 4 is a flowchart of an information processing method according to a fourth embodiment of the disclosure. On the basis of the foregoing embodiments, in order to quickly retrieve the target entity information corresponding to the vocabulary to be searched from the database, different types of entity information may be stored differentially. Meanwhile, so that the corresponding resources can still be used while data is being updated, the resources can be updated in the manner disclosed in this embodiment. For the specific implementation, see the technical scheme of this embodiment. Technical terms identical or corresponding to those of the above embodiments are not repeated here.
As shown in fig. 4, the method specifically includes the following steps:
S410, determining entity types of the entity information to be matched, and storing the corresponding entity information to be matched in a partition mode according to the data quantity corresponding to each entity type so as to determine the entity information to be processed corresponding to the associated search vocabulary from each partition.
In order to quickly match the corresponding entity in the database, the entity information to be matched may be stored in partitions according to its entity type, where the entity type may correspond to the user's search scenario. For example, the database may store entity information corresponding to hospitals, colleges, and companies, i.e., hospital names, college names, and company names; in that case hospitals, colleges, and companies are the entity categories. To achieve orderly storage, data of different entity categories can be stored separately, e.g., one storage space for the college category and another for the company category. For example, if the "college" category has 1000 pieces of entity information and the "company" category has 5000, the college entity information may be stored in table A of the database and the company entity information in table B. Of course, for precise storage, the storage mode can further be determined by the data volume corresponding to each entity category, where data volume covers both the number of data items and the data size, i.e., the storage space occupied.
Optionally, determining an actual data volume corresponding to each entity class; storing entity information to be matched corresponding to a target entity class with the actual data quantity smaller than a first preset data quantity threshold value into a first partition; storing entity information to be matched corresponding to a target entity class with the actual data volume larger than the first preset data volume threshold and smaller than the second preset data volume threshold into a second partition; and storing entity information to be matched corresponding to the target entity class with the actual data quantity larger than the second preset data quantity threshold value into a third partition, and storing the entity information in the third partition in a layering manner.
In this embodiment, after the data-volume thresholds are preset, different types of entity information can be stored differently. Specifically, data below the first preset data-volume threshold is stored exactly in the first partition of the database; that is, the feature vectors corresponding to the entity information are stored in complete form. With the first preset data-volume threshold set to 2000, the 1000 pieces of "college" entity information can all have their feature vectors stored completely in the first partition. Entity information whose volume is greater than the first preset data-volume threshold and smaller than the second is stored in the normal mode in the second partition. Data above the second preset data-volume threshold is stored hierarchically: the feature vectors corresponding to part of the entity information are stored exactly, while the feature vectors corresponding to the rest are stored lossily, the lossy mode compressing the vectors with a product-quantization method. For example, with the second preset data-volume threshold set to 4000 and 5000 pieces of "company" entity information, the feature vectors corresponding to the companies can be stored hierarchically in the third partition.
It should be noted that the entity information to be matched may be stored in the third partition in layers according to its calling frequency. It can be understood that the system may continuously record the number of times each entity information item to be matched is determined as target entity information; when part of the entity information in the third partition has been determined as target entity information and called multiple times, the system may treat it as frequently searched information and store its corresponding feature vectors exactly.
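A sketch of the frequency-based layering inside the third partition, assuming a hypothetical promotion threshold and eliding the actual product-quantization compression:

    class LayeredPartition:
        """Third partition: frequently called entities keep exact feature
        vectors; the rest keep compressed codes (compression elided here).
        """

        def __init__(self, promote_after=10):  # promote_after is an assumed knob
            self.promote_after = promote_after
            self.hits = {}        # entity -> times chosen as target entity
            self.exact = {}       # entity -> full feature vector
            self.compressed = {}  # entity -> e.g. product-quantization codes

        def record_hit(self, entity, full_vector):
            self.hits[entity] = self.hits.get(entity, 0) + 1
            if self.hits[entity] >= self.promote_after and entity not in self.exact:
                # searched often enough: store the exact vector from now on
                self.exact[entity] = full_vector
                self.compressed.pop(entity, None)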
By storing different classes of entity information differentially, limited storage resources are used efficiently.
S420, when at least one partition is detected to have data update, adding an entity information identifier and a partition identifier corresponding to the updated data into a target update list; and updating the corresponding entity information in the database according to the update type in the target update list.
The entity information identifiers and partition identifiers of added and deleted entities are stored in the target update list. The benefit of the target update list is that it makes it possible to quickly determine which entity information, in which partition, needs to be updated, thereby improving update and storage efficiency.
In practical applications, the system may adopt a dual-partition update strategy based on a primary update partition and a backup update partition. It can be understood that, after receiving updated data, the system determines the corresponding entity information identifier and partition identifier from the target update list and updates the entity information in the backup update partition, for example with a cloud database serving as the backup update partition. During the update, the system continues to use the primary update partition normally, for example with a local database serving as the primary update partition, so the update process does not affect normal operation of the system. After the data update is completed, the identifiers of the primary and backup update partitions may be swapped, so that the backup update partition serves as the primary update partition. While an update is in progress, entity information may be retrieved by first searching all partitions except the primary update partition based on the search vocabulary to be processed, then searching the backup update partition, and merging the two result sets into the final search result. It should be noted that updating entity information includes operations such as addition and deletion, which the embodiments of the present disclosure do not specifically limit.
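The dual-partition update strategy can be sketched as follows; the dict-based layout, the entry format of the target update list, and the merge rule are simplifying assumptions:

    class DualPartitionDB:
        """Primary/backup update strategy: reads keep using the primary
        (e.g. a local database) while updates are applied to the backup
        (e.g. a cloud database); roles are swapped once the batch is done.
        """

        def __init__(self, primary, backup):
            # each side: {partition_id: {entity_id: feature_vector}}
            self.primary = primary
            self.backup = backup

        def apply_updates(self, target_update_list):
            # entries assumed as (entity_id, partition_id, update_type, payload)
            for entity_id, partition_id, update_type, payload in target_update_list:
                partition = self.backup.setdefault(partition_id, {})
                if update_type == "add":
                    partition[entity_id] = payload
                elif update_type == "delete":
                    partition.pop(entity_id, None)
            # after the update completes, the backup becomes the primary
            self.primary, self.backup = self.backup, self.primary

        def search_during_update(self, query_fn):
            # query the partitions still in normal use, then the freshly
            # updated backup, and merge the two result sets
            return query_fn(self.primary) + query_fn(self.backup)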
S430, determining at least one associated search term associated with the at least one search term to be processed.
S440, processing the current associated search vocabulary based on a pre-trained target distributed representation model aiming at each associated search vocabulary to obtain a word vector to be associated of the current associated search vocabulary.
S450, determining the entity information to be processed according to at least one feature vector corresponding to the entity information to be matched and the current word vector to be associated, which are stored in the database in advance, aiming at each word vector to be associated.
S460, determining target entity information according to each entity information to be processed.
On the basis of the above technical solution, it should be noted that the above steps are merely exemplary and imply no required order of execution.
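Read together, S430-S460 describe a retrieval pipeline. The following end-to-end sketch assumes stand-in interfaces: an association function, an embedding callable for the trained distributed representation model, a dict-shaped database, and an arbitrary similarity threshold:

    import numpy as np

    def cosine(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def search(pending_words, associate, embed, db, threshold=0.8):
        """S430-S460 in miniature.

        associate: pending search word -> list of associated search words
        embed:     word -> word vector (the trained model)
        db:        {entity_name: [feature_vector, ...]}
        """
        candidates = []
        for word in pending_words:
            for assoc in associate(word):                  # S430
                qv = embed(assoc)                          # S440
                for name, vectors in db.items():           # S450
                    best = max(cosine(qv, fv) for fv in vectors)
                    if best >= threshold:
                        candidates.append((name, best))
        # S460: one simple selection rule -- keep each entity once,
        # at its highest similarity
        targets = {}
        for name, score in candidates:
            targets[name] = max(score, targets.get(name, 0.0))
        return sorted(targets, key=targets.get, reverse=True)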
According to the technical scheme of this embodiment, different classes of entity information are stored differentially, which improves calling efficiency when the corresponding entity information is retrieved from the database; further, updating the entity information according to the specific update strategy described above avoids inaccurate search results during the update process.
Example five
Fig. 5 is a block diagram of an information processing apparatus according to a fifth embodiment of the present disclosure, which is capable of executing the information processing method according to any embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the execution method. As shown in fig. 5, the apparatus specifically includes: the search word association determination module 510, the word vector to be associated determination module 520, the entity information to be processed determination module 530, and the target entity information determination module 540.
An associated search term determination module 510 for determining at least one associated search term associated with the at least one pending search term.
The to-be-associated word vector determining module 520 is configured to process, for each associated search word, the current associated search word based on a pre-trained target distributed representation model, to obtain a to-be-associated word vector of the current associated search word.
The entity information to be processed determining module 530 is configured to determine, for each word vector to be associated, entity information to be processed according to at least one feature vector corresponding to each entity information to be matched and a current word vector to be associated, which are stored in the database in advance.
The target entity information determining module 540 is configured to determine target entity information according to each entity information to be processed.
On the basis of the above technical solution, the related search vocabulary determining module 510 includes a search vocabulary determining unit to be used and a related search vocabulary determining unit.
And the to-be-used search vocabulary determining unit is used for extracting the at least one to-be-processed search vocabulary according to a preset vocabulary extraction rule to obtain at least one to-be-used search vocabulary.
And the associated search vocabulary determining unit is used for determining at least one associated search vocabulary according to each search vocabulary to be used.
Based on the above technical solution, the to-be-associated word vector determining module 520 includes a training sample set obtaining unit, a pre-training distributed representation model determining unit, a second training vector determining unit, a correcting unit, and a target distributed representation model training unit.
The training sample set acquisition unit is used for acquiring a first training sample set and a second training sample set; the first training sample set comprises a plurality of first positive sample training data and first negative sample training data, wherein the first positive sample training data comprises pre-training entity information and at least one feature corresponding to the pre-training entity information; the first negative-sample training data includes pre-training entity information and at least one feature not associated with the pre-training entity information.
And the pre-training distributed representation model determining unit is used for training the distributed representation model to be trained based on the first training sample set to obtain a pre-training distributed representation model.
The second training vector determining unit is used for inputting the data in the second training sample set into the pre-training distributed representation model to obtain a second training vector corresponding to the data in the second training sample set; wherein the data in the second training sample set comprises actual entity information corresponding to input search information and a negative sampling input corresponding to the actual entity information.
And the correction unit is used for correcting the preset loss function in the pre-training distributed representation model based on a back propagation algorithm and the second training vector.
And the target distributed representation model training unit is used for training the pre-training distributed representation model by taking convergence of the preset loss function as a training target so as to train the pre-training distributed representation model to obtain the target distributed representation model.
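As a hypothetical illustration of what these units compute, the two-stage training can be sketched in PyTorch; the model architecture, loss, and data layout are assumptions, since the disclosure does not fix a concrete implementation:

    import torch
    import torch.nn as nn

    class DistributedRepr(nn.Module):
        """Toy stand-in for the distributed representation model."""
        def __init__(self, vocab_size=10000, dim=128):
            super().__init__()
            self.emb = nn.EmbeddingBag(vocab_size, dim)

        def forward(self, token_ids):  # token_ids: (batch, seq_len)
            return self.emb(token_ids)

    def train_stage(model, loader, epochs=1, lr=1e-3):
        """One training stage: a contrastive loss pulls entity/feature
        pairs together for positive samples (label +1) and pushes them
        apart for negative samples (label -1).
        """
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CosineEmbeddingLoss()
        for _ in range(epochs):
            for entity_ids, feature_ids, label in loader:
                opt.zero_grad()
                loss = loss_fn(model(entity_ids), model(feature_ids), label)
                loss.backward()  # the back-propagation step named in the text
                opt.step()

    # stage 1: pre-train on the first sample set; stage 2: continue on the
    # second sample set until the preset loss converges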
Based on the above technical solution, the entity information determining module to be processed 530 includes a similarity value determining unit and an entity information determining unit to be processed.
And the similarity value determining unit is used for determining a similarity value between the current word vector to be associated and at least one feature vector corresponding to each entity information to be matched.
The entity information to be processed determining unit is used for determining the entity information to be processed from the at least one entity information to be matched according to the similarity values of the feature vectors and a preset condition.
Optionally, determining the target entity information according to the entity information to be processed includes at least one of the following ways:
taking each entity information to be processed as target entity information;
taking entity information that is repeated across the entity information to be processed as target entity information;
if the entity information to be processed includes repeated entity information, reserving one copy of the repeated entity information and taking the finally reserved entity information to be processed as target entity information;
determining the target entity information according to the similarity value corresponding to each entity information to be processed and the quantity of associated search words;
and taking, for each associated search word, the corresponding entity information to be processed with the highest similarity value as target entity information.
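Two of these selection rules, sketched over (entity_name, similarity) candidate pairs; the data layout is an assumption:

    from collections import Counter

    def dedupe_keep_one(candidates):
        """Keep one copy of each repeatedly recalled entity."""
        seen, targets = set(), []
        for name, score in candidates:
            if name not in seen:
                seen.add(name)
                targets.append((name, score))
        return targets

    def repeated_only(candidates):
        """Take only entities recalled by more than one associated word."""
        counts = Counter(name for name, _ in candidates)
        return [name for name, count in counts.items() if count > 1]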
Optionally, the information processing apparatus further includes a display module.
And the display module is used for displaying the target entity information in the target display area after the target entity information is determined.
Optionally, the display module is further configured to display, on a target control in the target display area, the target entity information with the highest similarity value, and to hide the remaining target entity information in a menu corresponding to the target control, so that the remaining target entity information is displayed when triggering of the menu is detected.
Optionally, the display module is further configured to hide the remaining target entity information in the menu corresponding to the target control, ordered by the corresponding similarity values, so that when triggering of the menu is detected, the target entity information is displayed in a list in that order.
Optionally, the information processing apparatus further includes a partition module and an update module.
And the partition module is used for determining entity types of the entity information to be matched, and storing the corresponding entity information to be matched in a partition mode according to the data quantity corresponding to each entity type so as to determine the entity information to be processed corresponding to the associated search vocabulary from each partition.
Optionally, the partition module is further configured to determine an actual data amount corresponding to each entity class; storing entity information to be matched corresponding to a target entity class with the actual data quantity smaller than a first preset data quantity threshold value into a first partition; storing entity information to be matched corresponding to a target entity class with the actual data volume larger than the first preset data volume threshold and smaller than the second preset data volume threshold into a second partition; and storing entity information to be matched corresponding to the target entity class with the actual data quantity larger than the second preset data quantity threshold value into a third partition, and storing the entity information in the third partition in a layered manner.
Optionally, the partition module is further configured to store the entity information to be matched in the third partition in a layered manner according to the calling frequency of the entity information to be matched.
The updating module is used for adding the entity information identifier and the partition identifier corresponding to the updated data into the target updating list when detecting that the data of at least one partition is updated; and updating the corresponding entity information in the database according to the update type in the target update list.
According to the technical scheme provided by this embodiment, associated search vocabulary can be determined according to the search vocabulary to be processed, and the associated search vocabulary can be processed based on the target distributed representation model to obtain word vectors to be associated, so that the features of the search vocabulary to be processed are represented in distributed form in the computer. The entity information to be processed is then determined according to the feature vectors stored in advance in the database and the word vectors to be associated, and the target entity information is determined according to the entity information to be processed. In this way, matching results related to the search vocabulary can be recalled with stronger fuzzy matching capability, the large amount of entity information in the database is fully utilized, and the search efficiency and user experience are improved.
The information processing device provided by the embodiment of the disclosure can execute the information processing method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the execution method.
It should be noted that each unit and module included in the above apparatus are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for convenience of distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present disclosure.
Example six
Fig. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the disclosure. Referring now to fig. 6, a schematic diagram of an electronic device (e.g., a terminal device or server in fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage device 608 including, for example, magnetic tape, hard disk, and the like; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or from the storage means 608, or from the ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The electronic device provided by the embodiment of the present disclosure and the information processing method provided by the foregoing embodiment belong to the same inventive concept, and technical details not described in detail in the present embodiment can be referred to the foregoing embodiment, and the present embodiment has the same beneficial effects as the foregoing embodiment.
Example seven
The present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the information processing method provided by the above embodiments.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
determining at least one associated search term associated with the at least one search term to be processed;
aiming at each associated search word, processing the current associated search word based on a pre-trained target distributed representation model to obtain a word vector to be associated of the current associated search word;
Determining entity information to be processed according to at least one feature vector corresponding to the entity information to be matched and the current word vector to be associated, which are stored in a database in advance, aiming at each word vector to be associated;
And determining target entity information according to the entity information to be processed.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided an information processing method, including:
determining at least one associated search term associated with the at least one search term to be processed;
aiming at each associated search word, processing the current associated search word based on a pre-trained target distributed representation model to obtain a word vector to be associated of the current associated search word;
Determining entity information to be processed according to at least one feature vector corresponding to the entity information to be matched and the current word vector to be associated, which are stored in a database in advance, aiming at each word vector to be associated;
And determining target entity information according to the entity information to be processed.
According to one or more embodiments of the present disclosure, there is provided an information processing method [ example two ] further including:
optionally, extracting the at least one search word to be processed according to a preset word extraction rule to obtain at least one search word to be used;
and determining at least one associated search word according to each search word to be used.
According to one or more embodiments of the present disclosure, there is provided an information processing method [ example three ], further comprising:
Optionally, training to obtain the target distributed representation model;
The training results in the target distributed representation model comprising:
Acquiring a first training sample set and a second training sample set; the first training sample set comprises a plurality of first positive sample training data and first negative sample training data, wherein the first positive sample training data comprises pre-training entity information and at least one feature corresponding to the pre-training entity information; the first negative-sample training data includes pre-training entity information and at least one feature not associated with the pre-training entity information;
training the to-be-trained distributed representation model based on the first training sample set to obtain a pre-trained distributed representation model;
Inputting data in a second training sample set into the pre-training distributed representation model to obtain a second training vector corresponding to the data in the second training sample set; wherein the data in the second training sample set comprises actual entity information corresponding to input search information and a negative sampling input corresponding to the actual entity information;
Correcting a preset loss function in the pre-training distributed representation model based on a back propagation algorithm and the second training vector;
and training the pre-training distributed representation model by taking convergence of the preset loss function as a training target so as to obtain the target distributed representation model through training.
According to one or more embodiments of the present disclosure, there is provided an information processing method [ example four ] further including:
Optionally, determining a similarity value between the current word vector to be associated and at least one feature vector corresponding to each entity information to be matched;
And determining the entity information to be processed from the at least one entity information to be matched according to the similarity values of the feature vectors and a preset condition.
According to one or more embodiments of the present disclosure, there is provided an information processing method [ example five ]:
Optionally, taking each entity information to be processed as target entity information;
taking repeated entity information in each piece of entity information to be processed as target entity information;
If the entity information to be processed comprises repeated entity information to be processed, reserving one of the repeated entity information to be processed, and taking the finally reserved entity information to be processed as target entity information;
Determining the target entity information according to the similarity value corresponding to each entity information to be processed and the quantity of the associated search words;
And taking, for each associated search word, the corresponding entity information to be processed with the highest similarity value as target entity information.
According to one or more embodiments of the present disclosure, there is provided an information processing method [ example six ], further comprising:
optionally, the target entity information is presented in a target display area.
According to one or more embodiments of the present disclosure, there is provided an information processing method [ example seventh ], further comprising:
Optionally, displaying the target entity information with the highest similarity value on a target control in the target display area;
And hiding the remaining target entity information in a menu corresponding to the target control, so that the remaining target entity information is displayed when triggering of the menu is detected.
According to one or more embodiments of the present disclosure, there is provided an information processing method [ example eight ]:
optionally, the remaining target entity information is hidden in a menu corresponding to the target control in order of the corresponding similarity values, so that when triggering of the menu is detected, the target entity information is displayed in a list in that order.
According to one or more embodiments of the present disclosure, there is provided an information processing method, further comprising:
Optionally, determining entity types of the entity information to be matched, and storing the corresponding entity information to be matched in a partition according to the data amount corresponding to each entity type, so as to determine the entity information to be processed corresponding to the associated search vocabulary from each partition.
According to one or more embodiments of the present disclosure, there is provided an information processing method, further comprising:
Optionally, determining an actual data volume corresponding to each entity class;
Storing entity information to be matched corresponding to a target entity class with the actual data quantity smaller than a first preset data quantity threshold value into a first partition;
storing entity information to be matched corresponding to a target entity class with the actual data volume larger than the first preset data volume threshold and smaller than the second preset data volume threshold into a second partition;
and storing entity information to be matched corresponding to the target entity class with the actual data quantity larger than the second preset data quantity threshold value into a third partition, and storing the entity information in the third partition in a layered manner.
According to one or more embodiments of the present disclosure, there is provided an information processing method [ example eleven ], further comprising:
optionally, according to the calling frequency of the entity information to be matched, the entity information to be matched is stored in the third partition in a layered manner.
According to one or more embodiments of the present disclosure, there is provided an information processing method [ example twelve ], further comprising:
Optionally, when detecting that at least one partition has data update, adding an entity information identifier and a partition identifier corresponding to the updated data to a target update list;
and updating the corresponding entity information in the database according to the update type in the target update list.
According to one or more embodiments of the present disclosure, there is provided an information processing apparatus [ example thirteenth ], the apparatus including:
an associated search vocabulary determination module for determining at least one associated search vocabulary associated with the at least one search vocabulary to be processed;
The to-be-associated word vector determining module is used for processing the current associated search word according to the pre-trained target distributed representation model aiming at each associated search word to obtain the to-be-associated word vector of the current associated search word;
The entity information to be processed determining module is used for determining entity information to be processed according to at least one feature vector corresponding to the entity information to be matched and the current word vector to be associated, which are stored in a database in advance, for each word vector to be associated;
and the target entity information determining module is used for determining target entity information according to the entity information to be processed.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of the features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, embodiments formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (14)

1. An information processing method, characterized by comprising:
Determining at least one associated search word associated with at least one to-be-processed search word, wherein the associated search word is the to-be-processed search word or a word obtained by disassembling the to-be-processed search word;
aiming at each associated search word, processing the current associated search word based on a pre-trained target distributed representation model to obtain a word vector to be associated of the current associated search word;
Determining entity information to be processed according to at least one feature vector corresponding to the entity information to be matched and the current word vector to be associated, which are stored in a database in advance, aiming at each word vector to be associated;
determining target entity information according to the entity information to be processed;
The information processing method further comprises the following steps:
determining entity types of the entity information to be matched, and storing corresponding entity information to be matched in partitions according to data amounts corresponding to the entity types so as to determine entity information to be processed corresponding to the associated search vocabulary from each partition;
the information processing method further comprises the following steps: training to obtain the target distributed representation model;
The training results in the target distributed representation model comprising:
Acquiring a first training sample set and a second training sample set; the first training sample set comprises a plurality of first positive sample training data and first negative sample training data, wherein the first positive sample training data comprises pre-training entity information and at least one feature corresponding to the pre-training entity information; the first negative-sample training data includes pre-training entity information and at least one feature not associated with the pre-training entity information;
training the to-be-trained distributed representation model based on the first training sample set to obtain a pre-trained distributed representation model;
Inputting data in a second training sample set into the pre-training distributed representation model to obtain a second training vector corresponding to the data in the second training sample set; wherein the data in the second training sample set comprises an actual input, actual entity information corresponding to the actual input, and a negative sampling input corresponding to the actual entity;
and training based on the second training vector to obtain the target distributed representation model.
2. The method of claim 1, wherein the determining at least one associated search term associated with the at least one pending search term comprises:
extracting the at least one search word to be processed according to a preset word extraction rule to obtain at least one search word to be used;
and determining at least one associated search word according to each search word to be used.
3. The method according to claim 1, characterized in that the training based on the second training vector to obtain the target distributed representation model comprises:
Correcting a preset loss function in the pre-training distributed representation model based on a back propagation algorithm and the second training vector;
and training the pre-training distributed representation model by taking convergence of the preset loss function as a training target so as to obtain the target distributed representation model through training.
4. The method according to claim 1, wherein the determining the entity information to be processed according to at least one feature vector corresponding to each entity information to be matched and the current word vector to be associated stored in the database includes:
determining a similarity value between a current word vector to be associated and at least one feature vector corresponding to each entity information to be matched;
And determining the entity information to be processed from the at least one entity information to be matched according to the similarity values of the feature vectors and a preset condition.
5. The method of claim 1, wherein determining target entity information based on each entity information to be processed comprises at least one of:
Taking the information of each entity to be processed as target entity information;
taking repeated entity information in each piece of entity information to be processed as target entity information;
If the entity information to be processed comprises repeated entity information to be processed, reserving one of the repeated entity information to be processed, and taking the finally reserved entity information to be processed as target entity information;
Determining the target entity information according to the similarity value corresponding to each entity information to be processed and the quantity of the associated search words;
And taking, for each associated search word, the corresponding entity information to be processed with the highest similarity value as target entity information.
6. The method of claim 1, further comprising, after determining the target entity information:
the target entity information is presented in a target display area.
7. The method of claim 6, wherein the presenting the target entity information in the target display area comprises:
displaying the target entity information with the highest similarity value on a target control in the target display area;
And hiding the remaining target entity information in a menu corresponding to the target control, so that the remaining target entity information is displayed when triggering of the menu is detected.
8. The method of claim 7, wherein the hiding the remaining target entity information in the menu corresponding to the target control comprises:
And hiding the remaining target entity information in the menu corresponding to the target control in order of the corresponding similarity values, so that the target entity information is displayed in a list in that order when triggering of the menu is detected.
9. The method of claim 1, wherein the storing the corresponding entity information to be matched in partitions according to the data amount corresponding to each entity class comprises:
Determining the actual data quantity corresponding to each entity class;
Storing entity information to be matched corresponding to a target entity class with the actual data quantity smaller than a first preset data quantity threshold value into a first partition;
storing entity information to be matched corresponding to a target entity class with the actual data volume larger than the first preset data volume threshold and smaller than the second preset data volume threshold into a second partition;
and storing entity information to be matched corresponding to the target entity class with the actual data quantity larger than the second preset data quantity threshold value into a third partition, and storing the entity information in the third partition in a layered manner.
10. The method of claim 9, wherein the hierarchically storing in the third partition comprises:
And according to the calling frequency of the entity information to be matched, storing the entity information to be matched in the third partition in a layered manner.
11. The method as recited in claim 1, further comprising:
When at least one partition is detected to have data update, adding an entity information identifier and a partition identifier corresponding to the updated data into a target update list;
and updating the corresponding entity information in the database according to the update type in the target update list.
12. An information processing apparatus, characterized by comprising:
The related search vocabulary determining module is used for determining at least one related search vocabulary related to at least one to-be-processed search vocabulary, wherein the related search vocabulary is the to-be-processed search vocabulary or a vocabulary obtained by disassembling the to-be-processed search vocabulary;
The to-be-associated word vector determining module is used for processing the current associated search word according to the pre-trained target distributed representation model aiming at each associated search word to obtain the to-be-associated word vector of the current associated search word;
The entity information to be processed determining module is used for determining entity information to be processed according to at least one feature vector corresponding to the entity information to be matched and the current word vector to be associated, which are stored in a database in advance, for each word vector to be associated;
The target entity information determining module is used for determining target entity information according to each entity information to be processed;
The information processing device further comprises a partition module, wherein the partition module is used for determining entity types of the entity information to be matched, and storing the corresponding entity information to be matched in a partition mode according to data quantity corresponding to each entity type so as to determine the entity information to be processed corresponding to the associated search vocabulary from each partition;
the word vector determining module to be associated comprises a training sample set obtaining unit, a pre-training distributed representation model determining unit, a second training vector determining unit and a target distributed representation model training unit; wherein,
The training sample set acquisition unit is used for acquiring a first training sample set and a second training sample set; the first training sample set comprises a plurality of first positive sample training data and first negative sample training data, wherein the first positive sample training data comprises pre-training entity information and at least one feature corresponding to the pre-training entity information; the first negative-sample training data includes pre-training entity information and at least one feature not associated with the pre-training entity information;
The pre-training distributed representation model determining unit is used for training a distributed representation model to be trained based on the first training sample set to obtain a pre-training distributed representation model;
The second training vector determining unit is configured to input data in a second training sample set into the pre-training distributed representation model, to obtain a second training vector corresponding to the data in the second training sample set; wherein the data in the second training sample set comprises an actual input, actual entity information corresponding to the actual input, and a negative sampling input corresponding to the actual entity;
The target distributed representation model training unit is used for obtaining the target distributed representation model based on the second training vector training.
13. An electronic device, the electronic device comprising:
one or more processors;
Storage means for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the information processing method of any of claims 1-11.
14. A storage medium containing computer executable instructions for performing the information processing method of any of claims 1-11 when executed by a computer processor.
CN202110633649.7A 2021-06-07 2021-06-07 Information processing method, information processing device, electronic equipment and storage medium Active CN113239257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633649.7A CN113239257B (en) 2021-06-07 2021-06-07 Information processing method, information processing device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110633649.7A CN113239257B (en) 2021-06-07 2021-06-07 Information processing method, information processing device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113239257A CN113239257A (en) 2021-08-10
CN113239257B true CN113239257B (en) 2024-05-14

Family

ID=77137084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633649.7A Active CN113239257B (en) 2021-06-07 2021-06-07 Information processing method, information processing device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113239257B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407280A (en) * 2016-08-26 2017-02-15 合网络技术(北京)有限公司 Query target matching method and device
CN108959252A (en) * 2018-06-28 2018-12-07 中国人民解放军国防科技大学 Semi-supervised Chinese named entity recognition method based on deep learning
CN109241294A (en) * 2018-08-29 2019-01-18 国信优易数据有限公司 A kind of entity link method and device
CN109446399A (en) * 2018-10-16 2019-03-08 北京信息科技大学 A kind of video display entity search method
CN109902156A (en) * 2019-01-09 2019-06-18 北京小乘网络科技有限公司 Entity search method, storage medium and electronic equipment
CN110162782A (en) * 2019-04-17 2019-08-23 平安科技(深圳)有限公司 Entity extraction method, apparatus, equipment and storage medium based on Medical Dictionary
CN111310456A (en) * 2020-02-13 2020-06-19 支付宝(杭州)信息技术有限公司 Entity name matching method, device and equipment
CN111753551A (en) * 2020-06-29 2020-10-09 北京字节跳动网络技术有限公司 Information generation method and device based on word vector generation model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437233B2 (en) * 2017-07-20 2019-10-08 Accenture Global Solutions Limited Determination of task automation using natural language processing

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407280A (en) * 2016-08-26 2017-02-15 合网络技术(北京)有限公司 Query target matching method and device
CN108959252A (en) * 2018-06-28 2018-12-07 中国人民解放军国防科技大学 Semi-supervised Chinese named entity recognition method based on deep learning
CN109241294A (en) * 2018-08-29 2019-01-18 国信优易数据有限公司 A kind of entity link method and device
CN109446399A (en) * 2018-10-16 2019-03-08 北京信息科技大学 A kind of video display entity search method
CN109902156A (en) * 2019-01-09 2019-06-18 北京小乘网络科技有限公司 Entity search method, storage medium and electronic equipment
CN110162782A (en) * 2019-04-17 2019-08-23 平安科技(深圳)有限公司 Entity extraction method, apparatus, equipment and storage medium based on Medical Dictionary
CN111310456A (en) * 2020-02-13 2020-06-19 支付宝(杭州)信息技术有限公司 Entity name matching method, device and equipment
CN111753551A (en) * 2020-06-29 2020-10-09 北京字节跳动网络技术有限公司 Information generation method and device based on word vector generation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CorDEL: A Contrastive Deep Learning Approach for Entity Linkage; Zhengyang Wang et al.; 2020 IEEE International Conference on Data Mining; 2021-01-09; 1-18 *
New word discovery and answer ranking methods in automatic question answering for financial knowledge; Zhang Chang; China Master's Theses Full-text Database, Information Science and Technology Series; 2019-01-15; I138-5142 *

Also Published As

Publication number Publication date
CN113239257A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
JP6718828B2 (en) Information input method and device
CN110019732B (en) Intelligent question answering method and related device
CN106776544B (en) Character relation recognition method and device and word segmentation method
CN112115706B (en) Text processing method and device, electronic equipment and medium
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
CN110969012B (en) Text error correction method and device, storage medium and electronic equipment
JP2021089739A (en) Question answering method and language model training method, apparatus, device, and storage medium
WO2021135319A1 (en) Deep learning based text generation method and apparatus and electronic device
CN116521841B (en) Method, device, equipment and medium for generating reply information
US11669679B2 (en) Text sequence generating method and apparatus, device and medium
CN114861889B (en) Deep learning model training method, target object detection method and device
CN111104516B (en) Text classification method and device and electronic equipment
WO2024036616A1 (en) Terminal-based question and answer method and apparatus
CN110738056B (en) Method and device for generating information
WO2024179519A1 (en) Semantic recognition method and apparatus
CN112069786A (en) Text information processing method and device, electronic equipment and medium
CN113239257B (en) Information processing method, information processing device, electronic equipment and storage medium
CN114881008B (en) Text generation method and device, electronic equipment and medium
CN116187301A (en) Model generation method, entity identification device, electronic equipment and storage medium
CN111221424B (en) Method, apparatus, electronic device, and computer-readable medium for generating information
CN114218431A (en) Video searching method and device, electronic equipment and storage medium
CN109857838B (en) Method and apparatus for generating information
CN113822039A (en) Method and related equipment for mining similar meaning words
CN117892724B (en) Text detection method, device, equipment and storage medium
CN112148751A (en) Method and device for querying data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant