CN110263347A - Method and related apparatus for constructing synonyms - Google Patents

Method and related apparatus for constructing synonyms

Info

Publication number
CN110263347A
CN110263347A (application CN201910570705.XA)
Authority
CN
China
Prior art keywords
word
corpus
word vector
synonym
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910570705.XA
Other languages
Chinese (zh)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910570705.XA priority Critical patent/CN110263347A/en
Publication of CN110263347A publication Critical patent/CN110263347A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

This application discloses a method and related apparatus for constructing synonyms that do not rely on abundant user behavior and are therefore widely applicable. The method includes: obtaining a first corpus that has undergone word segmentation, the first corpus being a set of sentences containing synonyms; processing the first corpus according to a preset word-vector computation model to obtain a word vector representing each word in the first corpus; processing the word vectors according to a preset supervised learning model, the supervised learning model being used to raise the vector similarity between semantically identical words; and determining synonyms according to the word vectors processed by the supervised learning model.

Description

Method and related apparatus for constructing synonyms
Technical field
This application relates to the technical field of data processing, and in particular to a method and related apparatus for constructing synonyms.
Background technique
A synonym dictionary is a common query tool. At present there are mainly two methods of building one. In the first, linguists manually compile entries according to the glosses of a modern Chinese dictionary. In the second, a modern search engine performs automatic computer alignment according to its search click logs to obtain synonym pairs, and a synonym dictionary is finally obtained by manual pruning.
A search click log is the set of different query corpora through which the same document was clicked. For example, suppose users search for and click the same document through two queries that phrase "how to make grilled chicken wings tasty" in two different ways. Those two query corpora then constitute a search click log. The existing method of building a synonym dictionary processes the two queries with word-alignment techniques and eventually determines that the two phrasings of "how" form a synonym pair.
Clearly, the above method is applicable only when the same document has enough user behavior, that is, when different query corpora have actually been used on it. For example, if no user has clicked the same document through both the queries "National Day" and "October 1st", it is difficult to discover that "National Day" and "October 1st" are synonyms.
Summary of the invention
The embodiments of the present application provide a method and related apparatus for constructing synonyms that do not need to rely on abundant user behavior and are therefore widely applicable.
In view of this, a first aspect of the application provides a method for constructing synonyms, comprising:
obtaining a first corpus that has undergone word segmentation, the first corpus being a set of sentences containing synonyms;
processing the first corpus according to a preset word-vector computation model to obtain a word vector representing each word in the first corpus;
processing the word vectors according to a preset supervised learning model, the supervised learning model being used to raise the vector similarity between semantically identical words;
and determining synonyms according to the word vectors processed by the supervised learning model.
A second aspect of the application provides an apparatus for constructing synonyms, comprising:
an acquiring unit for obtaining a first corpus that has undergone word segmentation, the first corpus being a set of sentences containing synonyms;
a first processing unit for processing the first corpus according to a preset word-vector computation model to obtain a word vector representing each word in the first corpus;
a second processing unit for processing the word vectors according to a preset supervised learning model, the supervised learning model being used to raise the vector similarity between semantically identical words;
and a determination unit for determining synonyms according to the word vectors processed by the supervised learning model.
In one possible design, in a first implementation of the second aspect of the embodiments of the present application,
the supervised learning model is a deep learning model based on the Triplet Loss loss function;
the second processing unit is configured to choose at least one group of word vectors from all the word vectors, each group containing three word vectors, and to use the three word vectors in each group as the positive example, the negative example and the anchor in the deep learning model, where the word represented by the anchor and the word represented by the positive example are semantically identical and the word represented by the anchor and the word represented by the negative example are semantically different;
and to process each group of word vectors according to the deep learning model.
In one possible design, in a second implementation of the second aspect of the embodiments of the present application,
the second processing unit is configured to:
select multiple groups of words from all the words of the first corpus, each group containing a second word, a third word and a fourth word;
obtain a second corpus containing the second word, a third corpus containing the third word and a fourth corpus containing the fourth word;
if the corpus similarity between the second corpus and the third corpus is greater than a preset first similarity, take the word vector representing the second word as the positive example and the word vector representing the third word as the anchor;
if the corpus similarity between the third corpus and the fourth corpus is less than a preset second similarity, take the word vector representing the fourth word as the negative example;
and process each group of word vectors according to the deep learning model.
In one possible design, in a third implementation of the second aspect of the embodiments of the present application,
the corpus similarity is the ratio of a first number to a second number, the first number being the number of times the same article was searched for and clicked through different corpora, and the second number being the number of times the same article was searched for through different corpora.
In one possible design, in a fourth implementation of the second aspect of the embodiments of the present application,
the corpus similarity is the number of times the same article was searched for and clicked through different corpora.
In one possible design, in a fifth implementation of the second aspect of the embodiments of the present application,
the second processing unit is configured to:
randomly select a word from the first corpus as the third word;
select from the first corpus two words whose word-vector similarity with the third word is greater than a preset third similarity, as the second word and the fourth word respectively;
obtain a second corpus containing the second word, a third corpus containing the third word and a fourth corpus containing the fourth word;
if the corpus similarity between the second corpus and the third corpus is greater than a preset first similarity, take the word vector representing the second word as the positive example and the word vector representing the third word as the anchor;
if the corpus similarity between the third corpus and the fourth corpus is less than a preset second similarity, take the word vector representing the fourth word as the negative example;
and process each group of word vectors according to the deep learning model.
In one possible design, in a sixth implementation of the second aspect of the embodiments of the present application,
the second processing unit is configured to:
randomly select a word from the first corpus as the third word;
select from the first corpus two words whose word-vector similarity with the third word is greater than a preset third similarity, as the second word and the fourth word respectively;
obtain from the search click logs all corpora containing the second word, all corpora containing the third word and all corpora containing the fourth word;
process those three sets of corpora respectively according to a preset automatic alignment algorithm to obtain a second corpus containing the second word, a third corpus containing the third word and a fourth corpus containing the fourth word, such that in any two of the second, third and fourth corpora all words other than the second, third and fourth words are identical;
if the corpus similarity between the second corpus and the third corpus is greater than a preset first similarity, take the word vector representing the second word as the positive example and the word vector representing the third word as the anchor;
if the corpus similarity between the third corpus and the fourth corpus is less than a preset second similarity, take the word vector representing the fourth word as the negative example;
and process each group of word vectors according to the deep learning model.
In one possible design, in a seventh implementation of the second aspect of the embodiments of the present application,
the determination unit is configured to:
obtain the cosine similarity of two word vectors processed by the supervised learning model;
and, if the cosine similarity is greater than a preset fourth similarity, determine that the words represented by the two word vectors are synonyms.
In one possible design, in an eighth implementation of the second aspect of the embodiments of the present application,
the word-vector computation model is the word2vec model.
A third aspect of the embodiments of the present application provides a terminal device, comprising a memory, a transceiver, a processor and a bus system, wherein the memory is used to store a program and the processor is used to execute the instructions in the memory, so that the device realizes the functions of the synonym construction apparatus of the aforementioned second aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium in which instructions are stored; when run on a computer, the instructions cause the computer to perform the functions of the synonym construction apparatus of the aforementioned second aspect.
As can be seen from the above technical solutions, the embodiments of the present application have the following advantage:
A first corpus that has undergone word segmentation is first obtained, the first corpus being a set of sentences containing synonyms; the first corpus is then processed according to a preset word-vector computation model to obtain a word vector representing each word in the first corpus, the similarity between word vectors representing the degree of correlation between words; the word vectors are further processed according to a preset supervised learning model, so that the vector similarity between semantically identical words is raised; finally, synonyms are determined according to the word vectors processed by the supervised learning model. Because this determination of synonyms is realized through the word-vector computation model, it does not need to rely on abundant user behavior: even if no user has clicked the same document through both the queries "National Day" and "October 1st", the two can still be recognized as synonyms as long as the first corpus is large enough.
Brief description of the drawings
Fig. 1 is an architecture diagram of the synonym construction system in an embodiment of the present application;
Fig. 2 is a schematic diagram of an embodiment of the synonym construction method provided by the embodiments of the present application;
Fig. 3 is a schematic diagram of an embodiment of processing word vectors according to a preset supervised learning model in an embodiment of the present application;
Fig. 4 is a schematic diagram of an embodiment of choosing at least one group of word vectors in an embodiment of the present application;
Fig. 5 is a schematic diagram of an embodiment of the method of selecting multiple groups of words from all the words of the first corpus in an embodiment of the present application;
Fig. 6 is a structural diagram of an embodiment of the synonym construction apparatus provided by the embodiments of the present application;
Fig. 7 is a structural diagram of an embodiment of the terminal device provided by the embodiments of the present application.
Detailed description of the embodiments
The embodiments of the present application provide a method and related apparatus for constructing synonyms that do not need to rely on abundant user behavior and are therefore widely applicable.
The terms "first", "second", "third", "fourth" and the like (if present) in the description, the claims and the above drawings of this application are used to distinguish similar objects and need not describe a particular order or sequence. It should be understood that data so termed are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described here. In addition, the terms "comprise" and "correspond to" and any variants of them are intended to cover a non-exclusive inclusion: for example, a process, method, system, product or device containing a series of steps or units is not necessarily limited to the steps or units clearly listed, but may contain other steps or units not clearly listed or inherent to that process, method, product or device.
It should be understood that the present application proposes a method for constructing synonyms that is applied to the synonym construction system shown in Fig. 1. Referring to Fig. 1, an architecture diagram of the synonym construction system in an embodiment of the present application, the construction method provided by the present application can be applied to a server and also to a terminal device, where terminal devices include but are not limited to tablet computers, laptops, palmtop computers, mobile phones, voice interaction devices and personal computers (PC), without limitation here.
For ease of understanding, referring to Fig. 2, a schematic diagram of an embodiment of the synonym construction method provided by the embodiments of the present application, the method comprises:
201. Obtain a first corpus that has undergone word segmentation, the first corpus being a set of sentences containing synonyms.
In the embodiments of the present application the first corpus may contain diverse forms such as words, phrases and long sentences; for example, it may contain the word "National Day", the phrase "Happy National Day", or the long sentence "where shall we all go for the National Day holiday".
Since the objects processed by the application are words, the first corpus obtained has undergone word segmentation. Word segmentation belongs to the prior art and is not detailed here; after segmentation, the phrase "Happy National Day" becomes the two words "National Day" and "happy". To reduce the difficulty of segmentation and improve its accuracy, the embodiments may select article titles or official-account titles, because titles have a comparatively standard format and are easy to recognize.
In addition, "containing synonym sentences" means that there are pairs of sentences containing synonyms; for example, "Happy October 1st" and "Happy National Day" contain a pair of synonyms, "October 1st" and "National Day".
It will be understood that, in order to construct synonyms, the number of synonym sentences in the first corpus should generally be sufficient.
202. Process the first corpus according to a preset word-vector computation model to obtain a word vector representing each word in the first corpus.
A word vector maps a word or phrase of the vocabulary to a vector of real numbers. There are many word-vector computation models, including probabilistic models and neural networks. In the embodiments of the present application the word-vector computation model can measure the similarity between words according to features such as context and word sense and then map the words to word vectors, so the similarity between word vectors can be used to measure the degree of correlation between the corresponding words.
203. Process the word vectors according to a preset supervised learning model, the supervised learning model being used to raise the vector similarity between semantically identical words.
It should be noted that the word-vector similarity obtained from the word-vector computation model can indicate the degree of correlation between words. The following table gives an example of the word-vector similarities obtained after processing the first corpus with the word-vector computation model.
In the table, the first two columns list some words related to the word "technology" together with the similarity between each word's vector and the vector of "technology"; the numerical values in the table represent the word-vector similarity. Similarly, the next two columns list some words related to the word "Spring Festival" together with the similarity between each word's vector and the vector of "Spring Festival".
Taking the word "Spring Festival" as an example, it can be seen from the similarities that "New Year's Day", "National Day" and "New Year" are all close to "Spring Festival". If synonyms were determined from these similarities alone, "Spring Festival" and "National Day" would be wrongly judged to be synonyms. It can therefore be seen that determining synonyms directly from the word vectors produced by the word-vector computation model carries a certain error.
Therefore, the embodiments of the present application use a supervised learning model to perform secondary processing on the word vectors. It should be noted that the supervised learning model processes the word vectors on the basis of supervised learning so that the magnitude or direction of the word vectors changes, thereby raising the vector similarity between semantically identical words. There are many types of supervised learning model, which are not specifically limited here.
For example, suppose that after processing by the word-vector computation model the vector of "Spring Festival" is (1, 2) and the vector of "New Year" is (3, 2); after processing by the supervised learning model, the vector of "Spring Festival" may still be (1, 2) while the vector of "New Year" becomes (2, 2). The vector similarity between "Spring Festival" and "New Year" becomes larger, which helps to recognize this pair of synonyms.
It will be appreciated that the vectors in the example above contain only two dimensions; in practical applications, because the vocabulary is large, the dimensionality of the word vectors can be much greater than two.
The following table gives an example of the word-vector similarities obtained after processing with the supervised learning model.
As can be seen from the table, the vector similarity between "Spring Festival" and "New Year" rises from the original 0.757 to 0.941; at the same time, the vector similarity between "National Day" and "Spring Festival" becomes lower as the vectors are adjusted, so that "National Day" no longer appears in the list.
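The numeric adjustment described above can be checked directly; the sketch below uses the example's hypothetical two-dimensional vectors (1, 2), (3, 2) and (2, 2), which are illustrative values from the text rather than vectors from any real experiment:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

spring_festival = (1, 2)   # vector of "Spring Festival" (unchanged by the supervised model)
new_year_before = (3, 2)   # vector of "New Year" before the supervised adjustment
new_year_after = (2, 2)    # vector of "New Year" after the supervised adjustment

before = cosine(spring_festival, new_year_before)
after = cosine(spring_festival, new_year_after)
print(round(before, 3), round(after, 3))  # 0.868 0.949: the adjustment raises the pair's similarity
```

The exact values here differ from the 0.757 and 0.941 quoted for the real model (those came from high-dimensional vectors), but the direction of the change is the same: moving the synonym's vector closer to the anchor raises the cosine similarity.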
204. Determine synonyms according to the word vectors processed by the supervised learning model.
It should be noted that because there are many ways to express word-vector similarity, there are also many ways to determine synonyms from the word vectors; no specific limitation is made here.
In the embodiments of the present application the determination of synonyms is realized through the word-vector computation model and does not need to rely on abundant user behavior: even if no user has clicked the same document through both the queries "National Day" and "October 1st", the two can still be recognized as synonyms as long as the first corpus is large enough.
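One concrete realization of step 204, matching the cosine-threshold variant described in the seventh implementation above, can be sketched as follows; the word vectors and the preset "fourth similarity" of 0.9 are invented for illustration:

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical word vectors after supervised-learning processing.
vectors = {
    "Spring Festival": (1.0, 2.0),
    "New Year": (1.1, 2.1),
    "National Day": (2.5, 0.5),
}
FOURTH_SIMILARITY = 0.9  # hypothetical preset threshold

# Any pair whose cosine similarity exceeds the threshold is taken as a synonym pair.
synonyms = [(w1, w2) for w1, w2 in combinations(vectors, 2)
            if cosine(vectors[w1], vectors[w2]) > FOURTH_SIMILARITY]
print(synonyms)  # only the nearly parallel pair survives the threshold
```

With these made-up vectors only ("Spring Festival", "New Year") passes the threshold, while "National Day" is excluded, mirroring the behaviour the patent describes after the supervised adjustment.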
Further, referring to Fig. 3, a schematic diagram of an embodiment of processing word vectors according to a preset supervised learning model in the embodiments of the present application, which specifically includes:
The supervised learning model can be a deep learning model based on the Triplet Loss loss function; the embodiments place no limitation on the specific network structure of the deep learning model.
When the supervised learning model is a deep learning model based on Triplet Loss, processing the word vectors according to the preset supervised learning model comprises:
301. Choose at least one group of word vectors from all the word vectors, each group containing three word vectors, and use the three word vectors in each group as the positive example, the negative example and the anchor of the deep learning model; the word represented by the anchor and the word represented by the positive example are semantically identical, while the word represented by the anchor and the word represented by the negative example are semantically different.
It should be noted that in the Triplet Loss function the positive example, the negative example and the anchor can each be expressed by certain features. The loss function makes the distance between the feature representations of the positive example and the anchor as small as possible and the distance between the negative example and the anchor as large as possible, while requiring a minimum margin between those two distances; the size of the margin can be set according to actual needs.
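The behaviour of the Triplet Loss just described can be sketched in a few lines; the vectors, the choice of Euclidean distance and the margin of 1.0 are illustrative assumptions, not parameters specified by the patent:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # max(d(anchor, positive) - d(anchor, negative) + margin, 0): the loss is zero only
    # when the negative is at least `margin` farther from the anchor than the positive is.
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = (1.0, 2.0)     # e.g. the vector of "Spring Festival"
positive = (1.2, 2.1)   # a semantically identical word, e.g. "New Year"
negative = (1.5, 1.8)   # a semantically different word that is still too close, e.g. "National Day"

loss = triplet_loss(anchor, positive, negative)
print(loss > 0)  # True: training would adjust the vectors to push this loss toward zero
```

When the negative is already far enough away, the loss is exactly zero and the triple contributes no gradient; training therefore concentrates on triples like the one above where a non-synonym still sits too close to the anchor.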
In the embodiments of the present application, if only one group of word vectors is chosen, only that group can be adjusted. To make the determined synonyms more accurate, as many word vectors as possible need to be adjusted, so multiple groups of word vectors should be selected where possible.
302. Process each group of word vectors according to the deep learning model.
It will be understood that through the processing of the deep learning model, the vector similarity of two words with high similarity is raised, and the vector similarity of two words with low similarity is reduced.
Further, there are many ways to choose at least one group of word vectors from all the word vectors and use the three vectors in each group as the positive example, the negative example and the anchor of the deep learning model. Provided that the word represented by the anchor is semantically identical to the word represented by the positive example and semantically different from the word represented by the negative example, the triple can be entered manually, for example by manually designating the vectors of the three words "New Year", "National Day" and "Spring Festival" as positive example, negative example and anchor respectively, or it can be chosen according to certain rules. One embodiment is introduced below; referring to Fig. 4, a schematic diagram of an embodiment of choosing at least one group of word vectors in the embodiments of the present application, which specifically includes:
401. Select multiple groups of words from all the words of the first corpus, each group containing a second word, a third word and a fourth word.
It should be noted that the groups of words may be selected randomly or according to certain rules; the embodiments do not specifically limit the selection method.
402. Obtain a second corpus containing the second word, a third corpus containing the third word and a fourth corpus containing the fourth word.
In the embodiments of the present application the second, third and fourth corpora may each take the form of a word, a phrase or a sentence. For example, if the second word is "Mid-Autumn", the second corpus obtained can be "Mid-Autumn", "Happy Mid-Autumn Festival", "Mid-Autumn reunion" or "Mid-Autumn is coming".
It should be noted that the embodiments also place no limitation on the specific method of obtaining the second, third and fourth corpora or on their source; they can, for example, be selected from a database or crawled from web pages.
403. If the corpus similarity between the second corpus and the third corpus is greater than a preset first similarity, take the word vector representing the second word as the positive example and the word vector representing the third word as the anchor.
It is noted that corpus similarity has several methods of calculation, which the embodiments do not limit.
In the embodiments of the present application the judgment is made according to the corpora in which the words appear: if the corpus similarity between the second and third corpora is greater than the first similarity, the two corpora are considered to express the same meaning, the second and third words are in turn considered synonymous, and the two word vectors corresponding to the second and third words are used as the positive example and the anchor.
404. If the corpus similarity between the third corpus and the fourth corpus is less than a preset second similarity, take the word vector representing the fourth word as the negative example.
Because the preceding step has already determined the third word as the anchor, if the corpus similarity between the third and fourth corpora is less than the preset second similarity, the two corpora are considered to express different meanings, and the word vector corresponding to the fourth word is in turn taken as the negative example.
It should be noted that the first similarity and the second similarity can be adjusted according to actual needs and may be the same or different. For example, to guarantee the processing effect of the supervised learning model the anchor and the positive example must be synonyms, so the first similarity can be set as high as possible, while the second similarity can equal the first similarity or be lower.
In addition, in this embodiment the positive example and the anchor are determined first and the negative example afterwards; in practice the negative example and the anchor can also be determined first and the positive example afterwards, the determination process being similar and not detailed here.
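The role assignment in steps 403 and 404 can be sketched as follows; the threshold values and the similarity numbers fed in are hypothetical, chosen only to exercise both rules:

```python
# Hypothetical preset thresholds.
FIRST_SIMILARITY = 0.3   # above this, two corpora are taken to express the same meaning (step 403)
SECOND_SIMILARITY = 0.1  # below this, two corpora are taken to express different meanings (step 404)

def assign_roles(sim_second_third, sim_third_fourth):
    """Assign positive example, anchor and negative example from the two corpus
    similarities; a role stays None when its threshold test fails."""
    positive = anchor = negative = None
    if sim_second_third > FIRST_SIMILARITY:       # step 403
        positive, anchor = "second word", "third word"
    if sim_third_fourth < SECOND_SIMILARITY:      # step 404
        negative = "fourth word"
    return positive, anchor, negative

# E.g. corpus similarity 1/3 between the 2nd and 3rd corpora, 0.05 between the 3rd and 4th.
roles = assign_roles(1 / 3, 0.05)
print(roles)  # ('second word', 'third word', 'fourth word')
```

Only groups in which all three roles are filled yield a usable triple for the Triplet Loss model; groups failing either threshold would simply be skipped.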
Further, the corpus similarity can be the ratio of a first number to a second number, the first number being the number of times the same article was searched for and clicked through different corpora, and the second number being the number of times the same article was searched for through different corpora.
In the embodiments of the present application the second, third and fourth corpora can be obtained from the search click logs. Taking the second and third corpora as an example, suppose they are "where to go for National Day" and "where to go for Spring Festival"; the articles searched for and clicked through these two corpora can then be looked up in the search click logs. For example, if 18 of the articles found with "where to go for National Day" and "where to go for Spring Festival" are the same, the second number is 18; if 6 of the articles found and clicked with the two queries are the same, the first number is 6, and the corpus similarity of the second and third corpora is then 1/3.
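The 1/3 ratio in this example can be reproduced with set arithmetic over click-log entries; the article identifiers below are invented so that exactly 18 articles are searched in common and 6 are clicked in common:

```python
def corpus_similarity(searched_a, clicked_a, searched_b, clicked_b):
    """Ratio of the first number (same articles searched AND clicked through both
    corpora) to the second number (same articles merely searched through both)."""
    second_number = len(searched_a & searched_b)
    first_number = len(clicked_a & clicked_b)
    return first_number / second_number

# Hypothetical article ids from the click log for the two queries.
searched_q1 = set(range(0, 30))    # results for "where to go for National Day"
searched_q2 = set(range(12, 42))   # results for "where to go for Spring Festival"
clicked_q1 = {5, 12, 13, 14, 15, 16, 17}    # articles actually clicked for query 1
clicked_q2 = {12, 13, 14, 15, 16, 17, 35}   # articles actually clicked for query 2

sim = corpus_similarity(searched_q1, clicked_q1, searched_q2, clicked_q2)
print(sim)  # 18 articles searched in common, 6 clicked in common -> 1/3
```

A real implementation would aggregate these sets per query from the search click logs rather than hard-code them; the division assumes at least one commonly searched article exists.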
Further, the corpus similarity may simply be the number of times the same article was searched for and clicked through different corpora.
It will be understood that, following the preceding example, under this definition the corpus similarity of the second and third corpora is then 6; in practical applications this count can be converted into a decimal smaller than 1 and the decimal used as the corpus similarity.
Further, there are many ways to select multiple groups of words from all the words of the first corpus, one of which is introduced below in the embodiment of the present application. Referring to Fig. 5, a schematic diagram of an embodiment of a method for selecting multiple groups of words from all the words of the first corpus in the embodiment of the present application, the method comprises:
501. Randomly select a word from the first corpus as the third word.
502. Select, from the first corpus, two words whose term vector similarity with the third word is greater than a preset third similarity, respectively as the second word and the fourth word.
It is understood that if the term vector similarity between the second word and the third word is low, it can already be identified from the term vector similarity alone that the correlation between the second word and the third word is low, i.e., that they are not synonyms; composing the positive example, negative example and anchor positive example from the term vectors of low-similarity words therefore has little effect. For this reason, the embodiment of the present application selects three words with high term vector similarity as the second word, the third word and the fourth word, ensuring that the finally determined positive example, negative example and anchor positive example have high term vector similarity, so that the term vector similarity of synonyms can be raised as much as possible while the term vector similarity of non-synonyms is reduced.
For example, suppose the third word is "Spring Festival", the term vector similarity between "New Year's Day" and "Spring Festival" is 0.777, and the term vector similarity between "National Day" and "Spring Festival" is 0.775. Since these two similarities are very close, judging from the term vector similarity alone which of "Spring Festival", "New Year's Day" and "National Day" are synonyms is prone to error. To further determine which two of these three words are synonymous, the three words can be selected respectively as the second word, the third word and the fourth word; once the term vectors of the second word, the third word and the fourth word serve as one group of positive example, negative example and anchor positive example, the effect of raising the term vector similarity of synonyms and reducing the term vector similarity of non-synonyms becomes pronounced.
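Steps 501–502 can be sketched as follows. This is a hedged illustration: cosine similarity is assumed as the term vector similarity, and the vocabulary, vectors and threshold below are made-up examples, not values from the application.

```python
import math

def cosine(u, v):
    # Cosine similarity between two term vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pick_candidates(term_vectors, third_word, third_similarity):
    """Return words whose term vector similarity with `third_word`
    exceeds the preset third similarity, most similar first (step 502).
    The top two returned words play the roles of the second word and
    the fourth word."""
    anchor = term_vectors[third_word]
    scored = [(cosine(vec, anchor), w)
              for w, vec in term_vectors.items() if w != third_word]
    return [w for s, w in sorted(scored, reverse=True) if s > third_similarity]
```

In the Spring Festival example, "New Year's Day" and "National Day" would be the two top-ranked candidates, and only words above the preset third similarity survive the filter.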
Further, obtaining the second corpus comprising the second word, the third corpus comprising the third word and the fourth corpus comprising the fourth word may include:
obtaining, from a search click log, all corpora comprising the second word, all corpora comprising the third word and all corpora comprising the fourth word;
processing, according to a preset automatic alignment algorithm, all corpora comprising the second word, all corpora comprising the third word and all corpora comprising the fourth word respectively, to obtain the second corpus comprising the second word, the third corpus comprising the third word and the fourth corpus comprising the fourth word, where for any two corpora among the second corpus, the third corpus and the fourth corpus, all words other than the second word, the third word and the fourth word are the same.
In the embodiment of the present application, the automatic alignment algorithm selects from all the corpora so that the second corpus, the third corpus and the fourth corpus belong to the same type of corpus. For example, suppose the second word, the third word and the fourth word are respectively "Spring Festival", "National Day" and "New Year"; the corpora comprising "Spring Festival" are "Happy Spring Festival" and "go home for the Spring Festival"; the corpora comprising "National Day" are "Happy National Day" and "National Day military parade"; and the corpora comprising "New Year" are "set off firecrackers for the New Year" and "Happy New Year". After identification by the automatic alignment algorithm, the second corpus, the third corpus and the fourth corpus obtained are "Happy Spring Festival", "Happy National Day" and "Happy New Year", i.e., for any two corpora among the second corpus, the third corpus and the fourth corpus, all words other than the second word, the third word and the fourth word are identical. This prevents words other than the second word, the third word and the fourth word from influencing the term vector similarity among the second word, the third word and the fourth word.
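The alignment step can be illustrated with a simplified sketch. The application does not disclose the actual automatic alignment algorithm; masking the target word and matching the remaining template is one plausible reading, and all names below are hypothetical.

```python
def template(tokens, target):
    # Replace the target word with a placeholder so corpora can be compared.
    return tuple("<slot>" if t == target else t for t in tokens)

def align(corpora_a, word_a, corpora_b, word_b, corpora_c, word_c):
    """Pick one corpus per word such that, apart from the target words,
    all other words are identical (a simplified stand-in for the patent's
    automatic alignment algorithm). Corpora are given as token lists."""
    templates_b = {template(c, word_b): c for c in corpora_b}
    templates_c = {template(c, word_c): c for c in corpora_c}
    for ca in corpora_a:
        t = template(ca, word_a)
        if t in templates_b and t in templates_c:
            return ca, templates_b[t], templates_c[t]
    return None
```

On the holiday example, only the "Happy <festival>" corpora share a template, so the aligned triple is "Happy Spring Festival", "Happy National Day", "Happy New Year".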
Further, the vector similarity can take many forms, including the Pearson correlation coefficient, the Euclidean distance and the cosine similarity. The application takes the cosine similarity as an example; when the vector similarity is the cosine similarity, determining synonyms according to the term vectors processed by the supervised learning model may include:
obtaining the cosine similarity of two term vectors processed by the supervised learning model;
if the cosine similarity is greater than a preset fourth similarity, the two term vectors represent synonyms.
The preset fourth similarity can be set according to actual needs.
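The determination step can be sketched as follows; this is a hedged illustration, and the default preset fourth similarity of 0.8 is an invented placeholder, not a value disclosed by the application.

```python
import math

def is_synonym(vec_a, vec_b, fourth_similarity=0.8):
    """Judge two words synonymous when the cosine similarity of their
    (supervised-learning-processed) term vectors exceeds the preset
    fourth similarity. The 0.8 default is illustrative only."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = (math.sqrt(sum(a * a for a in vec_a))
            * math.sqrt(sum(b * b for b in vec_b)))
    return dot / norm > fourth_similarity
```

In practice the threshold would be tuned on held-out synonym pairs, as the text notes that it "can be set according to actual needs".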
Further, the term vector computation model may be a word2vec model.
The construction device of the synonym in the embodiment of the present application is described in detail below. Referring to Fig. 6, a structural schematic diagram of one embodiment of a construction device of a synonym provided by the embodiment of the present application, the device comprises:
an acquiring unit 601, configured to obtain a first corpus that has undergone word segmentation, the first corpus being a set of sentences containing synonyms;
a first processing unit 602, configured to process the first corpus according to a preset term vector computation model to obtain a term vector representing each word in the first corpus;
a second processing unit 603, configured to process the term vectors according to a preset supervised learning model, the supervised learning model being used to raise the term vector similarity between semantically identical words;
a determination unit 604, configured to determine synonyms according to the term vectors processed by the supervised learning model.
Further, in another embodiment of the construction device of the synonym, when the supervised learning model is a deep learning model based on the loss function Triplet Loss,
the second processing unit 603 may be configured to choose at least one group of term vectors from all the term vectors, each group including three term vectors, and to take the three term vectors in each group respectively as the positive example, the negative example and the anchor positive example in the deep learning model, where the word represented by the anchor positive example and the word represented by the positive example are semantically identical, and the word represented by the anchor positive example and the word represented by the negative example are semantically different; and
to process each group of term vectors according to the deep learning model.
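The triplet-loss objective named above can be sketched in a few lines of Python. This is a hedged illustration: Euclidean distance and the margin value are common choices for Triplet Loss, not values disclosed by the application, and the "anchor positive example" corresponds to the anchor of the triplet.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """loss = max(d(anchor, positive) - d(anchor, negative) + margin, 0).

    Minimizing this pulls the anchor term vector toward the positive
    (synonym) and pushes it away from the negative (non-synonym) by at
    least the margin. The 0.2 margin is illustrative only."""
    d = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return max(d(anchor, positive) - d(anchor, negative) + margin, 0.0)
```

When the negative is already farther from the anchor than the positive by more than the margin, the loss is zero and the group contributes no gradient, which is why the method favors "hard" groups of highly similar words.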
Further, in another embodiment of the construction device of the synonym, the second processing unit 603 may be configured to:
select multiple groups of words from all the words of the first corpus, each group including a second word, a third word and a fourth word;
obtain a second corpus comprising the second word, a third corpus comprising the third word and a fourth corpus comprising the fourth word;
if the corpus similarity between the second corpus and the third corpus is greater than a preset first similarity, take the term vector representing the second word as the positive example and the term vector representing the third word as the anchor positive example;
if the corpus similarity between the third corpus and the fourth corpus is less than a preset second similarity, take the term vector representing the fourth word as the negative example; and
process each group of term vectors according to the deep learning model.
Further, in another embodiment of the construction device of the synonym, the corpus similarity may be a ratio of a first number to a second number, the first number being the number of times the same article is both searched for and clicked through the different corpora, and the second number being the number of times the same article is searched for through the different corpora.
Further, in another embodiment of the construction device of the synonym, the corpus similarity may be the number of times the same article is both searched for and clicked through the different corpora.
Further, in another embodiment of the construction device of the synonym, the second processing unit 603 may be configured to:
randomly select a word from the first corpus as the third word;
select, from the first corpus, two words whose term vector similarity with the third word is greater than a preset third similarity, respectively as the second word and the fourth word;
obtain a second corpus comprising the second word, a third corpus comprising the third word and a fourth corpus comprising the fourth word;
if the corpus similarity between the second corpus and the third corpus is greater than a preset first similarity, take the term vector representing the second word as the positive example and the term vector representing the third word as the anchor positive example;
if the corpus similarity between the third corpus and the fourth corpus is less than a preset second similarity, take the term vector representing the fourth word as the negative example; and
process each group of term vectors according to the deep learning model.
Further, in another embodiment of the construction device of the synonym, the second processing unit 603 may be configured to:
randomly select a word from the first corpus as the third word;
select, from the first corpus, two words whose term vector similarity with the third word is greater than a preset third similarity, respectively as the second word and the fourth word;
obtain, from a search click log, all corpora comprising the second word, all corpora comprising the third word and all corpora comprising the fourth word;
process, according to a preset automatic alignment algorithm, all corpora comprising the second word, all corpora comprising the third word and all corpora comprising the fourth word respectively, to obtain the second corpus comprising the second word, the third corpus comprising the third word and the fourth corpus comprising the fourth word, where for any two corpora among the second corpus, the third corpus and the fourth corpus, all words other than the second word, the third word and the fourth word are the same;
if the corpus similarity between the second corpus and the third corpus is greater than a preset first similarity, take the term vector representing the second word as the positive example and the term vector representing the third word as the anchor positive example;
if the corpus similarity between the third corpus and the fourth corpus is less than a preset second similarity, take the term vector representing the fourth word as the negative example; and
process each group of term vectors according to the deep learning model.
Further, in another embodiment of the construction device of the synonym, the determination unit 604 is configured to:
obtain the cosine similarity of two term vectors processed by the supervised learning model;
if the cosine similarity is greater than a preset fourth similarity, determine that the two term vectors represent synonyms.
Further, in another embodiment of the construction device of the synonym, the term vector computation model may be a word2vec model.
Next, the embodiment of the present application also provides a terminal device. As shown in Fig. 7, for ease of description, only the parts related to the embodiment of the present invention are shown; for undisclosed specific technical details, please refer to the method part of the embodiment of the present invention. The terminal device may be any terminal device such as a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a point-of-sale terminal (Point of Sales, POS) or a vehicle-mounted computer, taking a mobile phone as an example:
Fig. 7 shows a block diagram of part of the structure of a mobile phone related to the terminal device provided by an embodiment of the present invention. Referring to Fig. 7, the mobile phone includes components such as a radio frequency (Radio Frequency, RF) circuit 710, a memory 720, an input unit 730, a display unit 740, a sensor 750, an audio circuit 760, a wireless fidelity (wireless fidelity, WiFi) module 770, a processor 780 and a power supply 790. Those skilled in the art will understand that the mobile phone structure shown in Fig. 7 does not constitute a limitation on the mobile phone, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
Each component of the mobile phone is specifically introduced below with reference to Fig. 7:
The RF circuit 710 can be used for receiving and sending signals during message transmission or a call; in particular, after receiving downlink information from a base station, it delivers the information to the processor 780 for processing, and it also sends uplink data to the base station. In general, the RF circuit 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (Low Noise Amplifier, LNA), a duplexer and the like. In addition, the RF circuit 710 can also communicate with networks and other devices by wireless communication. The wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS) and the like.
The memory 720 can be used to store software programs and modules; the processor 780 executes the various function applications and data processing of the mobile phone by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area, where the program storage area can store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.) and the like, and the data storage area can store data created according to the use of the mobile phone (such as audio data, a phone book, etc.) and the like. In addition, the memory 720 may include a high-speed random access memory, and may also include a nonvolatile memory, for example at least one disk memory, flash memory device or other solid-state memory device.
The input unit 730 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also referred to as a touch screen, can collect touch operations by the user on or near it (such as operations by the user with a finger, a stylus or any other suitable object or accessory on or near the touch panel 731) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 731 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends them to the processor 780, and can receive and execute commands sent by the processor 780. Furthermore, the touch panel 731 can be implemented in multiple types such as resistive, capacitive, infrared and surface acoustic wave. In addition to the touch panel 731, the input unit 730 can also include other input devices 732, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, a switch key, etc.), a trackball, a mouse, a joystick and the like.
The display unit 740 can be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 740 may include a display panel 741, which may optionally be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED) or the like. Further, the touch panel 731 can cover the display panel 741; after detecting a touch operation on or near it, the touch panel 731 transmits it to the processor 780 to determine the type of the touch event, and the processor 780 then provides a corresponding visual output on the display panel 741 according to the type of the touch event. Although in Fig. 7 the touch panel 731 and the display panel 741 realize the input and output functions of the mobile phone as two independent components, in some embodiments the touch panel 731 and the display panel 741 can be integrated to realize the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 750, such as an optical sensor, a motion sensor and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel 741 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 741 and/or the backlight when the mobile phone is moved to the ear. As a kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications identifying the mobile phone posture (such as landscape/portrait switching, related games, magnetometer pose calibration), vibration-identification related functions (such as a pedometer, tapping) and the like. Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer and an infrared sensor can also be configured on the mobile phone and are not described in detail here.
The audio circuit 760, a loudspeaker 761 and a microphone 762 can provide an audio interface between the user and the mobile phone. The audio circuit 760 can transfer the electric signal converted from the received audio data to the loudspeaker 761, which converts it into a sound signal for output; on the other hand, the microphone 762 converts the collected sound signal into an electric signal, which the audio circuit 760 receives and converts into audio data; after the audio data is output to the processor 780 for processing, it is sent, for example, to another mobile phone through the RF circuit 710, or output to the memory 720 for further processing.
WiFi belongs to short-range wireless transmission technology; through the WiFi module 770, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media and the like, providing the user with wireless broadband Internet access. Although Fig. 7 shows the WiFi module 770, it is understood that it is not an essential part of the mobile phone and can be omitted as needed within a scope that does not change the essence of the invention.
The processor 780 is the control center of the mobile phone; it connects all parts of the whole mobile phone using various interfaces and lines, and executes the various functions and data processing of the mobile phone by running or executing the software programs and/or modules stored in the memory 720 and calling the data stored in the memory 720, thereby monitoring the mobile phone as a whole. Optionally, the processor 780 may include one or more processing units; optionally, the processor 780 can integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs and the like, and the modem processor mainly handles wireless communication. It is understood that the modem processor may also not be integrated into the processor 780.
The mobile phone further includes the power supply 790 (such as a battery) that supplies power to all the components; optionally, the power supply can be logically connected with the processor 780 through a power management system, so as to realize functions such as charge management, discharge management and power consumption management through the power management system.
Although not shown, the mobile phone can also include a camera module, a Bluetooth module and the like, which are not described in detail here.
In the embodiment of the present invention, the processor 780 included in the terminal device also has the following functions:
The embodiment of the present application also provides a computer storage medium for storing the computer software instructions used by the above-described device or server, including a program designed for the construction device of the synonym or the terminal device.
The embodiment of the present application also provides a computer program product comprising computer software instructions which can be loaded by a processor to realize the flow of the construction method of the synonym described above.
It is apparent to those skilled in the art that, for convenience and simplicity of description, for the specific working processes of the systems, devices and units described above, reference can be made to the corresponding processes in the foregoing method embodiments, which are not described in detail here.
In the several embodiments provided in this application, it should be understood that the disclosed system, device and method can be implemented in other ways. For example, the device embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other division manners in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the application can be integrated in one processing unit, or each unit can physically exist alone, or two or more units can be integrated in one unit. The above integrated unit can be realized in the form of hardware or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes instructions for causing a computer device (which can be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a mobile hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk or an optical disk.
The above embodiments are only intended to illustrate the technical solution of the application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of the technical features can be equivalently replaced, and such modifications or replacements do not depart from the spirit and scope of the technical solutions of the embodiments of the application.

Claims (10)

1. A construction method of a synonym, characterized by comprising:
obtaining a first corpus that has undergone word segmentation, the first corpus being a set of sentences containing synonyms;
processing the first corpus according to a preset term vector computation model to obtain a term vector representing each word in the first corpus;
processing the term vectors according to a preset supervised learning model, the supervised learning model being used to raise the term vector similarity between semantically identical words;
determining synonyms according to the term vectors processed by the supervised learning model.
2. The construction method of the synonym according to claim 1, characterized in that the supervised learning model is a deep learning model based on a loss function;
processing the term vectors according to the preset supervised learning model comprises:
choosing at least one group of term vectors from all the term vectors, each group including three term vectors, and taking the three term vectors in each group respectively as the positive example, the negative example and the anchor positive example in the deep learning model, where the word represented by the anchor positive example and the word represented by the positive example are semantically identical, and the word represented by the anchor positive example and the word represented by the negative example are semantically different;
processing each group of term vectors according to the deep learning model.
3. The construction method of the synonym according to claim 2, characterized in that choosing at least one group of term vectors from all the term vectors, each group including three term vectors, and taking the three term vectors in each group respectively as the positive example, the negative example and the anchor positive example in the deep learning model comprises:
selecting multiple groups of words from all the words of the first corpus, each group including a second word, a third word and a fourth word;
obtaining a second corpus comprising the second word, a third corpus comprising the third word and a fourth corpus comprising the fourth word;
if the corpus similarity between the second corpus and the third corpus is greater than a preset first similarity, taking the term vector representing the second word as the positive example and the term vector representing the third word as the anchor positive example;
if the corpus similarity between the third corpus and the fourth corpus is less than a preset second similarity, taking the term vector representing the fourth word as the negative example.
4. The construction method of the synonym according to claim 3, characterized in that the corpus similarity is a ratio of a first number to a second number, the first number being the number of times the same article is both searched for and clicked through the different corpora, and the second number being the number of times the same article is searched for through the different corpora.
5. The construction method of the synonym according to claim 3, characterized in that the corpus similarity is the number of times the same article is both searched for and clicked through the different corpora.
6. The construction method of the synonym according to claim 3, characterized in that selecting multiple groups of words from all the words of the first corpus, each group including a second word, a third word and a fourth word, comprises:
randomly selecting a word from the first corpus as the third word;
selecting, from the first corpus, two words whose term vector similarity with the third word is greater than a preset third similarity, respectively as the second word and the fourth word.
7. The construction method of the synonym according to claim 3, characterized in that obtaining the second corpus comprising the second word, the third corpus comprising the third word and the fourth corpus comprising the fourth word comprises:
obtaining, from a search click log, all corpora comprising the second word, all corpora comprising the third word and all corpora comprising the fourth word;
processing, according to a preset automatic alignment algorithm, all corpora comprising the second word, all corpora comprising the third word and all corpora comprising the fourth word respectively, to obtain the second corpus comprising the second word, the third corpus comprising the third word and the fourth corpus comprising the fourth word, where for any two corpora among the second corpus, the third corpus and the fourth corpus, all words other than the second word, the third word and the fourth word are the same.
8. The construction method of the synonym according to any one of claims 1 to 7, characterized in that determining synonyms according to the term vectors processed by the supervised learning model comprises:
obtaining the cosine similarity of two term vectors processed by the supervised learning model;
if the cosine similarity is greater than a preset fourth similarity, determining that the two term vectors represent synonyms.
9. The construction method of the synonym according to any one of claims 1 to 7, characterized in that the term vector computation model is a word2vec model.
10. A construction device of a synonym, characterized by comprising:
an acquiring unit, configured to obtain a first corpus that has undergone word segmentation, the first corpus being a set of sentences containing synonyms;
a first processing unit, configured to process the first corpus according to a preset term vector computation model to obtain a term vector representing each word in the first corpus;
a second processing unit, configured to process the term vectors according to a preset supervised learning model, the supervised learning model being used to raise the term vector similarity between semantically identical words;
a determination unit, configured to determine synonyms according to the term vectors processed by the supervised learning model.
CN201910570705.XA 2019-06-26 2019-06-26 A kind of construction method and relevant apparatus of synonym Pending CN110263347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910570705.XA CN110263347A (en) 2019-06-26 2019-06-26 A kind of construction method and relevant apparatus of synonym

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910570705.XA CN110263347A (en) 2019-06-26 2019-06-26 A kind of construction method and relevant apparatus of synonym

Publications (1)

Publication Number Publication Date
CN110263347A true CN110263347A (en) 2019-09-20

Family

ID=67922663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910570705.XA Pending CN110263347A (en) 2019-06-26 2019-06-26 A kind of construction method and relevant apparatus of synonym

Country Status (1)

Country Link
CN (1) CN110263347A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722802A (en) * 2022-04-07 2022-07-08 平安科技(深圳)有限公司 Word vector generation method and device, computer equipment and storage medium
CN114722802B (en) * 2022-04-07 2024-01-30 平安科技(深圳)有限公司 Word vector generation method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107943860B (en) Model training method, text intention recognition method and text intention recognition device
CN109241431B (en) Resource recommendation method and device
US20170091335A1 (en) Search method, server and client
CN109074354B (en) Method and terminal equipment for displaying candidate items
US9241242B2 (en) Information recommendation method and apparatus
CN104239535A (en) Method and system for matching pictures with characters, server and terminal
CN108364644A (en) A kind of voice interactive method, terminal and computer-readable medium
CN110019825B (en) Method and device for analyzing data semantics
CN104217717A (en) Language model constructing method and device
CN111177371B (en) Classification method and related device
CN106294308B (en) Named entity identification method and device
CN109033156B (en) Information processing method and device and terminal
CN104123937A (en) Method, device and system for reminding setting
CN110276010B (en) Weight model training method and related device
CN109543014B (en) Man-machine conversation method, device, terminal and server
CN104281568B (en) Paraphrasing display method and paraphrasing display device
CN107885718B (en) Semantic determination method and device
US20150310119A1 (en) Systems and Methods for Filtering Microblogs
CN110597957B (en) Text information retrieval method and related device
CN112749252A (en) Text matching method based on artificial intelligence and related device
CN104424324B (en) The method and device of locating list item in list element
CN103401910B (en) Recommendation method, server, terminal and system
CN112925878B (en) Data processing method and device
CN112328783A (en) Abstract determining method and related device
CN111553163A (en) Text relevance determining method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination