WO2020244065A1 - Artificial intelligence-based character vector definition method, apparatus, device and storage medium
- Publication number: WO2020244065A1
- Application number: PCT/CN2019/102462
- Authority: WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- This application relates to the field of word segmentation models, and in particular to an artificial intelligence-based character vector definition method, device, equipment and storage medium.
- In natural language processing (NLP), word vector technology transforms words into dense vectors, and similar words have similar word vectors.
- Character vectors and word vectors are input as features of deep learning models, so the effect of the final model largely depends on the quality of the character vectors and word vectors.
- In existing solutions, character vectors and word vectors are relatively independent.
- Entity recognition mostly uses character vectors, while text classification and topic extraction mostly use word vectors for recognition.
- The disadvantage of word vectors is that they are large in number, and in entity extraction both short words and long words are prone to errors.
- The disadvantage of character vectors is that a single character may have completely unrelated meanings in different words, such as the character "old" in "old man" and in "Laozi"; a single character vector can only represent one of these meanings, which does not match human understanding.
- The inventor realizes that in the existing solution, when the same character appears in a word, the meaning of the single character may be completely unrelated to the meaning of the word, yet the character vector carries only a single meaning.
- This application provides an artificial intelligence-based character vector definition method, device, equipment and storage medium, which take a single character as the smallest unit, combine character vectors with word vectors, and assign each character multiple different meanings in different words, thereby increasing the accuracy of the meaning of the character vector in a sentence.
- The first aspect of the embodiments of the present application provides an artificial intelligence-based character vector definition method, including: acquiring a target text, the target text including a Chinese sentence that needs to be segmented; segmenting the target text to obtain multiple words; generating multiple corresponding target word vectors according to the multiple words; generating a target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector; and inputting each target character vector as a model parameter into a long short-term memory (LSTM) network and conditional random field (CRF) model, and performing entity recognition on the Chinese sentence to generate predicted word segmentation.
- The generating of multiple target word vectors corresponding to the multiple words includes: inputting the multiple words into a preset algorithm model; mapping each word into a K-dimensional word vector, where K is an integer greater than 0; calculating the distance between word vectors; determining the semantic similarity between word vectors according to the distances; determining, according to the semantic similarity between word vectors, the vector with the highest semantic similarity to the target word among the multiple words as the target word vector; and determining multiple target word vectors, each target word vector corresponding to one word.
- Before the inputting of the multiple words into the preset algorithm model, the method further includes: randomly generating a word vector matrix in which each row corresponds to a word vector; determining a target word in the word vector matrix and extracting the word vectors of its surrounding words from the word vector matrix; calculating the mean vector of the word vectors of the surrounding words; inputting the mean vector into a preset logistic regression model for training; and generating a preset algorithm model, where the probability vector output by the preset algorithm model matches the one-hot encoding vector of the target word.
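This training procedure resembles a CBOW-style scheme and can be sketched as follows. This is a minimal illustration under the assumption that the "preset logistic regression model" is a softmax layer; the vocabulary, dimensions, and learning rate are hypothetical, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["I", "am", "one", "algorithm", "engineer"]  # toy vocabulary (hypothetical)
V, K = len(vocab), 8                 # vocabulary size, word vector dimension K

E = rng.normal(size=(V, K)) * 0.1    # randomly generated word vector matrix, one row per word
W = rng.normal(size=(K, V)) * 0.1    # softmax ("logistic regression") layer weights

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def train_step(target, context, lr=0.5):
    """One step: predict the target word from the mean vector of its surrounding words."""
    global E, W
    h = E[context].mean(axis=0)      # mean vector of the surrounding words' vectors
    p = softmax(h @ W)               # probability vector over the vocabulary
    y = np.zeros(V); y[target] = 1.0 # one-hot encoding vector of the target word
    grad_out = p - y                 # cross-entropy gradient at the output
    W -= lr * np.outer(h, grad_out)
    E[context] -= lr * (W @ grad_out) / len(context)
    return p

# Train on a single (context -> target) pair until the output probability
# vector approaches the one-hot vector of the target word.
target, context = 4, [2, 3]          # "engineer" from context ["one", "algorithm"]
for _ in range(500):
    p = train_step(target, context)
```

After training, the highest-probability output is the target word, which is the matching condition described above.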
- The calculating of the distance between word vectors includes: determining a first vector and a second vector among the word vectors; calculating the cosine value between the first vector and the second vector, which satisfies the formula Sim(D1, D2) = (Σk W1k · W2k) / (√(Σk W1k²) · √(Σk W2k²)), where D1 and D2 represent the first vector and the second vector respectively, W1k represents the k-th weight of the first vector, W2k represents the k-th weight of the second vector, both vectors include N eigenvalues, and 1 ≤ k ≤ N; and determining the cosine value as the distance between the first vector and the second vector.
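The cosine computation can be sketched directly in Python. This is an illustration of the formula only, not the patented implementation, and the example vectors are hypothetical:

```python
import math

def cosine_distance(d1, d2):
    """Cosine value between two N-dimensional weight vectors,
    used here as the 'distance' between word vectors."""
    if len(d1) != len(d2):
        raise ValueError("both vectors must have the same number of eigenvalues N")
    dot = sum(w1 * w2 for w1, w2 in zip(d1, d2))
    norm1 = math.sqrt(sum(w * w for w in d1))
    norm2 = math.sqrt(sum(w * w for w in d2))
    return dot / (norm1 * norm2)

# Two hypothetical K-dimensional word vectors.
v_engineer = [0.8, 0.1, 0.3]
v_master = [0.7, 0.2, 0.4]
print(round(cosine_distance(v_engineer, v_master), 3))
```

A value close to 1 indicates high semantic similarity; orthogonal vectors give 0.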
- The generating of the target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector, includes: obtaining a preset weighting strategy, the preset weighting strategy including the weight value of each word vector; determining the target weight value of each word vector among the multiple target word vectors; and generating the target character vector according to the multiple target word vectors and each of their target weight values.
- After the target character vector is generated according to the multiple target word vectors and the preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector, the method further includes: inputting each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model, and performing entity recognition on the Chinese sentence to generate predicted word segmentation.
- The inputting of each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model and performing entity recognition on the Chinese sentence to generate predicted word segmentation includes: inputting each target character vector as a model parameter into the LSTM and CRF models; determining the position of each target character vector in the character space; and performing entity recognition on the Chinese sentence according to the position of each target character vector in the character space to generate the predicted word segmentation.
- The second aspect of the embodiments of the present application provides an artificial intelligence-based character vector definition device, including: an acquisition unit for acquiring a target text, the target text including a Chinese sentence that needs to be segmented; a word segmentation unit for segmenting the target text to obtain multiple words; a first generating unit for generating multiple corresponding target word vectors according to the multiple words; and a second generating unit for generating a target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
- The first generating unit includes: a first input module, configured to input the multiple words into a preset algorithm model; a mapping module, used to map each word into a K-dimensional word vector, where K is an integer greater than 0; a first calculation module, used to calculate the distance between word vectors; a first determination module, used to determine the semantic similarity between word vectors according to the distances between them; a second determination module, used to determine, according to the semantic similarity between word vectors, the vector with the highest semantic similarity to the target word among the multiple words as the target word vector; and a third determination module, used to determine multiple target word vectors, each target word vector corresponding to one word.
- The first generating unit further includes: a first generating module, configured to randomly generate a word vector matrix in which each row corresponds to a word vector; an extraction module, used to determine a target word in the word vector matrix and extract the word vectors of its surrounding words from the word vector matrix; a second calculation module, used to calculate the mean vector of the word vectors of the surrounding words;
- a training module, used to input the mean vector into a preset logistic regression model for training;
- and a second generation module, used to generate a preset algorithm model, where the probability vector output by the preset algorithm model matches the one-hot encoding vector of the target word.
- The first calculation module is specifically configured to: determine a first vector and a second vector among the word vectors; calculate the cosine value between the first vector and the second vector, which satisfies the formula Sim(D1, D2) = (Σk W1k · W2k) / (√(Σk W1k²) · √(Σk W2k²)), where D1 and D2 represent the first vector and the second vector respectively, W1k represents the k-th weight of the first vector, W2k represents the k-th weight of the second vector, both vectors include N eigenvalues, and 1 ≤ k ≤ N; and determine the cosine value as the distance between the first vector and the second vector.
- The second generating unit is specifically configured to: obtain a preset weighting strategy, where the preset weighting strategy includes the weight value of each word vector; determine the target weight value of each word vector among the multiple target word vectors; and generate the target character vector according to the multiple target word vectors and each of their target weight values.
- The artificial intelligence-based character vector definition device further includes: a third generating unit, configured to input each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model, and perform entity recognition on the Chinese sentence to generate predicted word segmentation.
- The third generating unit includes: a second input module, configured to input each target character vector as a model parameter into the LSTM and CRF models;
- a fourth determination module, used to determine the position of each target character vector in the character space;
- and a recognition generation module, used to perform entity recognition on the Chinese sentence according to the position of each target character vector in the character space to generate the predicted word segmentation.
- The third aspect of the embodiments of the present application provides an artificial intelligence-based character vector definition device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor.
- When the processor executes the computer program, the artificial intelligence-based character vector definition method described in any of the above embodiments is implemented.
- The fourth aspect of the embodiments of the present application provides a non-volatile computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the steps of the artificial intelligence-based character vector definition method.
- The target text is obtained, and the target text includes Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated according to the multiple words; and a target character vector is generated according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
- A single character is taken as the smallest unit, the combination of character vectors and word vectors is considered, and each character is assigned multiple different meanings in different words, which increases the accuracy of the meaning of the character vector in the sentence and thereby improves the efficiency of Chinese word segmentation.
- FIG. 1 is a schematic diagram of an embodiment of an artificial intelligence-based character vector definition method in an embodiment of this application;
- FIG. 2 is a schematic diagram of another embodiment of the artificial intelligence-based character vector definition method in an embodiment of this application;
- FIG. 3 is a schematic diagram of an embodiment of an artificial intelligence-based character vector definition device in an embodiment of this application;
- FIG. 4 is a schematic diagram of another embodiment of the artificial intelligence-based character vector definition device in an embodiment of this application;
- FIG. 5 is a schematic diagram of an embodiment of artificial intelligence-based character vector definition equipment in an embodiment of this application.
- This application provides an artificial intelligence-based character vector definition method, device, equipment and storage medium, which take a single character as the smallest unit, combine character vectors with word vectors, and assign each character multiple different meanings in different words, increasing the accuracy of the meaning of each character in the sentence and improving the efficiency of Chinese word segmentation.
- Referring to FIG. 1, the flowchart of the artificial intelligence-based character vector definition method specifically includes:
- 101. Obtain the target text, where the target text includes Chinese sentences that need to be segmented.
- the server obtains the target text, and the target text includes Chinese sentences that need to be segmented.
- the Chinese sentence may be "I like Apple”, or “Engineer Fu hits the computer”, etc.
- the embodiment of the application uses “Engineer Fu hits the computer” as the Chinese sentence for description.
- the execution subject of this application may be a word vector definition device based on artificial intelligence, or a terminal or a server, which is not specifically limited here.
- This application takes the server as the execution subject as an example for description.
- The server uses a preset word segmentation tool, such as a Chinese word segmentation tool or HanLP, to segment the target text to obtain multiple words. For example, if the target text is "I am an algorithm engineer", five words can be obtained: "I", "am", "one", "algorithm", and "engineer".
- The target text "Engineer Fu hits the computer" is segmented using 3-gram or 2-gram segmentation, and it is found that the text contains "engineer" and "Chengshi" at the front, followed by "⁇" and "⁇".
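The 2-gram/3-gram lookup described above can be sketched as follows. The sentence and lexicon here are a hypothetical reconstruction, since the original example characters did not survive translation:

```python
def ngrams(text, n):
    """All contiguous n-character substrings of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def candidate_words(text, lexicon):
    """Collect the 2-gram and 3-gram substrings that appear in the lexicon."""
    found = []
    for n in (2, 3):
        for gram in ngrams(text, n):
            if gram in lexicon:
                found.append(gram)
    return found

# Hypothetical stand-ins for the example sentence and the word list.
sentence = "工程师傅敲电脑"
lexicon = {"工程", "工程师", "师傅", "电脑"}
print(candidate_words(sentence, lexicon))  # 2-gram matches first, then 3-gram matches
```

Overlapping candidates such as "工程师" and "师傅" are exactly the ambiguity that the character vector definition is meant to resolve.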
- The server defines the multiple target word vectors corresponding to the multiple words. Specifically, the server inputs the multiple words into a preset algorithm model; maps each word into a K-dimensional word vector, where K is an integer greater than 0; calculates the distance between word vectors; determines the semantic similarity between word vectors according to the distances; determines, according to the semantic similarity between word vectors, the vector with the highest semantic similarity to the target word among the multiple words as the target word vector; and determines multiple target word vectors, each corresponding to one word.
- The server inputs the four words identified above into the preset model, matches the two words "engineer" and "master" in the preset model, and determines the word vector of "engineer" and the word vector of "master".
- The generated preset model needs to ensure the validity of each word vector. If the word vectors contain only "engineering" and "master", then for a sentence such as "engineering is a good job" the result predicted by the preset model can only match the existing word vectors; if entity recognition is continued, a tagging such as "engineer(B) shi(E) is" is obtained, which splits the whole word "engineer". Although a result can be obtained, the information is split apart.
- target text generally refers to various machine-readable records.
- the text is represented by D (Document), and the feature item is represented by T (Term).
- T refers to the basic language unit that is present in document D and can represent the content of the document; it mainly consists of words or phrases.
- The text can thus be represented by a feature set D(T1, T2, ..., Tn), where Tk is a feature item and 1 ≤ k ≤ n.
- 104. Generate a target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
- The server generates the target character vector according to the multiple corresponding word vectors and the preset weighting strategy, where each of the multiple words corresponding to the multiple word vectors contains the character corresponding to the character vector. For example, to define the character vector of "师" (shi), the current character vector of "师" is (word vector of "teacher" + word vector of "master" + word vector of "engineer") / 3, which gives the target character vector of "师".
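The averaging above (equal weights, i.e. a uniform preset weighting strategy) can be sketched as follows; the vector values are hypothetical:

```python
import numpy as np

# Hypothetical word vectors for three words containing the character "师"
# (teacher, master, engineer); the values are made up for illustration.
word_vectors = {
    "老师": np.array([0.9, 0.1, 0.2]),
    "师傅": np.array([0.7, 0.3, 0.1]),
    "工程师": np.array([0.8, 0.2, 0.3]),
}

def character_vector(words, weights=None):
    """Combine the word vectors of the words that contain the target character.
    With no weights given this is the plain average used in the example above;
    a preset weighting strategy corresponds to passing explicit weights."""
    vecs = np.stack([word_vectors[w] for w in words])
    if weights is None:
        weights = np.full(len(words), 1.0 / len(words))
    return np.asarray(weights) @ vecs  # weighted sum of the word vectors

v_shi = character_vector(["老师", "师傅", "工程师"])
print(v_shi)  # (v_teacher + v_master + v_engineer) / 3
```

Passing non-uniform weights lets the same character take a different vector depending on which word it appears in.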
- The target text is obtained, and the target text includes Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated according to the multiple words; and a target character vector is generated according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector. Taking a single character as the smallest unit and combining character vectors with word vectors assigns each character multiple different meanings in different words, increasing the accuracy of the meaning of the character vector in the sentence.
- FIG. 2 shows another flowchart of the artificial intelligence-based character vector definition method provided by the embodiment of the present application, which specifically includes:
- The steps of obtaining the target text, segmenting it into multiple words, and generating the corresponding target word vectors in this embodiment are the same as in the embodiment of FIG. 1 described above and are not repeated here.
- The text can be represented by a feature set D(T1, T2, ..., Tn), where Tk is a feature item. For example, if a document has four feature items a, b, c and d, the document can be expressed as D(a, b, c, d). For a text containing n feature items, each feature item is usually given a weight to indicate its importance.
- Wk is the weight of Tk, 1 ≤ k ≤ n.
- the vector of the text can be expressed as D(30, 20, 20, 10).
- the content correlation between two texts D1 and D2 Sim(D1, D2) is usually expressed by the cosine value of the angle between the vectors.
- For example, suppose the feature items of text D1 are a, b, c, d with weights 30, 20, 20, 10 respectively, and the feature items of text C1 are a, c, d, e with weights 40, 30, 20, 10 respectively.
- Over the combined feature set (a, b, c, d, e), the vector of D1 is represented as D1(30, 20, 20, 10, 0),
- and the vector of C1 is represented as C1(40, 0, 30, 20, 10).
- The calculated similarity between text D1 and text C1 is 0.86.
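The 0.86 similarity quoted above can be verified directly from the two weight vectors:

```python
import math

def sim(d1, d2):
    """Cosine of the angle between two feature-weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2)))

D1 = [30, 20, 20, 10, 0]   # weights of features a, b, c, d, e in text D1
C1 = [40, 0, 30, 20, 10]   # weights of features a, b, c, d, e in text C1
print(round(sim(D1, C1), 2))  # → 0.86
```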
- Generate a target character vector according to the multiple target word vectors and the preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
- The server generates the target character vector according to the multiple corresponding word vectors and the preset weighting strategy, where each of the multiple words corresponding to the multiple word vectors contains the character corresponding to the character vector.
- For example, the current character vector of "师" is (word vector of "teacher" + word vector of "master" + word vector of "engineer") / 3, which gives the target character vector of "师".
- Input each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model, and perform entity recognition on the Chinese sentence to generate predicted word segmentation.
- The server inputs each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model, and performs entity recognition on the Chinese sentence to generate the predicted word segmentation.
- Specifically, the server inputs each target character vector as a model parameter into the LSTM and CRF models; determines the position of each target character vector in the character space; and performs entity recognition on the Chinese sentence according to the position of each target character vector in the character space, generating the predicted word segmentation.
- Entity recognition is performed on the Chinese sentence according to the position of each target character vector in the character space, and the process of generating the predicted word segmentation specifically includes:
- The preset formula is: s(X, y) = Σ(i=0..n) A(y_i, y_(i+1)) + Σ(i=1..n) P(i, y_i), where P is the score matrix mapped from the output of the bidirectional LSTM after the fully connected layer; P(i, j) represents the score of the j-th label for the i-th character in the Chinese sentence, its dimension is n × k, and k is the number of labels;
- A represents the transition matrix of the word segmentation labels,
- and A(i, j) represents the transition score from label i to label j, 1 ≤ i ≤ k, 1 ≤ j ≤ k. Determine the probability p of the word segmentation label sequence, which satisfies the formula: p(y|X) = exp(s(X, y)) / Σ(y′ ∈ Y_X) exp(s(X, y′)).
- Calculate the loss function of probability p: loss = −log p(y|X) = −s(X, y) + log Σ(y′ ∈ Y_X) exp(s(X, y′)), where Y_X represents all label sequences of the Chinese sentence X; the label sequence with the highest score is determined according to the loss function as the predicted word segmentation.
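These scoring and loss computations can be illustrated on a toy example. The matrices are hypothetical, the boundary start/end transitions are omitted for brevity, and the partition sum is brute-forced over all label sequences rather than computed with the usual forward dynamic program:

```python
import itertools
import math
import numpy as np

n, k = 3, 2                        # sentence length n, number of labels k
rng = np.random.default_rng(1)
P = rng.normal(size=(n, k))        # emission scores: BiLSTM output after the fully connected layer
A = rng.normal(size=(k, k))        # transition matrix between word segmentation labels

def score(y):
    """s(X, y): sum of transition scores plus sum of emission scores."""
    s = sum(A[y[i], y[i + 1]] for i in range(n - 1))
    s += sum(P[i, y[i]] for i in range(n))
    return s

# All label sequences Y_X (brute force is only feasible at toy sizes).
all_seqs = list(itertools.product(range(k), repeat=n))
log_Z = math.log(sum(math.exp(score(y)) for y in all_seqs))

def prob(y):
    """p(y | X) = exp(s(X, y)) / sum over Y_X of exp(s(X, y'))."""
    return math.exp(score(y) - log_Z)

def loss(y):
    """-log p(y | X) = -s(X, y) + log-sum-exp over Y_X."""
    return -score(y) + log_Z

best = max(all_seqs, key=score)    # decoding: the highest-scoring label sequence
```

The highest-scoring sequence is exactly the one with the smallest loss, which is the decoding rule stated above; a real implementation would use the Viterbi algorithm for decoding.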
- The target text is obtained, and the target text includes Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated according to the multiple words; and a target character vector is generated according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
- Referring to FIG. 3, an embodiment of the artificial intelligence-based character vector definition device includes:
- the obtaining unit 301 is configured to obtain a target text, the target text including a Chinese sentence that needs to be segmented;
- the word segmentation unit 302 is configured to segment the target text to obtain multiple words
- the first generating unit 303 is configured to generate multiple corresponding target word vectors according to the multiple words
- the second generating unit 304 is configured to generate a target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
- The target text is obtained, and the target text includes Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated according to the multiple words; and a target character vector is generated according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector. Taking a single character as the smallest unit and combining character vectors with word vectors assigns each character multiple different meanings in different words, increasing the accuracy of the meaning of the character vector in the sentence.
- Referring to FIG. 4, another embodiment of the artificial intelligence-based character vector definition device in the embodiment of the present application includes:
- the obtaining unit 301 is configured to obtain a target text, the target text including a Chinese sentence that needs to be segmented;
- the word segmentation unit 302 is configured to segment the target text to obtain multiple words
- the first generating unit 303 is configured to generate multiple corresponding target word vectors according to the multiple words
- the second generating unit 304 is configured to generate a target word vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the word corresponding to the target word vector.
- the first generating unit 303 includes:
- the first input module 30301 is configured to input the multiple words into a preset algorithm model
- the mapping module 30302 is used to map each word into a K-dimensional word vector, where K is an integer greater than 0;
- the first calculation module 30303 is used to calculate the distance between each pair of word vectors; the first determining module is used to determine the semantic similarity between the word vectors according to the distance between them;
- the second determining module 30304 is configured to determine, according to the semantic similarity between each word vector, the vector with the highest semantic similarity to the target word among the multiple words as the target word vector;
- the third determining module 30305 is configured to determine multiple target word vectors, and each target word vector corresponds to a word.
- the first generating unit 303 further includes:
- the first generating module 30306 is used to randomly generate a word vector matrix, and each row corresponds to a word vector;
- the extraction module 30307 is configured to determine a target word in the word vector matrix, and extract the word vectors of surrounding words from the word vector matrix;
- the second calculation module 30308 is configured to calculate the mean vector of the word vectors of the surrounding words
- the training module 30309 is configured to input the mean vector into a preset logistic regression model for training
- the second generation module 30310 is configured to generate a preset algorithm model, and the probability vector output by the preset algorithm model matches the one-hot encoding vector of the target word.
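The training flow of modules 30306 to 30310 (randomly generated word-vector matrix, mean vector of the surrounding words, logistic-regression training whose probability vector is matched against the target word's one-hot encoding) can be sketched as follows. This is a minimal illustration only, not the patented implementation; the vocabulary size, dimension, learning rate, and the omission of the embedding-matrix update are assumptions made for brevity:

```python
import math
import random

random.seed(0)
V, K = 5, 4  # assumed vocabulary size and embedding dimension
# randomly generated word-vector matrix: one row per word (module 30306)
E = [[random.uniform(-0.5, 0.5) for _ in range(K)] for _ in range(V)]
# output weights of the logistic-regression (softmax) layer
W = [[random.uniform(-0.5, 0.5) for _ in range(K)] for _ in range(V)]

def train_step(target, context, lr=0.1):
    """One training step: context word indices -> probability over the vocabulary."""
    # mean vector of the surrounding words' vectors (modules 30307-30308)
    mean = [sum(E[c][k] for c in context) / len(context) for k in range(K)]
    # softmax over the vocabulary produces the probability vector
    scores = [sum(W[w][k] * mean[k] for k in range(K)) for w in range(V)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # cross-entropy gradient against the one-hot encoding of the target word;
    # only the output layer is updated here, for simplicity
    for w in range(V):
        err = probs[w] - (1.0 if w == target else 0.0)
        for k in range(K):
            W[w][k] -= lr * err * mean[k]
    return probs
```

Repeated calls drive the output probability of the target word toward its one-hot encoding, which is the stopping criterion the second generation module 30310 describes.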
- the first calculation module 30303 is specifically configured to:
- calculate the cosine value between the first vector and the second vector as: cos(D1, D2) = Σ_{k=1..N} (W_1k × W_2k) / ( √(Σ_{k=1..N} W_1k²) × √(Σ_{k=1..N} W_2k²) ), where D1 and D2 represent the first vector and the second vector respectively, W_1k represents the k-th weight (eigenvalue) of the first vector, W_2k represents the k-th weight of the second vector, both the first vector and the second vector include N eigenvalues, and 1 ≤ k ≤ N; the cosine value is determined as the distance between the first vector and the second vector.
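The cosine-distance calculation of the first calculation module 30303 can be sketched as below, under the assumption that each vector is given as a plain list of N eigenvalues:

```python
import math

def cosine_distance(d1, d2):
    """Cosine value between two N-dimensional vectors:
    sum(W1k * W2k) / (sqrt(sum(W1k^2)) * sqrt(sum(W2k^2)))."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    # guard against zero vectors, an edge case the text does not address
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

A larger cosine value means the two word vectors are closer in meaning, which is how the first determining module derives semantic similarity from the distance.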
- the second generating unit 304 is specifically configured to:
- the preset weight strategy includes the weight value of each word vector; the target weight value of each of the plurality of target word vectors is determined; and a target word vector is generated according to the plurality of target word vectors and the target weight value of each of them.
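The weighting performed by the second generating unit 304 can be illustrated with a short sketch. Since the preset weight strategy itself is not specified in the text, the weighted-average form (normalizing by the sum of the weights) is an assumption for illustration:

```python
def character_vector(word_vectors, weights):
    """Combine the target word vectors into one vector using per-vector
    target weight values (the preset weight strategy is assumed to be
    a normalized weighted sum)."""
    assert len(word_vectors) == len(weights) and word_vectors
    k = len(word_vectors[0])
    total = sum(weights)
    return [sum(w * vec[i] for w, vec in zip(weights, word_vectors)) / total
            for i in range(k)]
```

For example, two 2-dimensional vectors with equal weights yield their element-wise average.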
- the word vector definition device based on artificial intelligence further includes:
- the third generating unit 305 is configured to input each target word vector as a model parameter into the long short-term memory (LSTM) network and conditional random field (CRF) model, and perform entity recognition on the Chinese sentence to generate predicted word segmentation.
- the third generating unit 305 specifically includes:
- the second input module 3051 is used to input each target word vector as a model parameter into the LSTM and CRF models;
- the fourth determining module 3052 is used to determine the position of each target word vector in the word space
- the recognition generating module 3053 is configured to perform entity recognition on the Chinese sentence according to the position of each target word vector in the character space to generate predicted word segmentation.
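The CRF decoding step behind the recognition generating module 3053 is commonly implemented with the Viterbi algorithm, which selects the most probable tag sequence from per-position scores and tag-transition scores. The following is a generic sketch; the tag set, emission scores, and transition table are illustrative assumptions, not the patent's model:

```python
def viterbi(emissions, transitions, tags):
    """CRF-style decoding: return the highest-scoring tag sequence.
    emissions: list of {tag: score} dicts, one per position;
    transitions: {(prev_tag, tag): score} dict."""
    n = len(emissions)
    score = [dict() for _ in range(n)]  # best score per (position, tag)
    back = [dict() for _ in range(n)]   # backpointers for path recovery
    for t in tags:
        score[0][t] = emissions[0][t]
    for i in range(1, n):
        for t in tags:
            prev = max(tags, key=lambda p: score[i - 1][p] + transitions[(p, t)])
            score[i][t] = score[i - 1][prev] + transitions[(prev, t)] + emissions[i][t]
            back[i][t] = prev
    last = max(tags, key=lambda t: score[n - 1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

With, say, B/E segmentation tags, the decoded tag sequence is what the module would read off as the predicted word segmentation.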
- the target text is obtained, and the target text includes Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated according to the multiple words; and a target word vector is generated according to the multiple target word vectors and a preset weighting strategy, where each word in the multiple words corresponding to the multiple target word vectors contains the word corresponding to the target word vector.
- FIG. 5 is a schematic structural diagram of an artificial intelligence-based word vector definition device provided by an embodiment of the present application.
- the artificial intelligence-based word vector definition device 500 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 501, a memory 509, and one or more storage media 508 for storing application programs 507 or data 506 (for example, one or more mass storage devices).
- the memory 509 and the storage medium 508 may be short-term storage or persistent storage.
- the program stored in the storage medium 508 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the artificial intelligence-based word vector definition device.
- the processor 501 may be configured to communicate with the storage medium 508, and execute a series of instruction operations in the storage medium 508 on the artificial intelligence-based word vector definition device 500.
- the artificial intelligence-based word vector definition device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
- the processor 501 can perform the functions of the acquiring unit 301, the word segmentation unit 302, the first generating unit 303, the second generating unit 304, and the third generating unit 305 in the foregoing embodiment.
- the processor 501 is the control center of the artificial intelligence-based word vector definition device, and can perform processing according to the set artificial intelligence-based word vector definition method.
- the processor 501 uses various interfaces and lines to connect the various parts of the entire artificial intelligence-based word vector definition device; by running or executing the software programs and/or modules stored in the memory 509 and calling the data stored in the memory 509, it executes the various functions of the device, processes data, and converts unreadable labels in the message domain into readable labels, thereby realizing rapid identification of the application scenarios in a message.
- the storage medium 508 and the memory 509 are both carriers for storing data. In the embodiment of the present application, the storage medium 508 may refer to an internal memory with a small storage capacity but a high speed, while the memory 509 may refer to an external memory with a large storage capacity but a slower speed.
- the memory 509 can be used to store software programs and modules, and the processor 501 executes various functional applications and data processing of the word vector definition device 500 based on artificial intelligence by running the software programs and modules stored in the memory 509.
- the memory 509 may mainly include a storage program area and a storage data area.
- the storage program area may store an operating system and an application program required by at least one function (for example, a target word vector is generated according to multiple target word vectors and a preset weight strategy)
- the storage data area can store data (such as multiple target word vectors) created according to the use of artificial intelligence-based word vector definition equipment.
- the memory 509 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
- the artificial intelligence-based word vector definition method program and the received data stream provided in the embodiment of the present application are stored in the memory, and are called from the memory 509 by the processor 501 when needed.
- the present application also provides a non-volatile computer-readable storage medium, including instructions, which when run on a computer, cause the computer to execute the following steps of the artificial intelligence-based word vector definition method:
- obtaining a target text, the target text including a Chinese sentence that needs to be segmented; segmenting the target text to obtain multiple words; generating multiple corresponding target word vectors according to the multiple words; and generating a target word vector according to the multiple target word vectors and a preset weighting strategy, each of the multiple words corresponding to the multiple target word vectors including the word corresponding to the target word vector.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or twisted pair) or wireless means (such as infrared, radio, or microwave).
- the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media.
- the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a solid state disk (SSD)).
- the disclosed system, device, and method may be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application relates to the field of artificial intelligence, and in particular to the field of word segmentation models. Provided are an artificial-intelligence-based character vector definition method, apparatus, and device, and a storage medium, which are used to give each character multiple different meanings in different words by taking a single character as the minimum structure, thereby improving the accuracy of the meaning of a character vector in a sentence. The method of the present application comprises: obtaining a target text, the target text comprising a Chinese sentence to undergo word segmentation; performing word segmentation on the target text to obtain multiple words; generating multiple corresponding target word vectors according to the multiple words; and generating target character vectors according to the multiple target word vectors and a preset weight policy, each word among the multiple words corresponding to the multiple target word vectors comprising the characters corresponding to the target character vectors.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910483399.6A CN110298035B (zh) | 2019-06-04 | 2019-06-04 | 基于人工智能的字向量定义方法、装置、设备及存储介质 |
CN201910483399.6 | 2019-06-04 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020244065A1 true WO2020244065A1 (fr) | 2020-12-10 |
Family
ID=68027590
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/102462 WO2020244065A1 (fr) | 2019-06-04 | 2019-08-26 | Procédé, appareil et dispositif de définition de vecteur de caractère basés sur l'intelligence artificielle et support de stockage |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110298035B (fr) |
WO (1) | WO2020244065A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112861531A (zh) * | 2021-03-22 | 2021-05-28 | 北京小米移动软件有限公司 | 分词方法、装置、存储介质和电子设备 |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110928936B (zh) * | 2019-10-18 | 2023-06-16 | 平安科技(深圳)有限公司 | 基于强化学习的信息处理方法、装置、设备和存储介质 |
CN110797005B (zh) * | 2019-11-05 | 2022-06-10 | 百度在线网络技术(北京)有限公司 | 韵律预测方法、装置、设备和介质 |
CN111079442B (zh) * | 2019-12-20 | 2021-05-18 | 北京百度网讯科技有限公司 | 文档的向量化表示方法、装置和计算机设备 |
CN113051918B (zh) * | 2019-12-26 | 2024-05-14 | 北京中科闻歌科技股份有限公司 | 基于集成学习的命名实体识别方法、装置、设备和介质 |
CN112016313B (zh) * | 2020-09-08 | 2024-02-13 | 迪爱斯信息技术股份有限公司 | 口语化要素识别方法及装置、警情分析系统 |
CN112183111A (zh) * | 2020-09-28 | 2021-01-05 | 亚信科技(中国)有限公司 | 长文本语义相似度匹配方法、装置、电子设备及存储介质 |
CN113282749A (zh) * | 2021-05-20 | 2021-08-20 | 北京明略软件系统有限公司 | 一种会话情感分类方法、系统、电子设备及存储介质 |
CN113343669A (zh) * | 2021-05-20 | 2021-09-03 | 北京明略软件系统有限公司 | 一种学习字向量方法、系统、电子设备及存储介质 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170091318A1 (en) * | 2015-09-29 | 2017-03-30 | Kabushiki Kaisha Toshiba | Apparatus and method for extracting keywords from a single document |
CN107273355A (zh) * | 2017-06-12 | 2017-10-20 | 大连理工大学 | 一种基于字词联合训练的中文词向量生成方法 |
CN107688604A (zh) * | 2017-07-26 | 2018-02-13 | 阿里巴巴集团控股有限公司 | 数据应答处理方法、装置及服务器 |
CN109063035A (zh) * | 2018-07-16 | 2018-12-21 | 哈尔滨工业大学 | 一种面向出行领域的人机多轮对话方法 |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106569998A (zh) * | 2016-10-27 | 2017-04-19 | 浙江大学 | 一种基于Bi‑LSTM、CNN和CRF的文本命名实体识别方法 |
CN108509408B (zh) * | 2017-02-27 | 2019-11-22 | 芋头科技(杭州)有限公司 | 一种句子相似度判断方法 |
CN107168952B (zh) * | 2017-05-15 | 2021-06-04 | 北京百度网讯科技有限公司 | 基于人工智能的信息生成方法和装置 |
CN108132931B (zh) * | 2018-01-12 | 2021-06-25 | 鼎富智能科技有限公司 | 一种文本语义匹配的方法及装置 |
CN108717409A (zh) * | 2018-05-16 | 2018-10-30 | 联动优势科技有限公司 | 一种序列标注方法及装置 |
CN109271637B (zh) * | 2018-09-30 | 2023-12-01 | 科大讯飞股份有限公司 | 一种语义理解方法及装置 |
2019
- 2019-06-04 CN CN201910483399.6A patent/CN110298035B/zh active Active
- 2019-08-26 WO PCT/CN2019/102462 patent/WO2020244065A1/fr active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170091318A1 (en) * | 2015-09-29 | 2017-03-30 | Kabushiki Kaisha Toshiba | Apparatus and method for extracting keywords from a single document |
CN107273355A (zh) * | 2017-06-12 | 2017-10-20 | 大连理工大学 | 一种基于字词联合训练的中文词向量生成方法 |
CN107688604A (zh) * | 2017-07-26 | 2018-02-13 | 阿里巴巴集团控股有限公司 | 数据应答处理方法、装置及服务器 |
CN109063035A (zh) * | 2018-07-16 | 2018-12-21 | 哈尔滨工业大学 | 一种面向出行领域的人机多轮对话方法 |
Non-Patent Citations (1)
Title |
---|
LI, WEIKANG ET AL.: "Combination Methods of Chinese Character and Word Embeddings in Deep Learning", JOURNAL OF CHINESE INFORMATION PROCESSING, vol. 31, no. 6, 30 November 2017 (2017-11-30), pages 140 - 146, XP055765583 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112861531A (zh) * | 2021-03-22 | 2021-05-28 | 北京小米移动软件有限公司 | 分词方法、装置、存储介质和电子设备 |
CN112861531B (zh) * | 2021-03-22 | 2023-11-14 | 北京小米移动软件有限公司 | 分词方法、装置、存储介质和电子设备 |
Also Published As
Publication number | Publication date |
---|---|
CN110298035A (zh) | 2019-10-01 |
CN110298035B (zh) | 2023-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020244065A1 (fr) | Procédé, appareil et dispositif de définition de vecteur de caractère basés sur l'intelligence artificielle et support de stockage | |
US11610384B2 (en) | Zero-shot object detection | |
CN108959246B (zh) | 基于改进的注意力机制的答案选择方法、装置和电子设备 | |
WO2022022163A1 (fr) | Procédé d'apprentissage de modèle de classification de texte, dispositif, appareil, et support de stockage | |
US20200242444A1 (en) | Knowledge-graph-embedding-based question answering | |
KR101754473B1 (ko) | 문서를 이미지 기반 컨텐츠로 요약하여 제공하는 방법 및 시스템 | |
CN110377916B (zh) | 词预测方法、装置、计算机设备及存储介质 | |
JP2021152963A (ja) | 語義特徴の生成方法、モデルトレーニング方法、装置、機器、媒体及びプログラム | |
JP6848091B2 (ja) | 情報処理装置、情報処理方法、及びプログラム | |
WO2018196718A1 (fr) | Procédé et dispositif de désambiguïsation d'image, support de stockage et dispositif électronique | |
WO2022174496A1 (fr) | Procédé et appareil d'annotation de données basés sur un modèle génératif, dispositif et support de stockage | |
US11461613B2 (en) | Method and apparatus for multi-document question answering | |
CN113806582B (zh) | 图像检索方法、装置、电子设备和存储介质 | |
WO2014073206A1 (fr) | Dispositif de traitement de données, et procédé pour le traitement de données | |
US20240152770A1 (en) | Neural network search method and related device | |
EP3683694A1 (fr) | Dispositif et procédé de déduction de relation sémantique entre des mots | |
CN109271624B (zh) | 一种目标词确定方法、装置及存储介质 | |
CN112818091A (zh) | 基于关键词提取的对象查询方法、装置、介质与设备 | |
CN112183083A (zh) | 文摘自动生成方法、装置、电子设备及存储介质 | |
CN114995903B (zh) | 一种基于预训练语言模型的类别标签识别方法及装置 | |
CN107861948B (zh) | 一种标签提取方法、装置、设备和介质 | |
WO2022141872A1 (fr) | Procédé et appareil de génération de résumé de document, dispositif informatique et support de stockage | |
WO2022228127A1 (fr) | Procédé et appareil de traitement de texte d'élément, dispositif électronique et support de stockage | |
JP7291181B2 (ja) | 業界テキスト増分方法、関連装置、およびコンピュータプログラム製品 | |
EP4060526A1 (fr) | Procédé et dispositif de traitement de texte |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19932088 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19932088 Country of ref document: EP Kind code of ref document: A1 |