WO2020244065A1 - Artificial intelligence-based character vector definition method, apparatus, device and storage medium - Google Patents

Artificial intelligence-based character vector definition method, apparatus, device and storage medium

Info

Publication number
WO2020244065A1
WO2020244065A1 · PCT/CN2019/102462 · CN2019102462W
Authority
WO
WIPO (PCT)
Prior art keywords
vector
word
target
word vector
target word
Prior art date
Application number
PCT/CN2019/102462
Other languages
English (en)
Chinese (zh)
Inventor
陈闽川
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020244065A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of word segmentation models, and in particular to an artificial intelligence-based character vector definition method, device, equipment and storage medium.
  • NLP: natural language processing.
  • Word vector technology transforms words into dense vectors; for similar words, the corresponding word vectors are also similar.
  • Character vectors and word vectors are input as features of deep learning models, so the performance of the final model largely depends on the quality of the character vectors and word vectors.
  • Character vectors and word vectors are relatively independent.
  • For entity recognition, character vectors are mostly used.
  • For text classification and topic extraction, word vectors are mostly used for recognition.
  • The disadvantage of word vectors is that they are numerous, and in entity extraction both short words and long words are prone to errors.
  • The disadvantage of character vectors is that a single character may have completely unrelated meanings in different words, such as "old" in "old man" versus "old" in "Laozi"; its character vector can represent only one of these meanings, which does not match human understanding.
  • The inventor realized that in existing solutions, for the same character, when it appears in a word, the meaning of the single character may be completely unrelated to the meaning of the word, yet the character vector carries only a single meaning.
  • This application provides an artificial intelligence-based character vector definition method, device, equipment and storage medium, which take a single character as the smallest unit, consider the combination of character vectors and word vectors, and assign each character multiple different meanings in different words, increasing the accuracy of the character vector's meaning within a sentence.
  • The first aspect of the embodiments of the present application provides an artificial intelligence-based character vector definition method, including: acquiring a target text, the target text including a Chinese sentence that needs to be segmented; segmenting the target text to obtain multiple words; generating multiple corresponding target word vectors from the multiple words; generating a target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector; and inputting each target character vector as a model parameter into a long short-term memory (LSTM) network and conditional random field (CRF) model, performing entity recognition on the Chinese sentence to generate predicted word segmentation.
  • The generating of the multiple target word vectors corresponding to the multiple words includes: inputting the multiple words into a preset algorithm model; mapping each word into a K-dimensional word vector, where K is an integer greater than 0; calculating the distance between the word vectors; determining the semantic similarity between the word vectors according to those distances; determining, according to the semantic similarity between the word vectors, the vector with the highest semantic similarity to the target word among the multiple words as the target word vector; and determining the multiple target word vectors, each target word vector corresponding to one word.
  • Before the inputting of the multiple words into the preset algorithm model, the method further includes: randomly generating a word vector matrix in which each row corresponds to a word vector; determining a target word in the word vector matrix and extracting the word vectors of its surrounding words from the word vector matrix; calculating the mean vector of the word vectors of the surrounding words; inputting the mean vector into a preset logistic regression model for training; and generating a preset algorithm model, where the probability vector output by the preset algorithm model matches the one-hot encoding vector of the target word.
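  • The training procedure above (a random word-vector matrix, the mean of the surrounding words' vectors fed to a logistic-regression layer whose output should match the target word's one-hot encoding) resembles a CBOW-style setup. A minimal numpy sketch, where the vocabulary size, dimensions, and function names are illustrative assumptions rather than the patent's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 100, 8                  # hypothetical vocabulary size and vector dimension K
W = rng.normal(size=(V, K))    # randomly generated word-vector matrix, one row per word
U = rng.normal(size=(K, V))    # weights of the logistic-regression (softmax) output layer

def train_step(target_id, context_ids, lr=0.1):
    """One update: mean vector of surrounding words -> softmax -> match one-hot target."""
    mean_vec = W[context_ids].mean(axis=0)   # mean vector of the surrounding words
    logits = mean_vec @ U
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # probability vector output by the model
    one_hot = np.zeros(V)
    one_hot[target_id] = 1.0                 # one-hot encoding vector of the target word
    grad = probs - one_hot                   # cross-entropy gradient w.r.t. the logits
    U[:, :] -= lr * np.outer(mean_vec, grad)
    W[context_ids] -= lr * (grad @ U.T) / len(context_ids)
    return probs[target_id]
```

  Repeated updates on the same (target, context) pair push the model's probability vector toward the target word's one-hot encoding, which is the training objective described above.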
  • The calculating of the distance between the word vectors includes: determining a first vector and a second vector among the word vectors; calculating the cosine value between the first vector and the second vector, which satisfies the formula Sim(D1, D2) = (Σ(k=1..N) W1k × W2k) / (√(Σ(k=1..N) W1k²) × √(Σ(k=1..N) W2k²)), where D1 and D2 represent the first vector and the second vector respectively, W1k represents the k-th weight of the first vector, W2k represents the k-th weight of the second vector, both the first vector and the second vector include N eigenvalues, and 1 ≤ k ≤ N; and determining the cosine value as the distance between the first vector and the second vector.
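  • As a sketch, the cosine distance above can be computed directly from the two weight vectors (function and variable names are illustrative):

```python
import math

def cosine_distance(d1, d2):
    """Cosine of the angle between two N-dimensional weight vectors."""
    assert len(d1) == len(d2)
    dot = sum(w1 * w2 for w1, w2 in zip(d1, d2))
    norm1 = math.sqrt(sum(w * w for w in d1))
    norm2 = math.sqrt(sum(w * w for w in d2))
    return dot / (norm1 * norm2)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
```

  A value near 1 indicates high semantic similarity, near 0 indicates orthogonal (unrelated) vectors.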
  • The generating of the target character vector according to the multiple target word vectors and the preset weight strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector, includes: obtaining a preset weight strategy, the preset weight strategy including the weight value of each word vector; determining the target weight value of each word vector among the multiple target word vectors; and generating the target character vector according to the multiple target word vectors and their target weight values.
  • After the target character vector is generated according to the multiple target word vectors and the preset weight strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector, the method further includes: inputting each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model, and performing entity recognition on the Chinese sentence to generate predicted word segmentation.
  • The inputting of each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model and the performing of entity recognition on the Chinese sentence to generate predicted word segmentation includes: inputting each target character vector as a model parameter into the LSTM and CRF models; determining the position of each target character vector in the character space; and performing entity recognition on the Chinese sentence according to the position of each target character vector in the character space to generate the predicted word segmentation.
  • The second aspect of the embodiments of the present application provides an artificial intelligence-based character vector definition device, including: an acquisition unit for acquiring a target text, the target text including a Chinese sentence that needs to be segmented; a word segmentation unit for segmenting the target text to obtain multiple words; a first generating unit for generating multiple corresponding target word vectors from the multiple words; and a second generating unit for generating a target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
  • The first generating unit includes: a first input module, configured to input the multiple words into a preset algorithm model; a mapping module, used to map each word into a K-dimensional word vector, where K is an integer greater than 0; a first calculation module, used to calculate the distance between the word vectors; a first determination module, used to determine the semantic similarity between the word vectors according to the distances between them; a second determining module, used to determine, according to the semantic similarity between the word vectors, the vector with the highest semantic similarity to the target word among the multiple words as the target word vector; and a third determining module, used to determine the multiple target word vectors, each target word vector corresponding to one word.
  • the first generating unit further includes: a first generating module, configured to randomly generate a word vector matrix, and each row corresponds to a word vector; An extraction module, used to determine a target word in the word vector matrix, and extract the word vectors of surrounding words from the word vector matrix; a second calculation module, used to calculate the mean vector of the word vectors of the surrounding words;
  • the training module is used to input the mean vector into a preset logistic regression model for training;
  • The second generation module is used to generate a preset algorithm model, where the probability vector output by the preset algorithm model matches the one-hot encoding vector of the target word.
  • The first calculation module is specifically configured to: determine a first vector and a second vector among the word vectors; calculate the cosine value between the first vector and the second vector, which satisfies the formula Sim(D1, D2) = (Σ(k=1..N) W1k × W2k) / (√(Σ(k=1..N) W1k²) × √(Σ(k=1..N) W2k²)), where D1 and D2 represent the first vector and the second vector respectively, W1k represents the k-th weight of the first vector, W2k represents the k-th weight of the second vector, both the first vector and the second vector include N eigenvalues, and 1 ≤ k ≤ N; and determine the cosine value as the distance between the first vector and the second vector.
  • The second generating unit is specifically configured to: obtain a preset weight strategy, where the preset weight strategy includes the weight value of each word vector; determine the target weight value of each word vector among the multiple target word vectors; and generate a target character vector according to the multiple target word vectors and their target weight values.
  • The artificial intelligence-based character vector definition device further includes: a third generating unit, configured to input each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model, and perform entity recognition on the Chinese sentence to generate predicted word segmentation.
  • the third generation unit includes: a second input module, configured to input each target word vector as a model parameter into the LSTM and CRF models;
  • the fourth determining module is used to determine the position of each target word vector in the word space;
  • The recognition generation module is used to perform entity recognition on the Chinese sentence according to the position of each target character vector in the character space to generate predicted word segmentation.
  • the third aspect of the embodiments of the present application provides an artificial intelligence-based word vector definition device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • When the processor executes the computer program, the artificial intelligence-based character vector definition method described in any of the above embodiments is implemented.
  • The fourth aspect of the embodiments of the present application provides a non-volatile computer-readable storage medium, including instructions that, when run on a computer, cause the computer to execute the steps of the artificial intelligence-based character vector definition method described above.
  • In the embodiments of the present application, the target text is obtained, including the Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated from the multiple words; and a target character vector is generated from the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
  • A single character is taken as the minimum unit, the combination of character vectors and word vectors is considered, and each character is given multiple different meanings in different words, which increases the accuracy of the character's meaning in the sentence and in turn improves the efficiency of Chinese word segmentation.
  • FIG. 1 is a schematic diagram of an embodiment of a method for defining a word vector based on artificial intelligence in an embodiment of the application;
  • FIG. 2 is a schematic diagram of another embodiment of a method for defining a word vector based on artificial intelligence in an embodiment of the application;
  • FIG. 3 is a schematic diagram of an embodiment of an artificial intelligence-based word vector definition device in an embodiment of the application
  • FIG. 4 is a schematic diagram of another embodiment of an artificial intelligence-based word vector definition device in an embodiment of this application.
  • Fig. 5 is a schematic diagram of an embodiment of an artificial intelligence-based word vector definition device in an embodiment of the application.
  • This application provides an artificial intelligence-based character vector definition method, device, equipment and storage medium, which take a single character as the smallest unit, consider the combination of character vectors and word vectors, and assign each character multiple different meanings in different words, increasing the accuracy of each character's meaning in the sentence and improving the efficiency of Chinese word segmentation.
  • the flowchart of the artificial intelligence-based word vector definition method specifically includes:
  • Obtain the target text, where the target text includes the Chinese sentences that need to be segmented.
  • the server obtains the target text, and the target text includes Chinese sentences that need to be segmented.
  • the Chinese sentence may be "I like Apple”, or “Engineer Fu hits the computer”, etc.
  • the embodiment of the application uses “Engineer Fu hits the computer” as the Chinese sentence for description.
  • The execution subject of this application may be an artificial intelligence-based character vector definition device, a terminal, or a server, which is not specifically limited here.
  • This application takes the server as the execution subject as an example for description.
  • The server uses a preset word segmentation tool, such as a Chinese word segmentation tool or HanLP, to segment the target text and obtain multiple words. For example, if the target text is "I am an algorithm engineer" (我是一个算法工程师), five words are obtained: "我" (I), "是" (am), "一个" (a), "算法" (algorithm), and "工程师" (engineer).
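  • As an illustration of dictionary-based segmentation (not necessarily the tool the patent uses), a toy forward-maximum-matching pass, assuming the example sentence is "我是一个算法工程师" ("I am an algorithm engineer") and the dictionary below:

```python
def fmm_segment(sentence, vocab, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    words, i = [], 0
    while i < len(sentence):
        for L in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + L]
            if L == 1 or piece in vocab:   # single characters always match
                words.append(piece)
                i += L
                break
    return words

vocab = {"一个", "算法", "工程师"}        # illustrative dictionary
print(fmm_segment("我是一个算法工程师", vocab))
# → ['我', '是', '一个', '算法', '工程师']
```

  Real segmenters such as HanLP combine dictionaries with statistical models, but this sketch shows how the five words of the example fall out of a greedy longest-match scan.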
  • The target text "Engineer Fu hits the computer" is segmented using 3-gram or 2-gram segmentation, and it is found that the beginning of the text yields "engineer" and "Chengshi", followed by "hits" and "computer".
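  • The 2-gram and 3-gram windows mentioned above can be enumerated directly. A sketch, assuming the underlying sentence is "工程师傅打电脑" (an assumed rendering of "Engineer Fu hits the computer" that produces the "工程师"/"程师" fragments mentioned):

```python
def char_ngrams(text, n):
    """All contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "工程师傅打电脑"   # assumed source sentence
print(char_ngrams(sentence, 2))  # ['工程', '程师', '师傅', '傅打', '打电', '电脑']
print(char_ngrams(sentence, 3))  # ['工程师', '程师傅', '师傅打', '傅打电', '打电脑']
```

  The ambiguity the patent targets is visible here: the same span supports both "工程师/傅" (engineer / Fu) and "工程/师傅" (engineering / master) readings.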
  • The server defines multiple target word vectors corresponding to the multiple words. Specifically, the server inputs the multiple words into a preset algorithm model; maps each word into a K-dimensional word vector, where K is an integer greater than 0; calculates the distance between the word vectors; determines the semantic similarity between the word vectors according to those distances; determines, according to the semantic similarity between the word vectors, the vector with the highest semantic similarity to the target word among the multiple words as the target word vector; and determines the multiple target word vectors, each corresponding to one word.
  • The server inputs the four words identified above into the preset model, matches the two words "engineer" and "master" in the preset model, and determines the word vector of "engineer" and the word vector of "master".
  • The generated preset model needs to ensure the validity of each word vector. If only "engineering" and "master" exist among the word vectors, then for the sentence "engineering is a good job" the preset model can only match the existing word vectors; if entity recognition is continued, the result "engineer(B) shi(E) is" splits the whole word "engineer", so although a result is obtained, the information is fragmented by the split.
  • target text generally refers to various machine-readable records.
  • the text is represented by D (Document), and the feature item is represented by T (Term).
  • T refers to the basic language unit that is present in document D and can represent the content of the document. It is mainly composed of words or phrases.
  • The text can be represented by a feature set D(T1, T2, ..., Tn), where Tk is a feature item and 1 ≤ k ≤ n.
  • 104. Generate a target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors includes the character corresponding to the target character vector.
  • The server generates a target character vector according to the multiple corresponding word vectors and the preset weighting strategy, where each of the multiple words corresponding to the word vectors contains the character corresponding to the character vector. For example, to define the character vector for "师": the current character vector of "师" is (word vector of "teacher" (老师) + word vector of "master" (师傅) + word vector of "engineer" (工程师)) / 3, which gives the target character vector of "师".
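  • The averaging above can be sketched with toy vectors; the three-dimensional numbers below are purely illustrative:

```python
import numpy as np

# Hypothetical word vectors for three words that contain the character "师"
word_vectors = {
    "老师":  np.array([0.2, 0.8, 0.1]),   # teacher
    "师傅":  np.array([0.4, 0.6, 0.3]),   # master
    "工程师": np.array([0.9, 0.1, 0.5]),  # engineer
}

# Character vector of "师": equally weighted mean of the word vectors (weight 1/3 each)
char_vec = sum(word_vectors.values()) / len(word_vectors)
print(char_vec)  # ≈ [0.5, 0.5, 0.3]
```

  With a non-uniform preset weight strategy, the 1/3 weights would simply be replaced by per-word weight values before summing.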
  • In the embodiment of the application, the target text is obtained, including the Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated from the multiple words; and a target character vector is generated according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector. Taking a single character as the minimum unit and considering the combination of character vectors and word vectors gives each character multiple different meanings in different words, increasing the accuracy of the character vector's meaning in the sentence.
  • FIG. 2 another flowchart of the artificial intelligence-based word vector definition method provided by the embodiment of the present application, which specifically includes:
  • the server obtains the target text, and the target text includes Chinese sentences that need to be segmented.
  • the Chinese sentence may be "I like Apple”, or “Engineer Fu hits the computer”, etc.
  • the embodiment of the application uses “Engineer Fu hits the computer” as the Chinese sentence for description.
  • The execution subject of this application may be an artificial intelligence-based character vector definition device, a terminal, or a server, which is not specifically limited here.
  • This application takes the server as the execution subject as an example for description.
  • The server uses a preset word segmentation tool, such as a Chinese word segmentation tool or HanLP, to segment the target text and obtain multiple words. For example, if the target text is "I am an algorithm engineer" (我是一个算法工程师), five words are obtained: "我" (I), "是" (am), "一个" (a), "算法" (algorithm), and "工程师" (engineer).
  • The target text "Engineer Fu hits the computer" is segmented using 3-gram or 2-gram segmentation, and it is found that the beginning of the text yields "engineer" and "Chengshi", followed by "hits" and "computer".
  • The server defines multiple target word vectors corresponding to the multiple words. Specifically, the server inputs the multiple words into a preset algorithm model; maps each word into a K-dimensional word vector, where K is an integer greater than 0; calculates the distance between the word vectors; determines the semantic similarity between the word vectors according to those distances; determines, according to the semantic similarity between the word vectors, the vector with the highest semantic similarity to the target word among the multiple words as the target word vector; and determines the multiple target word vectors, each corresponding to one word.
  • The server inputs the four words identified above into the preset model, matches the two words "engineer" and "master" in the preset model, and determines the word vector of "engineer" and the word vector of "master".
  • The generated preset model needs to ensure the validity of each word vector. If only "engineering" and "master" exist among the word vectors, then for the sentence "engineering is a good job" the preset model can only match the existing word vectors; if entity recognition is continued, the result "engineer(B) shi(E) is" splits the whole word "engineer", so although a result is obtained, the information is fragmented by the split.
  • target text generally refers to various machine-readable records.
  • the text is represented by D (Document), and the feature item is represented by T (Term).
  • T refers to the basic language unit that is present in the document D and can represent the content of the document. It is mainly composed of words or phrases.
  • The text can be represented by a feature set D(T1, T2, ..., Tn), where Tk is a feature item. For example, if a document has four feature items a, b, c, d, it can be expressed as D(a, b, c, d). For a text containing n feature items, each feature item is usually given a weight to indicate its importance.
  • Wk is the weight of Tk, 1 ≤ k ≤ N.
  • the vector of the text can be expressed as D(30, 20, 20, 10).
  • the content correlation between two texts D1 and D2 Sim(D1, D2) is usually expressed by the cosine value of the angle between the vectors.
  • For example, the feature items of text D1 are a, b, c, d, with weights 30, 20, 20, 10, and the feature items of text C1 are a, c, d, e, with weights 40, 30, 20, 10.
  • Over the combined feature set (a, b, c, d, e), the vector of D1 is represented as D1(30, 20, 20, 10, 0),
  • and the vector of C1 is represented as C1(40, 0, 30, 20, 10).
  • The calculated similarity between text D1 and text C1 is 0.86.
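  • The 0.86 figure can be reproduced directly from the two weight vectors:

```python
import math

# Feature weights over the combined feature set (a, b, c, d, e)
d1 = [30, 20, 20, 10, 0]
c1 = [40, 0, 30, 20, 10]

dot = sum(x * y for x, y in zip(d1, c1))   # 30*40 + 20*0 + 20*30 + 10*20 + 0*10 = 2000
sim = dot / (math.sqrt(sum(x * x for x in d1)) * math.sqrt(sum(y * y for y in c1)))
print(round(sim, 2))  # 0.86
```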
  • Each of the multiple words corresponding to the multiple target word vectors includes the character corresponding to the target character vector.
  • The server generates a target character vector according to the multiple corresponding word vectors and a preset weighting strategy, where each of the multiple words corresponding to the word vectors includes the character corresponding to the character vector.
  • For example, the current character vector of "师" is (word vector of "teacher" (老师) + word vector of "master" (师傅) + word vector of "engineer" (工程师)) / 3, which gives the target character vector of "师".
  • Input each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model, and perform entity recognition on the Chinese sentence to generate predicted word segmentation.
  • The server inputs each target character vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model, and performs entity recognition on the Chinese sentence to generate predictive word segmentation.
  • Specifically, the server inputs each target character vector as a model parameter into the LSTM and CRF models; determines the position of each target character vector in the character space; and performs entity recognition on the Chinese sentence according to those positions, generating the predicted word segmentation.
  • Entity recognition is performed on the Chinese sentence according to the position of each target character vector in the character space, and the process of generating the predicted word segmentation specifically includes:
  • computing the score of a label sequence y = (y1, ..., yn) for the Chinese sentence X with the preset formula s(X, y) = Σ(i=0..n) A(yi, yi+1) + Σ(i=1..n) P(i, yi), where P is the score matrix mapped from the output of the bidirectional LSTM after the fully connected layer, P(i, j) represents the score of the j-th label for the i-th character of the Chinese sentence, the dimension of P is n × k, and k is the number of labels;
  • A represents the transition matrix of the word segmentation labels, and A(i, j) represents the transition score from label i to label j, 1 ≤ i ≤ k, 1 ≤ j ≤ k; determining the probability p of the word segmentation label sequence, which satisfies the formula p(y|X) = exp(s(X, y)) / Σ(y' ∈ YX) exp(s(X, y'));
  • calculating the loss function of probability p, -log p(y|X) = -s(X, y) + log Σ(y' ∈ YX) exp(s(X, y')), where YX represents all label sequences of the Chinese sentence X; the label sequence with the highest score is determined according to the loss function as the predicted word segmentation.
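  • The scoring and probability above can be sketched for a toy label set; the emission and transition values below are random placeholders, and the brute-force enumeration over all k^n label sequences stands in for the Viterbi/forward computations a real CRF layer would use:

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(1)
n, k = 4, 3                          # sentence length and number of labels
P = rng.normal(size=(n, k))          # score matrix from the BiLSTM + fully connected layer
A = rng.normal(size=(k + 2, k + 2))  # transition matrix, with extra start/stop states
START, STOP = k, k + 1

def sequence_score(labels):
    """s(X, y): sum of transition scores plus sum of emission scores."""
    s = A[START, labels[0]] + A[labels[-1], STOP]
    s += sum(A[labels[i], labels[i + 1]] for i in range(len(labels) - 1))
    s += sum(P[i, y] for i, y in enumerate(labels))
    return s

# p(y|X) = exp(s(X, y)) normalized over all label sequences (brute force for the toy case)
all_seqs = list(product(range(k), repeat=n))
scores = np.array([sequence_score(y) for y in all_seqs])
probs = np.exp(scores - scores.max())
probs /= probs.sum()
best = all_seqs[int(np.argmax(scores))]   # highest-scoring label sequence = prediction
```

  Minimizing -log p(y|X) over training data fits P and A; at inference, only the argmax sequence is needed.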
  • In the embodiment of the application, the target text is obtained, including the Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated from the multiple words; and a target character vector is generated according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
  • An embodiment of the artificial intelligence-based character vector definition device includes:
  • the obtaining unit 301 is configured to obtain a target text, the target text including a Chinese sentence that needs to be segmented;
  • the word segmentation unit 302 is configured to segment the target text to obtain multiple words
  • the first generating unit 303 is configured to generate multiple corresponding target word vectors according to the multiple words
  • The second generating unit 304 is configured to generate a target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
  • In the embodiment of the application, the target text is obtained, including the Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated from the multiple words; and a target character vector is generated according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector. Taking a single character as the minimum unit and considering the combination of character vectors and word vectors gives each character multiple different meanings in different words, increasing the accuracy of the character vector's meaning in the sentence.
  • another embodiment of the device for defining a word vector based on artificial intelligence in the embodiment of the present application includes:
  • the obtaining unit 301 is configured to obtain a target text, the target text including a Chinese sentence that needs to be segmented;
  • the word segmentation unit 302 is configured to segment the target text to obtain multiple words
  • the first generating unit 303 is configured to generate multiple corresponding target word vectors according to the multiple words
  • The second generating unit 304 is configured to generate a target character vector according to the multiple target word vectors and a preset weighting strategy, where each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
  • the first generating unit 303 includes:
  • the first input module 30301 is configured to input the multiple words into a preset algorithm model
  • the mapping module 30302 is used to map each word into a K-dimensional word vector, where K is an integer greater than 0;
  • the first calculation module 30303 is used to calculate the distance between each word vector; the first determination module is used to determine the semantic similarity between each word vector according to the distance between each word vector;
  • the second determining module 30304 is configured to determine, according to the semantic similarity between each word vector, the vector with the highest semantic similarity to the target word among the multiple words as the target word vector;
  • the third determining module 30305 is configured to determine multiple target word vectors, and each target word vector corresponds to a word.
  • the first generating unit 303 further includes:
  • the first generating module 30306 is used to randomly generate a word vector matrix, and each row corresponds to a word vector;
  • the extraction module 30307 is configured to determine a target word in the word vector matrix, and extract the word vectors of surrounding words from the word vector matrix;
  • the second calculation module 30308 is configured to calculate the mean vector of the word vectors of the surrounding words
  • the training module 30309 is configured to input the mean vector into a preset logistic regression model for training
  • the second generation module 30310 is configured to generate a preset algorithm model, and the probability vector output by the preset algorithm model matches the one-hot encoding vector of the target word.
  • the first calculation module 30303 is specifically configured to: calculate the cosine value cos(D1, D2) = Σk(W1k × W2k) / (√(Σk W1k²) × √(Σk W2k²)), where D1 and D2 represent the first vector and the second vector respectively, W1k represents the k-th weight of the first vector, W2k represents the k-th weight of the second vector, both the first vector and the second vector include N eigenvalues, and 1 ≤ k ≤ N; the cosine value is determined as the distance between the first vector and the second vector.
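Under the description above, the distance is the standard cosine similarity over the N weights; a small self-contained sketch (the example vectors are arbitrary):

```python
import math

def cosine(d1, d2):
    """Cosine of the angle between two N-dimensional vectors D1 and D2."""
    num = sum(w1 * w2 for w1, w2 in zip(d1, d2))
    den = math.sqrt(sum(w * w for w in d1)) * math.sqrt(sum(w * w for w in d2))
    return num / den

# parallel vectors -> 1.0; orthogonal vectors -> 0.0
print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

A cosine value near 1 therefore indicates high semantic similarity between the two word vectors, which is how the second determining module picks the target word vector.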
  • the second generating unit 304 is specifically configured to:
  • the preset weight strategy includes a weight value for each word vector; the target weight value of each word vector among the multiple target word vectors is determined; and the target character vector is generated according to the multiple target word vectors and their respective target weight values.
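A sketch of this weighting step: the character vector for a single character (here "京") is formed from the target word vectors of the words that contain it, combined by the preset weight values. The words, vectors, and weights below are illustrative assumptions:

```python
import numpy as np

# target word vectors of words that all contain the character "京" (values are illustrative)
target_word_vectors = {
    "北京": np.array([0.8, 0.1, 0.3]),
    "京剧": np.array([0.2, 0.9, 0.4]),
    "南京": np.array([0.5, 0.3, 0.7]),
}
# preset weight strategy: one weight value per word vector (assumed values)
weights = {"北京": 0.5, "京剧": 0.2, "南京": 0.3}

# the target character vector is the weighted sum of the target word vectors
char_vector = sum(weights[w] * v for w, v in target_word_vectors.items())
print(char_vector)  # ≈ [0.59 0.32 0.44]
```

Because each word contributes its own vector, the resulting character vector carries a different blend of meanings depending on which words the character appears in, which is the stated goal of the weighting strategy.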
  • the word vector definition device based on artificial intelligence further includes:
  • the third generating unit 305 is configured to input each target word vector as a model parameter into the long short-term memory (LSTM) network and the conditional random field (CRF) model, and to perform entity recognition on the Chinese sentence to generate predicted word segmentation.
  • the third generating unit 305 specifically includes:
  • the second input module 3051 is used to input each target word vector as a model parameter into the LSTM and CRF models;
  • the fourth determining module 3052 is used to determine the position of each target word vector in the character space;
  • the recognition generating module 3053 is configured to perform entity recognition on the Chinese sentence according to the position of each target word vector in the character space, so as to generate predicted word segmentation.
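A full LSTM-CRF is beyond a short sketch, but the CRF decoding step used in entity recognition (choosing the best tag path from per-character emission scores and tag-to-tag transition scores via the Viterbi algorithm) can be shown in isolation. The random emissions stand in for LSTM outputs; the tag set and dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
tags = ["B", "I", "O"]                 # BIO tags for entity recognition (assumption)
T, n = 4, len(tags)                    # sentence length, number of tags
emissions = rng.normal(size=(T, n))    # stand-in for per-character LSTM scores
transitions = rng.normal(size=(n, n))  # CRF transition scores between tags

def viterbi(emissions, transitions):
    """CRF decoding: the highest-scoring tag path given emission and transition scores."""
    T, n = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        # total[prev, cur] = best score ending in `prev`, then moving to `cur`
        total = score[:, None] + transitions + emissions[t]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):      # backtrace from the best final tag
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(emissions, transitions))  # one BIO tag per character
```

In the device described above, the emission scores would come from running the target word vectors through the LSTM, and the decoded tag sequence yields the predicted word segmentation.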
  • the target text is obtained, and the target text includes Chinese sentences that need to be segmented; the target text is segmented to obtain multiple words; multiple corresponding target word vectors are generated according to the multiple words; a target character vector is generated according to the multiple target word vectors and a preset weighting strategy, and each of the multiple words corresponding to the multiple target word vectors contains the character corresponding to the target character vector.
  • FIG. 5 is a schematic structural diagram of an artificial intelligence-based word vector definition device provided by an embodiment of the present application.
  • the artificial-intelligence-based word vector definition device 500 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPUs) 501 (for example, one or more processors), a memory 509, and one or more storage media 508 (for example, one or more mass storage devices) for storing application programs 507 or data 506.
  • the memory 509 and the storage medium 508 may be short-term storage or persistent storage.
  • the program stored in the storage medium 508 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the artificial intelligence-based word vector definition device.
  • the processor 501 may be configured to communicate with the storage medium 508, and execute a series of instruction operations in the storage medium 508 on the artificial intelligence-based word vector definition device 500.
  • the artificial-intelligence-based word vector definition device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • the processor 501 can perform the functions of the acquiring unit 301, the word segmentation unit 302, the first generating unit 303, the second generating unit 304, and the third generating unit 305 in the foregoing embodiment.
  • the processor 501 is the control center of the artificial intelligence-based word vector definition device, and can perform processing according to the set artificial intelligence-based word vector definition method.
  • the processor 501 uses various interfaces and lines to connect the parts of the entire artificial-intelligence-based word vector definition device, and executes the device's various functions and processes its data by running or executing the software programs and/or modules stored in the memory 509 and calling the data stored in the memory 509, converting unreadable labels in the message domain into readable labels and thereby enabling rapid identification of the application scenarios in a message.
  • the storage medium 508 and the memory 509 are both carriers for storing data. In the embodiment of the present application, the storage medium 508 may refer to an internal memory with a small storage capacity but high speed, while the memory 509 may refer to an external memory with a large storage capacity but a slower speed.
  • the memory 509 can be used to store software programs and modules, and the processor 501 executes various functional applications and data processing of the word vector definition device 500 based on artificial intelligence by running the software programs and modules stored in the memory 509.
  • the memory 509 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system and an application program required by at least one function (for example, generating a target character vector according to multiple target word vectors and a preset weight strategy);
  • the storage data area can store data (such as multiple target word vectors) created according to the use of artificial intelligence-based word vector definition equipment.
  • the memory 509 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the artificial-intelligence-based word vector definition method program and the received data stream provided in the embodiment of the present application are stored in the memory 509 and are called from it by the processor 501 when needed.
  • the present application also provides a non-volatile computer-readable storage medium, including instructions, which when run on a computer, cause the computer to execute the following steps of the artificial intelligence-based word vector definition method:
  • the target text is obtained, where the target text includes Chinese sentences that require word segmentation;
  • a target character vector is generated according to the multiple target word vectors and a preset weighting strategy, and each of the multiple words corresponding to the multiple target word vectors includes the character corresponding to the target character vector.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or twisted pair) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a solid state disk (SSD)).
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the field of artificial intelligence, and in particular to the field of word segmentation models. Disclosed are an artificial-intelligence-based character vector definition method, apparatus and device, and a storage medium, which are used to give each character a plurality of different meanings in different words by taking a single character as the minimum structure, thereby improving the accuracy of the meaning of a character vector within a sentence. The method of the present application comprises: acquiring a target text, the target text comprising a Chinese sentence on which word segmentation is to be performed; performing word segmentation on the target text to obtain a plurality of words; generating a plurality of corresponding target word vectors according to the plurality of words; and generating target character vectors according to the plurality of target word vectors and a preset weight policy, each word among the plurality of words corresponding to the plurality of target word vectors comprising the characters corresponding to the target character vectors.
PCT/CN2019/102462 2019-06-04 2019-08-26 Artificial intelligence-based character vector definition method, apparatus, device and storage medium WO2020244065A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910483399.6A CN110298035B (zh) 2019-06-04 2019-06-04 Artificial intelligence-based character vector definition method, apparatus, device and storage medium
CN201910483399.6 2019-06-04

Publications (1)

Publication Number Publication Date
WO2020244065A1 true WO2020244065A1 (fr) 2020-12-10

Family

ID=68027590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102462 WO2020244065A1 (fr) 2019-06-04 2019-08-26 Artificial intelligence-based character vector definition method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN110298035B (fr)
WO (1) WO2020244065A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861531A (zh) * 2021-03-22 2021-05-28 北京小米移动软件有限公司 Word segmentation method and apparatus, storage medium, and electronic device

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
CN110928936B (zh) * 2019-10-18 2023-06-16 平安科技(深圳)有限公司 Reinforcement-learning-based information processing method, apparatus, device, and storage medium
CN110797005B (zh) * 2019-11-05 2022-06-10 百度在线网络技术(北京)有限公司 Prosody prediction method, apparatus, device, and medium
CN111079442B (zh) * 2019-12-20 2021-05-18 北京百度网讯科技有限公司 Document vectorized representation method, apparatus, and computer device
CN113051918B (zh) * 2019-12-26 2024-05-14 北京中科闻歌科技股份有限公司 Ensemble-learning-based named entity recognition method, apparatus, device, and medium
CN112016313B (zh) * 2020-09-08 2024-02-13 迪爱斯信息技术股份有限公司 Colloquial element recognition method and apparatus, and police situation analysis system
CN112183111A (zh) * 2020-09-28 2021-01-05 亚信科技(中国)有限公司 Long-text semantic similarity matching method, apparatus, electronic device, and storage medium
CN113282749A (zh) * 2021-05-20 2021-08-20 北京明略软件系统有限公司 Conversation emotion classification method, system, electronic device, and storage medium
CN113343669A (zh) * 2021-05-20 2021-09-03 北京明略软件系统有限公司 Character vector learning method, system, electronic device, and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
US20170091318A1 (en) * 2015-09-29 2017-03-30 Kabushiki Kaisha Toshiba Apparatus and method for extracting keywords from a single document
CN107273355A (zh) * 2017-06-12 2017-10-20 大连理工大学 Chinese word vector generation method based on joint character-word training
CN107688604A (zh) * 2017-07-26 2018-02-13 阿里巴巴集团控股有限公司 Data response processing method, apparatus, and server
CN109063035A (zh) * 2018-07-16 2018-12-21 哈尔滨工业大学 Human-machine multi-turn dialogue method for the travel domain

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN106569998A (zh) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN108509408B (zh) * 2017-02-27 2019-11-22 芋头科技(杭州)有限公司 Sentence similarity judgment method
CN107168952B (zh) * 2017-05-15 2021-06-04 北京百度网讯科技有限公司 Artificial-intelligence-based information generation method and apparatus
CN108132931B (zh) * 2018-01-12 2021-06-25 鼎富智能科技有限公司 Text semantic matching method and apparatus
CN108717409A (zh) * 2018-05-16 2018-10-30 联动优势科技有限公司 Sequence labeling method and apparatus
CN109271637B (zh) * 2018-09-30 2023-12-01 科大讯飞股份有限公司 Semantic understanding method and apparatus


Non-Patent Citations (1)

Title
LI, WEIKANG ET AL.: "Combination Methods of Chinese Character and Word Embeddings in Deep Learning", JOURNAL OF CHINESE INFORMATION PROCESSING, vol. 31, no. 6, 30 November 2017 (2017-11-30), pages 140 - 146, XP055765583 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN112861531A (zh) * 2021-03-22 2021-05-28 北京小米移动软件有限公司 Word segmentation method and apparatus, storage medium, and electronic device
CN112861531B (zh) * 2021-03-22 2023-11-14 北京小米移动软件有限公司 Word segmentation method and apparatus, storage medium, and electronic device

Also Published As

Publication number Publication date
CN110298035A (zh) 2019-10-01
CN110298035B (zh) 2023-12-01

Similar Documents

Publication Publication Date Title
WO2020244065A1 (fr) Artificial intelligence-based character vector definition method, apparatus, device and storage medium
US11610384B2 (en) Zero-shot object detection
CN108959246B (zh) 基于改进的注意力机制的答案选择方法、装置和电子设备
WO2022022163A1 (fr) Procédé d'apprentissage de modèle de classification de texte, dispositif, appareil, et support de stockage
US20200242444A1 (en) Knowledge-graph-embedding-based question answering
KR101754473B1 (ko) 문서를 이미지 기반 컨텐츠로 요약하여 제공하는 방법 및 시스템
CN110377916B (zh) 词预测方法、装置、计算机设备及存储介质
JP2021152963A (ja) 語義特徴の生成方法、モデルトレーニング方法、装置、機器、媒体及びプログラム
JP6848091B2 (ja) 情報処理装置、情報処理方法、及びプログラム
WO2018196718A1 (fr) Procédé et dispositif de désambiguïsation d'image, support de stockage et dispositif électronique
WO2022174496A1 (fr) Procédé et appareil d'annotation de données basés sur un modèle génératif, dispositif et support de stockage
US11461613B2 (en) Method and apparatus for multi-document question answering
CN113806582B (zh) 图像检索方法、装置、电子设备和存储介质
WO2014073206A1 (fr) Dispositif de traitement de données, et procédé pour le traitement de données
US20240152770A1 (en) Neural network search method and related device
EP3683694A1 (fr) Dispositif et procédé de déduction de relation sémantique entre des mots
CN109271624B (zh) 一种目标词确定方法、装置及存储介质
CN112818091A (zh) 基于关键词提取的对象查询方法、装置、介质与设备
CN112183083A (zh) 文摘自动生成方法、装置、电子设备及存储介质
CN114995903B (zh) 一种基于预训练语言模型的类别标签识别方法及装置
CN107861948B (zh) 一种标签提取方法、装置、设备和介质
WO2022141872A1 (fr) Procédé et appareil de génération de résumé de document, dispositif informatique et support de stockage
WO2022228127A1 (fr) Procédé et appareil de traitement de texte d'élément, dispositif électronique et support de stockage
JP7291181B2 (ja) 業界テキスト増分方法、関連装置、およびコンピュータプログラム製品
EP4060526A1 (fr) Procédé et dispositif de traitement de texte

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19932088

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19932088

Country of ref document: EP

Kind code of ref document: A1