WO2021134524A1 - Data processing method, apparatus, electronic device and storage medium - Google Patents

Data processing method, apparatus, electronic device and storage medium

Publication number
WO2021134524A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
word
candidate
dictionary
document
Prior art date
Application number
PCT/CN2019/130650
Other languages
English (en)
Chinese (zh)
Inventor
朱会峰
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司, Oppo广东移动通信有限公司 filed Critical 深圳市欢太科技有限公司
Priority to PCT/CN2019/130650 priority Critical patent/WO2021134524A1/fr
Priority to CN201980101007.3A priority patent/CN114556328B/zh
Publication of WO2021134524A1 publication Critical patent/WO2021134524A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri

Definitions

  • This application relates to simultaneous interpretation technology, in particular to a data processing method, device, electronic equipment and storage medium.
  • speech recognition technology has been widely used as the entrance to speech interaction, especially for simultaneous interpretation systems.
  • the effect of speech recognition has been greatly improved, and related application fields are also increasing, including technology, finance, tourism, medical care, insurance and other fields.
  • embodiments of the present application provide a data processing method, device, electronic equipment, and storage medium.
  • the embodiment of the application provides a data processing method, including:
  • the language model is used to perform text recognition on voice data including the target keyword or the target word to obtain a recognition result; the recognition result is used to present the voice data when the voice data is played.
  • the training of the language model using the updated dictionary includes:
  • the method further includes:
  • a corpus corresponding to the dictionary is generated.
  • the updating the dictionary according to the at least one target word includes:
  • the dictionary is updated according to the at least one target word and the pronunciation of each target word in the at least one target word.
  • the obtaining the target keywords from the target document includes:
  • the use of candidate keywords that meet the first preset condition in the candidate keyword list as the target keywords includes:
  • the word correlation represents the correlation between the corresponding candidate keyword and other candidate keywords in the candidate keyword list
  • the candidate keywords whose word relevance exceeds a first preset threshold in the candidate keyword list are used as the target keywords.
  • the determining at least one target word from the webpage document includes:
  • Screen at least one word obtained from the web page document, and generate a candidate target word list according to the word obtained after screening;
  • the candidate target words that meet the second preset condition in the candidate target word list are used as the target words.
  • the step of using the candidate target words that meet the second preset condition in the candidate target word list as the target words includes:
  • the candidate target words in the candidate target word list whose relevance exceeds a second preset threshold and do not belong to the dictionary are used as target words.
  • the embodiment of the present application also provides a data processing device, including:
  • the obtaining unit is configured to obtain target keywords from the target document
  • the first processing unit is configured to obtain a related web document according to the target keyword, and determine at least one target word from the web document; the target word does not belong to a preset dictionary;
  • the second processing unit is configured to update the dictionary according to the at least one target word, and use the updated dictionary to train a language model;
  • the language model is used to perform text recognition on voice data including the target keyword or the target word to obtain a recognition result; the recognition result is used to present the voice data when the voice data is played; the target document is related to the voice data.
  • the embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • the processor implements the steps of any of the foregoing data processing methods when the program is executed.
  • the embodiment of the present application also provides a storage medium on which computer instructions are stored, and when the instructions are executed by a processor, the steps of any of the foregoing data processing methods are implemented.
  • the data processing method, device, electronic device, and storage medium provided by the embodiments of the present application obtain target keywords from a target document; obtain related webpage documents according to the target keywords, and determine at least one target word from the webpage documents;
  • the target word does not belong to a preset dictionary;
  • the dictionary is updated according to the at least one target word, and the updated dictionary is used to train a language model;
  • the language model is used to perform text recognition on voice data including the target keywords or the target word to obtain a recognition result; the recognition result is used to present the voice data when the voice data is played.
  • in this way, professional vocabulary (i.e., target words) is added to the dictionary, and the language model trained on the updated dictionary can be used for text recognition in the corresponding professional field, avoiding problems such as unrecognized or misrecognized vocabulary and improving recognition accuracy.
  • Figure 1 is a schematic diagram of the system architecture of the application of simultaneous interpretation methods in related technologies
  • FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of the application
  • FIG. 3 is a schematic diagram of another flow chart of a data processing method according to an embodiment of the application.
  • FIG. 4 is a schematic flowchart of still another data processing method according to an embodiment of the application.
  • FIG. 5 is a schematic diagram of the composition structure of a data processing device according to an embodiment of the application.
  • FIG. 6 is a schematic diagram of the composition structure of an electronic device according to an embodiment of the application.
  • Figure 1 is a schematic diagram of the system architecture in which the simultaneous interpretation method is applied in the related technology; as shown in Figure 1, the system may include: a machine simultaneous interpretation server, a voice processing server, a viewer mobile terminal, a personal computer (PC) client, and a display screen.
  • the lecturer can give conference lectures through the PC client.
  • the PC client collects the lecturer's voice data and sends the collected voice data to the machine simultaneous interpretation server.
  • the machine simultaneous interpretation server recognizes the voice data through the voice processing server and obtains the recognition result (the recognition result may be recognized text in the same language as the voice data, or translated text in other languages obtained by translating the recognized text).
  • the machine simultaneous interpretation server can send the recognition result to the PC client, and the PC client projects the recognition result onto the display screen; it can also send the recognition result to the viewer's mobile terminal (specifically, sending the recognition result in the language required by the user) to show the recognition result to the user.
  • the system can realize the translation of the lecture content of the lecturer into the language required by the user and display it.
  • the voice processing server may use the acoustic model to perform voice recognition on the voice data to obtain the recognition result.
  • simultaneous interpretation in a specific field requires collecting and labeling a large amount of speech and text corpora and conducting supervised model training.
  • the commonly used method is acoustic model adaptation. This requires obtaining a certain amount of speech data and recognizing it to obtain a first recognition result; further acoustic model training is then performed based on that recognition result to improve recognition for a specific professional field or speaker.
  • target keywords are obtained from a target document; related webpage documents are obtained according to the target keywords, and at least one target word (specifically, a new word in a certain professional field) is determined from the webpage documents; the target word does not belong to the preset dictionary; the dictionary is updated according to the at least one target word, and the language model is trained using the updated dictionary. In this way, the language model obtained by training on the updated dictionary performs text recognition on the speech data in the corresponding professional field, which avoids vocabulary recognition failures and recognition errors in that field and improves recognition accuracy.
  • FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of the application; as shown in FIG. 2, the method includes:
  • Step 201 Obtain target keywords from the target document.
  • Step 202 Obtain a related webpage document according to the target keyword, and determine at least one target word from the webpage document; the target word does not belong to a preset dictionary.
  • Step 203 Update the dictionary according to the at least one target word, and train a language model using the updated dictionary;
  • the language model is used to perform text recognition on voice data including the target keyword or the target word to obtain a recognition result; the recognition result is used to present the voice data when the voice data is played.
  • the recognition result is used to present the voice data when the voice data is played, which means that the recognition result is presented while the voice data is being played; that is, the data processing method is applied to the scene of simultaneous interpretation.
  • the text recognition may include the speech recognition, text translation, and the like.
  • the voice recognition includes recognizing voice data to obtain a recognized text corresponding to the voice data; the language corresponding to the recognized text is the same as the language corresponding to the voice data.
  • the text translation includes translating the recognized text to obtain recognized texts in other languages.
  • the recognition result may include recognition text in at least one language, that is, including recognition text in the same language or recognized text in other languages.
  • when the speaker is giving a speech, the first terminal (the PC shown in Figure 1) uses a voice collection module to collect the content of the speech in real time to obtain the voice data to be processed; a communication connection may be established between the first terminal and a server for realizing simultaneous interpretation, and the first terminal sends the acquired voice data to that server.
  • the server for realizing simultaneous interpretation can obtain the voice data to be processed in real time.
  • the server performs voice recognition on the voice data to be processed, obtains and presents the recognition result, that is, realizes that the recognition result is presented while the voice data is being played.
  • the simultaneous interpretation scene may adopt the system architecture shown in FIG. 1, and the method in the embodiment of the present application is applied to an electronic device.
  • the electronic device may be a server, a mobile terminal, or the like.
  • the mobile terminal may be a PC, a tablet computer, a mobile phone, or the like.
  • the electronic device may be an electronic device newly added to the system architecture of FIG. 1 to implement the solution of the embodiment of the present application (that is, the method shown in FIG. 2) and send the trained language model to the speech processing server shown in FIG. 1, so that the speech processing server can use the language model obtained by training to perform speech recognition.
  • the electronic device may also be an improvement to a device in the architecture of FIG. 1 to be able to implement the method of the embodiment of the present application.
  • the electronic device may be an improvement to the voice processing server in the architecture of FIG. 1 so that it can implement the solution of the embodiment of the present application and obtain a language model through training, so that the voice processing server can perform text recognition through the language model obtained through training.
  • the target keywords in the corresponding professional field are used as the criteria for mining new words, and a method for determining the target keywords is provided.
  • the obtaining the target keyword from the target document includes:
  • the data processing method can be applied to a simultaneous interpretation scene in a conference, and the target document refers to a text describing a related technology in a certain professional field.
  • the target document may be a document demonstrated in a conference (such as a technical seminar).
  • the format of the document is not limited; the document may be, for example, a PowerPoint presentation (PPT) or a Word document.
  • the target document may also be other documents in a corresponding professional field, for example, documents presented in other conferences.
  • the target document may also be a document storing professional vocabulary in a corresponding professional field.
  • the data processing method provided in this embodiment can be performed in advance of simultaneous interpretation (that is, before text recognition is performed using a language model); that is, model training is performed in advance based on the target document to obtain a language model for a specific professional field;
  • the data processing method can also be carried out in the process of simultaneous interpretation
  • the data processing method may also be performed after a certain simultaneous interpretation, and the obtained language model is used to prepare for the next simultaneous interpretation in the corresponding professional field.
  • screening at least one word obtained from the target document includes:
  • the electronic device may perform text cleaning, sentence segmentation, and text normalization processing on the target document, and then perform word segmentation based on each sentence obtained, thereby obtaining a word segmentation result.
  • the word segmentation result includes: at least one word.
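  • as an illustrative sketch of this preprocessing chain (not the embodiment's actual implementation), the Python below cleans a document, splits it into sentences, normalizes each sentence, and segments it into words. The tag-stripping rule, the NFKC/lowercase normalization, and the regex tokenizer are all assumptions; a Chinese-language document would need a dedicated word segmenter in place of the hypothetical `tokenize` helper.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Remove markup-like tags and collapse whitespace (assumed cleaning rules)."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def split_sentences(text: str) -> list[str]:
    """Split after Western sentence punctuation followed by whitespace,
    or directly after CJK sentence punctuation."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])\s*", text)
    return [part.strip() for part in parts if part.strip()]


def normalize(sentence: str) -> str:
    """Unify character forms (NFKC) and case; an assumed normalization."""
    return unicodedata.normalize("NFKC", sentence).lower()


def tokenize(sentence: str) -> list[str]:
    """Naive tokenizer for space-delimited languages (placeholder segmenter)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence)


def segment_document(text: str) -> list[list[str]]:
    """Return one list of words per sentence of the cleaned document."""
    sentences = split_sentences(clean_text(text))
    return [tokenize(normalize(sentence)) for sentence in sentences]
```

  • for example, `segment_document("<p>Edge computing is growing.</p> 5G helps!")` yields two word lists, one per sentence, which can then be passed to the screening stage.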
  • the text cleaning includes at least one of the following:
  • the screening of at least one word obtained from the target document includes:
  • a list of candidate keywords is generated.
  • the stop word list may be preset.
  • the stop word list may include conventional stop words, the stop words represent words that indicate pauses during speech, modal auxiliary words, etc., which usually have no clear meaning by themselves;
  • the stop words may include: this, this, etc.
  • the stop word list may also include: words that the user wants to filter out and will not become target words.
  • the target part of speech may include: verbs, nouns, and so on.
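  • a minimal sketch of this screening step, assuming the words already carry part-of-speech labels from some tagger (the tagger, the label names `"noun"` and `"verb"`, and the function name `screen_words` are hypothetical): words on the stop word list or outside the target parts of speech are dropped, and the survivors form the candidate keyword list.

```python
from typing import Iterable


def screen_words(
    tagged_words: Iterable[tuple[str, str]],
    stop_words: set[str],
    target_pos: frozenset[str] = frozenset({"noun", "verb"}),
) -> list[str]:
    """Build a candidate list from (word, part_of_speech) pairs.

    Keeps first-seen order and removes duplicates.
    """
    candidates: list[str] = []
    seen: set[str] = set()
    for word, pos in tagged_words:
        if word in stop_words:
            continue  # filtered by the preset stop word list
        if pos not in target_pos:
            continue  # filtered by part of speech
        if word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates
```

  • the stop word set can be extended with any words the user wants to exclude, which matches the role of the preset stop word list described above.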
  • the using candidate keywords in the candidate keyword list that meet the first preset condition as the target keyword includes:
  • the word correlation represents the correlation between the corresponding candidate keyword and other candidate keywords in the candidate keyword list
  • the candidate keywords whose word relevance exceeds a first preset threshold in the candidate keyword list are used as the target keywords.
  • the first preset threshold may be preset and saved by the developer.
  • the first preset threshold is set based on the accuracy requirement of target word extraction, and the higher the accuracy requirement, the higher the first preset threshold.
  • the accuracy rate may characterize the relevance to the target professional field, and the higher the accuracy rate, the greater the relevance to the target professional field.
  • the target professional field refers to the professional field corresponding to the target document, that is, the professional field in which new word mining is required.
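  • the threshold-based selection can be sketched as follows. The relevance measure used here (the mean Jaccard overlap between the sets of sentences in which two candidates occur) is an assumption chosen only for illustration, since any word correlation calculation may be used; the names `word_relevance` and `select_keywords` are likewise hypothetical.

```python
def word_relevance(x: str, candidates: list[str], sentences: list[list[str]]) -> float:
    """Relevance of candidate x to the other candidates in the list.

    Assumed measure: average Jaccard overlap of sentence-occurrence sets.
    """
    occurrences = {
        word: {i for i, sentence in enumerate(sentences) if word in sentence}
        for word in candidates
    }
    others = [y for y in candidates if y != x]
    if not others:
        return 0.0
    total = 0.0
    for y in others:
        union = occurrences[x] | occurrences[y]
        if union:
            total += len(occurrences[x] & occurrences[y]) / len(union)
    return total / len(others)


def select_keywords(candidates: list[str], sentences: list[list[str]], threshold: float) -> list[str]:
    """Keep the candidates whose relevance exceeds the preset threshold."""
    return [x for x in candidates if word_relevance(x, candidates, sentences) > threshold]
```

  • raising `threshold` keeps only candidates that co-occur strongly with the rest of the list, which corresponds to a stricter accuracy requirement.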
  • step 202 the obtaining related webpage documents according to the target keywords includes:
  • a web crawler is used to obtain webpage documents related to the target keyword.
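  • a hedged sketch of keyword-driven page retrieval using only the Python standard library. The search endpoint `https://search.example.com/search` and its `q` parameter are placeholders, not a real service, and the `fetch` argument is injectable so the logic can be exercised without network access; a production crawler would also need politeness rules and link following, which are omitted here.

```python
from typing import Callable
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "https://search.example.com/search"  # hypothetical endpoint


def build_query_url(keyword: str) -> str:
    """URL-encode the keyword into a search request."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': keyword})}"


def http_fetch(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it using the declared charset."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(keywords: list[str], fetch: Callable[[str], str] = http_fetch) -> dict[str, str]:
    """Return raw page content per keyword, skipping keywords whose fetch fails."""
    pages: dict[str, str] = {}
    for keyword in keywords:
        try:
            pages[keyword] = fetch(build_query_url(keyword))
        except OSError:  # urllib's URLError is a subclass of OSError
            continue
    return pages
```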
  • the determining at least one target word from the webpage document includes:
  • Screen at least one word obtained from the web page document, and generate a candidate target word list according to the word obtained after screening;
  • the candidate target words that meet the second preset condition in the candidate target word list are used as the target words.
  • the step of using candidate target words that meet a second preset condition in the candidate target word list as the target words includes:
  • the candidate target words in the candidate target word list whose relevance exceeds a second preset threshold and do not belong to the dictionary are used as target words.
  • the second preset threshold may be preset and saved by the developer.
  • the second preset threshold is set based on the accuracy requirement of target word extraction, and the higher the accuracy requirement, the higher the second preset threshold.
  • the accuracy rate may characterize the relevance to the target professional field, and the higher the accuracy rate, the greater the relevance to the target professional field.
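  • the second preset condition combines two tests: the relevance must exceed the second preset threshold, and the word must not already be in the dictionary. A minimal sketch, assuming the relevance scores have already been computed by whatever measure is chosen (the function name and data shapes are hypothetical):

```python
def select_target_words(
    relevance: dict[str, float],
    dictionary: set[str],
    threshold: float,
) -> list[str]:
    """Keep candidate target words that are relevant enough and new to the dictionary."""
    return [
        word
        for word, score in relevance.items()
        if score > threshold and word not in dictionary
    ]
```

  • with scores `{"kubernetes": 0.8, "cluster": 0.7, "weather": 0.1}`, a dictionary containing `"cluster"`, and a threshold of 0.5, only `"kubernetes"` is returned: `"cluster"` is already known and `"weather"` is not relevant enough.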
  • word correlation calculation described above can use any word correlation calculation method, which is not limited here.
  • word relevance uses the following formula to calculate word relevance:
  • x represents a candidate target word for word relevance calculation
  • Y represents all words in the candidate target word list
  • the word segmentation of the webpage document can be performed in the same manner as for the target document; that is, after text cleaning, sentence segmentation, and text normalization are performed on the webpage document, word segmentation is performed based on each sentence obtained, thereby obtaining a word segmentation result that includes at least one word.
  • webpage documents are written in HyperText Markup Language (HTML) and may contain escaped HTML entities; the HTMLParser of Python can convert these entities back into standard HTML characters. For example: convert "&lt;" to "<".
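  • in current Python versions, entity decoding is available as `html.unescape` in the standard library (the older `HTMLParser.unescape` method was removed in Python 3.9). The sketch below shows one way to apply it during web page cleaning; the regex-based tag stripping and the helper name `extract_text` are assumptions, not part of the described method.

```python
import html
import re


def extract_text(markup: str) -> str:
    """Strip tags, then decode HTML entities, then collapse whitespace.

    Tags are removed before decoding so that an escaped "&lt;b&gt;" in the
    page text is not mistaken for real markup.
    """
    without_tags = re.sub(r"<[^>]+>", " ", markup)
    decoded = html.unescape(without_tags)
    return re.sub(r"\s+", " ", decoded).strip()
```

  • for instance, `extract_text("<p>x &lt; y &amp; z</p>")` returns `"x < y & z"`.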
  • the text cleaning for the webpage document can also be performed by using the above-mentioned method for text cleaning the target document.
  • the text cleaning of web documents may also include at least one of the following:
  • the at least one word obtained from a web page document is noisy; if it is used directly to mine new words, the accuracy of the mined new words will be low. Therefore, the at least one word obtained is filtered.
  • the screening of at least one word obtained from a web page document includes:
  • a list of candidate target words is generated.
  • the stop word list may be preset.
  • the stop word list may include conventional stop words, the stop words represent words that indicate pauses during speech, modal auxiliary words, etc., which usually have no clear meaning by themselves;
  • the stop words may include: this, this, etc.
  • the stop word list may also include: words that the user wants to filter out and will not become target words;
  • the target part of speech may include: verbs, nouns, and so on.
  • language model training is performed based on the updated dictionary, so that a language model that can recognize the various vocabulary of the corresponding professional field can be obtained.
  • step 203 the training a language model using the updated dictionary includes:
  • the dictionary includes basic words required for language modeling.
  • the dictionary is updated, and the updated dictionary may also include: professional vocabulary for a specific professional field.
  • the language model obtained by using the updated dictionary training can accurately recognize the voice data in the above-mentioned specific professional field.
  • the preset second language model may be a pre-trained or acquired general language model; the second language model is interpolated with the language model of a specific professional field (i.e., the first language model) to obtain the interpolated language model (that is, the first language model and the second language model are merged to obtain the merged language model).
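  • the interpolation can be illustrated with a toy unigram model. The text does not specify the interpolation scheme, so the linear mixture P(w) = weight * P_first(w) + (1 - weight) * P_second(w) used below is an assumption, as are the function names and the `weight` parameter; practical systems interpolate n-gram models with a language modeling toolkit rather than plain dictionaries.

```python
from collections import Counter


def train_unigram(corpus: list[list[str]], dictionary: set[str]) -> dict[str, float]:
    """Estimate unigram probabilities over words that appear in the dictionary."""
    counts = Counter(word for sentence in corpus for word in sentence if word in dictionary)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def interpolate(
    first_lm: dict[str, float],
    second_lm: dict[str, float],
    weight: float = 0.5,
) -> dict[str, float]:
    """Linearly mix a domain model (first) with a general model (second)."""
    vocabulary = set(first_lm) | set(second_lm)
    return {
        word: weight * first_lm.get(word, 0.0) + (1.0 - weight) * second_lm.get(word, 0.0)
        for word in vocabulary
    }
```

  • a larger `weight` favors the domain-specific model, so new professional vocabulary receives more probability mass, while the general model still covers everyday words.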
  • because the dictionary is updated, it is necessary to obtain a corpus corresponding to the updated dictionary for language model training.
  • the method further includes:
  • a corpus corresponding to the dictionary is generated.
  • the web crawler is a program or script that automatically crawls World Wide Web information according to certain rules.
  • the words in the dictionary must have pronunciations. Therefore, in the embodiment of this application, after the target word is determined, the pronunciation of the word needs to be further determined, so that the language model obtained by training based on the dictionary can perform speech recognition.
  • the updating the dictionary according to the at least one target word includes:
  • the dictionary is updated according to the at least one target word and the pronunciation of each target word in the at least one target word.
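  • a minimal sketch of such an update, assuming the dictionary is a mapping from each word to its pronunciation as a list of phone symbols. How pronunciations are determined is not specified here, so the `pronounce` callable is a hypothetical hook, and `spell_out` is only a placeholder that spells a word letter by letter.

```python
from typing import Callable


def update_dictionary(
    lexicon: dict[str, list[str]],
    target_words: list[str],
    pronounce: Callable[[str], list[str]],
) -> dict[str, list[str]]:
    """Return a copy of the lexicon extended with the target words and their pronunciations."""
    updated = dict(lexicon)
    for word in target_words:
        if word not in updated:  # existing entries keep their pronunciation
            updated[word] = pronounce(word)
    return updated


def spell_out(word: str) -> list[str]:
    """Placeholder pronunciation: one symbol per letter (not a real G2P model)."""
    return list(word.upper())
```

  • returning a new mapping leaves the preset dictionary untouched, which makes it easy to compare the general and updated dictionaries.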
  • the target document can be in any language, and the web page document can be in any language;
  • the obtained language model can be used to perform text recognition on voice data in any language to obtain a recognition result in any language.
  • the solution of the embodiment of this application adopts a vocabulary-adaptive method: it uses keywords in documents and a search crawler to obtain corpus in related fields, and obtains dictionaries and language models for specific fields through new word discovery and adaptation techniques.
  • this improves the recognition of professional vocabulary, effectively addressing the low recognition rate of professional vocabulary and the cost of manually annotating large amounts of corpus for model updates in specific domains.
  • the embodiment of the application proposes a vocabulary-adaptive simultaneous interpretation method that uses keyword extraction, a search crawler, new word discovery, and model adaptation to effectively extract professional vocabulary in a specific field and improve its recognition.
  • FIG. 3 is a schematic diagram of another flow chart of a data processing method according to an embodiment of the application; the method is applied to an electronic device, as shown in FIG. 3, the method includes:
  • Step 301 Obtain a given simultaneous interpretation presentation document, and use a keyword extraction method to obtain a keyword list based on the presentation document.
  • the use of the keyword extraction method to obtain the keyword list includes:
  • the filtering of the word segmentation set includes:
  • a candidate keyword list is generated according to the candidate keywords.
  • the target part of speech may include nouns and verbs.
  • any word correlation calculation method can be used to perform word correlation calculation.
  • the word correlation calculation may include:
  • x represents candidate keywords
  • Y is a full list of candidate keywords
  • Y includes all candidate keywords.
  • the generating a keyword list based on words whose word relevance exceeds a preset threshold includes:
  • presentation document is equivalent to the target document described in the method in FIG. 2; operations performed on the presentation document can refer to the method shown in FIG. 2, which will not be repeated here.
  • Step 302 Crawling related webpage documents through the Internet based on the keywords in the obtained keyword list.
  • the step 302 includes:
  • Step 303 Obtain a new word list based on the web page document, and merge the words in the new word list with the general dictionary to obtain the merged dictionary.
  • the obtaining a list of new words based on the webpage document includes:
  • Screen at least one word obtained from the web document, and generate a list of candidate new words according to the word obtained after screening;
  • a new word list is obtained.
  • the new word is equivalent to the target word in the method shown in FIG. 2.
  • Step 304 Train a first language model using the merged dictionary, and perform interpolation processing on the first language model and the second language model to obtain a model-adapted language model.
  • the obtained language model after model adaptation can be used for speech recognition in the corresponding professional field.
  • the model adaptation refers to: interpolating a well-trained general language model (i.e., the second language model) with a domain-specific language model (i.e., the first language model) to obtain a language model for simultaneous interpretation in that domain (that is, the professional domain corresponding to the new words in the new word list).
  • Figure 4 is a schematic flowchart of another data processing method according to an embodiment of the application. As shown in Figure 4, the method mainly includes keyword extraction, web crawler acquisition of web pages, new word discovery, dictionary adaptation, and language model adaptation.
  • the data processing method includes:
  • Interpolating the first language model and the general model to obtain a model-adapted language model, and using the model-adapted language model for speech recognition can improve the recognition effect of professional vocabulary in a specific professional field.
  • the data processing method can be applied to electronic equipment.
  • the electronic device may include: a server, a mobile terminal, and so on.
  • FIG. 5 is a schematic diagram of the composition structure of a data processing device according to an embodiment of the application; as shown in FIG. 5, the data processing device includes:
  • the obtaining unit 51 is configured to obtain target keywords from the target document
  • the first processing unit 52 is configured to obtain a related webpage document according to the target keyword, and determine at least one target word from the webpage document; the target word does not belong to a preset dictionary;
  • the second processing unit 53 is configured to update the dictionary according to the at least one target word, and use the updated dictionary to train a language model;
  • the language model is used to perform text recognition on voice data including the target keyword or the target word to obtain a recognition result; the recognition result is used to present the voice data when the voice data is played.
  • the second processing unit 53 is configured to use the updated dictionary and the corpus corresponding to the dictionary to perform model training to obtain a first language model
  • the second processing unit 53 is further configured to use a web crawler to obtain the corpus corresponding to each word in the updated dictionary;
  • a corpus corresponding to the dictionary is generated.
  • the second processing unit 53 is configured to determine the pronunciation of each target word in the at least one target word
  • the dictionary is updated according to the at least one target word and the pronunciation of each target word in the at least one target word.
  • the obtaining unit 51 is configured to obtain a target document
  • the obtaining unit 51 is configured to perform word correlation calculation for each candidate keyword in the candidate keyword list; the word correlation characterizes the correlation between the corresponding candidate keyword and the other candidate keywords in the candidate keyword list;
  • the candidate keywords whose word relevance exceeds a first preset threshold in the candidate keyword list are used as the target keywords.
  • the first preset threshold is preset and saved by the developer.
  • the first processing unit 52 is configured to segment the webpage document to obtain at least one word
  • Screen at least one word obtained from the web page document, and generate a candidate target word list according to the word obtained after screening;
  • the candidate target words that meet the second preset condition in the candidate target word list are used as the target words.
  • the second preset threshold is preset and saved by the developer.
  • the first processing unit 52 configured to use as the target word candidate target words that meet the second preset condition in the candidate target word list includes:
  • the candidate target words in the candidate target word list whose relevance exceeds a second preset threshold and do not belong to the dictionary are used as target words.
  • the obtaining unit 51, the first processing unit 52, and the second processing unit 53 can all be implemented by a processor in the electronic device (such as a server or a mobile terminal), such as a central processing unit (CPU), a digital signal processor (DSP), a microcontroller unit (MCU), or a field-programmable gate array (FPGA).
  • when the device provided in the above embodiment performs data processing, the division into the above-mentioned program modules is used only as an example. The above-mentioned processing can be allocated to different program modules as needed; that is, the internal structure of the terminal is divided into different program modules to complete all or part of the processing described above.
  • the device provided in the foregoing embodiment and the data processing method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • FIG. 6 is a schematic diagram of the hardware composition structure of the electronic device according to the embodiment of the application.
  • the electronic device 60 includes a memory 63, a processor 62, and a computer program stored on the memory 63 and capable of running on the processor 62; when the processor 62 located in the electronic device executes the program, it implements the method provided by one or more technical solutions on the electronic device side.
  • when the processor 62 located in the electronic device 60 executes the program, it implements: obtaining the target keyword from the target document;
  • the language model is used to perform text recognition on voice data including the target keyword or the target word to obtain a recognition result; the recognition result is used to present the voice data when the voice data is played.
  • when the processor 62 located in the electronic device 60 executes the program, it implements: using the updated dictionary and the corpus corresponding to the dictionary to perform model training to obtain the first language model;
  • when the processor 62 located in the electronic device 60 executes the program, it implements: using a web crawler to obtain the corpus corresponding to each word in the updated dictionary;
  • a corpus corresponding to the dictionary is generated.
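A minimal sketch of how the per-word corpora might be merged into a corpus for the dictionary is given below. The crawler itself is abstracted as a caller-supplied function `fetch_texts`, which is a hypothetical stand-in so that the example runs without network access; the whitespace normalization and de-duplication are likewise assumptions rather than steps stated in the embodiment.

```python
def build_corpus(dictionary_words, fetch_texts):
    """Collect text for every dictionary word and merge it into one corpus.

    fetch_texts is a hypothetical callable standing in for the web crawler:
    given a word, it returns an iterable of text snippets for that word.
    """
    corpus = []
    seen = set()
    for word in dictionary_words:
        for text in fetch_texts(word):
            cleaned = " ".join(text.split())  # collapse runs of whitespace
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                corpus.append(cleaned)
    return corpus


if __name__ == "__main__":
    # Canned pages used in place of real crawl results.
    pages = {
        "asr": ["ASR  converts speech to text.", "ASR converts speech to text."],
        "corpus": ["A corpus is a text collection.", ""],
    }
    print(build_corpus(["asr", "corpus", "missing"], lambda w: pages.get(w, [])))
```

Passing the crawler in as a function keeps the merging logic testable on canned data and leaves the choice of crawling library open.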
  • when the processor 62 located in the electronic device 60 executes the program, it implements: determining the pronunciation of each target word in the at least one target word;
  • the dictionary is updated according to the at least one target word and the pronunciation of each target word in the at least one target word.
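The dictionary update described above could look roughly like the following, assuming the dictionary maps each word to a pronunciation string. How pronunciations are determined is not specified here, so `pronounce` is a hypothetical callable, and the letter-by-letter `spell_out` helper is only a placeholder for a real grapheme-to-phoneme step.

```python
def update_dictionary(dictionary, target_words, pronounce):
    """Return a copy of the dictionary extended with the target words.

    dictionary maps word -> pronunciation. pronounce is a hypothetical
    callable that returns the pronunciation of a word. Words already in
    the dictionary keep their existing pronunciation.
    """
    updated = dict(dictionary)
    for word in target_words:
        if word not in updated:
            updated[word] = pronounce(word)
    return updated


def spell_out(word):
    """Placeholder pronunciation: read the word letter by letter."""
    return " ".join(word.upper())


if __name__ == "__main__":
    base = {"speech": "S P IY CH"}
    print(update_dictionary(base, ["asr", "speech"], spell_out))
```

Returning a new mapping rather than mutating the input keeps the original dictionary available, for example to compare recognition results before and after the update.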
  • when the processor 62 located in the electronic device 60 executes the program, it implements: obtaining the target document;
  • when the processor 62 located in the electronic device 60 executes the program, it implements: performing a word correlation calculation for each candidate keyword in the candidate keyword list, where the word correlation represents the correlation between the corresponding candidate keyword and the other candidate keywords in the candidate keyword list;
  • the candidate keywords whose word relevance exceeds a first preset threshold in the candidate keyword list are used as the target keywords.
  • when the processor 62 located in the electronic device 60 executes the program, it implements: segmenting the webpage document to obtain at least one word;
  • screening the at least one word obtained from the webpage document, and generating a candidate target word list according to the words obtained after screening;
  • the candidate target words that meet the second preset condition in the candidate target word list are used as the target words.
  • when the processor 62 located in the electronic device 60 executes the program, it implements: performing a word correlation calculation for each candidate target word in the candidate target word list, where the word correlation characterizes the correlation between the corresponding candidate target word and the other candidate target words in the candidate target word list;
  • the candidate target words in the candidate target word list whose relevance exceeds a second preset threshold and do not belong to the dictionary are used as target words.
  • the electronic device further includes a communication interface 61; various components in the electronic device are coupled together through the bus system 64.
  • the bus system 64 is configured to implement connection and communication between these components.
  • the bus system 64 also includes a power bus, a control bus, and a status signal bus.
  • the memory 63 in this embodiment may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory can be magnetic disk storage or tape storage.
  • the volatile memory may be a random access memory (RAM), which is used as an external cache. Many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).
  • the memories described in the embodiments of the present application are intended to include, but are not limited to, these and any other suitable types of memories.
  • the method disclosed in the foregoing embodiments of the present application may be applied to the processor 62 or implemented by the processor 62.
  • the processor 62 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 62 or instructions in the form of software.
  • the aforementioned processor 62 may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like.
  • the processor 62 may implement or execute various methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium, and the storage medium is located in a memory.
  • the processor 62 reads the information in the memory and completes the steps of the foregoing method in combination with its hardware.
  • the embodiment of the present application also provides a storage medium, which is specifically a computer storage medium, and more specifically, a computer-readable storage medium.
  • Stored thereon are computer instructions, that is, a computer program; when the computer instructions are executed by a processor, they implement the method provided by one or more technical solutions on the electronic device side.
  • the disclosed method and smart device can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may be individually used as a unit, or two or more units may be integrated into one unit;
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • the foregoing program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the foregoing method embodiment; the foregoing storage medium includes various media that can store program code, such as a removable storage device, ROM, RAM, a magnetic disk, or an optical disk.
  • if the aforementioned integrated unit of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: removable storage devices, ROM, RAM, magnetic disks, or optical disks and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a data processing method, an apparatus, an electronic device, and a storage medium. The data processing method comprises: obtaining a target keyword from a target document (201); obtaining a relevant web page document according to the target keyword and determining at least one target word from the web page document, the target word not belonging to a preset dictionary (202); and updating the dictionary according to the at least one target word and training a language model using the updated dictionary, the language model being used to perform text recognition on voice data containing the target keyword or the target word to obtain a recognition result, and the recognition result being presented when the voice data is played (203).
PCT/CN2019/130650 2019-12-31 2019-12-31 Data processing method, apparatus, electronic device, and storage medium WO2021134524A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/130650 WO2021134524A1 (fr) 2019-12-31 2019-12-31 Data processing method, apparatus, electronic device, and storage medium
CN201980101007.3A CN114556328B (zh) 2019-12-31 2019-12-31 Data processing method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130650 WO2021134524A1 (fr) 2019-12-31 2019-12-31 Data processing method, apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021134524A1 (fr) 2021-07-08

Family

ID=76686075

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130650 WO2021134524A1 (fr) 2019-12-31 2019-12-31 Data processing method, apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114556328B (fr)
WO (1) WO2021134524A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
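CN113743112A (zh) * 2021-08-24 2021-12-03 北京百度网讯科技有限公司 Keyword extraction method and apparatus, electronic device, and readable storage medium
CN114170856A (zh) * 2021-12-06 2022-03-11 网易有道信息技术(北京)有限公司 Machine-implemented listening training method, device, and readable storage medium
CN114186552A (zh) * 2021-12-13 2022-03-15 北京百度网讯科技有限公司 Text analysis method, apparatus, device, and computer storage medium
CN115344787A (zh) * 2022-08-23 2022-11-15 华南师范大学 Multi-granularity recommendation method, system, apparatus, and storage medium
CN115618397A (zh) * 2022-12-19 2023-01-17 深圳市研强物联技术有限公司 Voice encryption method for a recording pen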

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
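CN115394293A (zh) * 2022-08-08 2022-11-25 湖北星纪时代科技有限公司 Dialogue system and method for implementing dialogue
CN115563375A (zh) * 2022-09-29 2023-01-03 北京海泰方圆科技股份有限公司 Document index updating method, apparatus, device, and medium
CN116108834A (zh) * 2023-04-10 2023-05-12 中国民用航空飞行学院 Interactive user dictionary construction method, apparatus, and device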

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106033462A (zh) * 2015-03-19 2016-10-19 科大讯飞股份有限公司 New word discovery method and system
US20180067920A1 (en) * 2016-09-06 2018-03-08 Kabushiki Kaisha Toshiba Dictionary updating apparatus, dictionary updating method and computer program product
CN108804512A (zh) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Apparatus and method for generating a text classification model, and computer-readable storage medium
CN108920473A (zh) * 2018-07-04 2018-11-30 中译语通科技股份有限公司 Data-augmented machine translation method based on replacement of same-category words and synonyms
CN109783649A (zh) * 2019-01-02 2019-05-21 腾讯科技(深圳)有限公司 Domain dictionary generation method and apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8209164B2 (en) * 2007-11-21 2012-06-26 University Of Washington Use of lexical translations for facilitating searches
CN103778215B (zh) * 2014-01-17 2016-08-17 北京理工大学 Stock market prediction method based on fusion of sentiment analysis and a hidden Markov model
CN110308931B (zh) * 2019-06-20 2024-06-07 平安科技(深圳)有限公司 Data processing method and related apparatus

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
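CN113743112A (zh) * 2021-08-24 2021-12-03 北京百度网讯科技有限公司 Keyword extraction method and apparatus, electronic device, and readable storage medium
CN113743112B (zh) * 2021-08-24 2023-09-12 北京百度网讯科技有限公司 Keyword extraction method and apparatus, electronic device, and readable storage medium
CN114170856A (zh) * 2021-12-06 2022-03-11 网易有道信息技术(北京)有限公司 Machine-implemented listening training method, device, and readable storage medium
CN114170856B (zh) * 2021-12-06 2024-03-12 网易有道信息技术(北京)有限公司 Machine-implemented listening training method, device, and readable storage medium
CN114186552A (zh) * 2021-12-13 2022-03-15 北京百度网讯科技有限公司 Text analysis method, apparatus, device, and computer storage medium
CN114186552B (zh) * 2021-12-13 2023-04-07 北京百度网讯科技有限公司 Text analysis method, apparatus, device, and computer storage medium
CN115344787A (zh) * 2022-08-23 2022-11-15 华南师范大学 Multi-granularity recommendation method, system, apparatus, and storage medium
CN115344787B (zh) * 2022-08-23 2023-07-04 华南师范大学 Multi-granularity recommendation method, system, apparatus, and storage medium
CN115618397A (zh) * 2022-12-19 2023-01-17 深圳市研强物联技术有限公司 Voice encryption method for a recording pen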

Also Published As

Publication number Publication date
CN114556328A (zh) 2022-05-27
CN114556328B (zh) 2024-07-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958187

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 291122)

122 Ep: pct application non-entry in european phase

Ref document number: 19958187

Country of ref document: EP

Kind code of ref document: A1