CN110825913A - Professional word extraction and part-of-speech tagging method - Google Patents
- Publication number
- CN110825913A
- Authority
- CN
- China
- Prior art keywords
- words
- industry
- extracting
- characteristic words
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for extracting professional words and labeling parts of speech, comprising the following steps: S1: establishing a keyword tag database that stores the correspondence between keywords and industry feature words; S2: extracting keywords from a query instruction; S3: matching, based on the keyword tag database established in step S1, the industry feature words corresponding to the keywords obtained in step S2; S4: screening video files based on the industry feature words obtained in step S3; S5: marking the industry feature words in the video files obtained by screening. The invention can extract industry feature words from video files and acquire the text information associated with the general industry feature words corresponding to the keywords. Retrieval accuracy is greatly improved, the workload is reduced, and working efficiency is improved.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a method for extracting professional words from a video file and labeling their parts of speech.
Background
AR technology, also called augmented reality technology, is a new technology that uses computers and related technologies to simulate physical information that is difficult to experience within a given region of the real world, superimposes the resulting virtual content on the real world, and lets the human senses perceive it, thereby achieving a sensory experience beyond reality. In recent years, AR technology has been widely used in industry, video, medicine, education, and other fields. When extracting professional words, conventional AR technology mainly obtains domain terms by calculating the coupling degree between adjacent words; however, this method must calculate the coupling degree for all words in a corpus and has low accuracy. If all professional terms are instead labeled manually, the workload is large and efficiency is low. Therefore, how to develop a new method for extracting professional words and labeling parts of speech in the AR field that overcomes these problems, improving extraction accuracy and work efficiency while reducing workload, is a direction that those skilled in the art need to research.
Disclosure of Invention
The invention aims to provide a method for extracting professional words and labeling parts of speech, which can improve the extraction accuracy of video files related to keywords, reduce the extraction workload and improve the extraction efficiency.
The technical scheme adopted is as follows:
A method for extracting professional words and labeling parts of speech comprises the following steps: S1: establishing a keyword tag database that stores the correspondence between keywords and industry feature words; S2: extracting keywords from a query instruction; S3: matching, based on the keyword tag database established in step S1, the industry feature words corresponding to the keywords obtained in step S2; S4: screening video files based on the industry feature words obtained in step S3; S5: marking the industry feature words in the video files obtained by screening.
By adopting this technical scheme, the keywords in the query instruction are matched to their corresponding industry feature words through a preset keyword tag database in which keywords and industry feature words correspond one to one; each video file is automatically screened according to the industry feature words of the current query; and the industry feature words in the screening result are annotated accordingly, which reduces the extraction workload and improves extraction efficiency.
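Steps S2 and S3 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the dictionary `KEYWORD_TAG_DB`, the functions `extract_keywords` and `match_feature_words`, and the sample entries are hypothetical, and keyword extraction is reduced to finding known database keywords inside the query string.

```python
# Hypothetical keyword tag database (S1): keyword -> industry feature words.
KEYWORD_TAG_DB = {
    "boiler": ["boiler feed pump", "steam drum"],
    "turbine": ["turbine trip", "steam turbine"],
}


def extract_keywords(query, db):
    """S2 (simplified): return the database keywords that occur in the query."""
    text = query.lower()
    return [keyword for keyword in db if keyword in text]


def match_feature_words(keywords, db):
    """S3: collect the industry feature words mapped to the extracted keywords."""
    matched = []
    for keyword in keywords:
        for word in db.get(keyword, []):
            if word not in matched:  # keep first-seen order, drop duplicates
                matched.append(word)
    return matched


keywords = extract_keywords("Show boiler maintenance videos", KEYWORD_TAG_DB)
feature_words = match_feature_words(keywords, KEYWORD_TAG_DB)
```

A production system would likely replace the substring lookup with a proper keyword extractor; the lookup is used here only to keep the sketch self-contained.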
Preferably, the method for extracting professional words and labeling parts of speech further includes step S6: adding retrieval tags to the video files obtained in step S4 and sorting the video files in time order.
By adopting this technical scheme, sorting the video files by the time at which the industry feature words appear improves the accuracy of screening the video files and ensures that a user can preferentially search the more relevant video files in time order.
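A minimal sketch of step S6 follows. The record layout is an assumption: each screened video is a dictionary with a hypothetical `first_hit_seconds` field holding the time at which an industry feature word first appears, and `tag_and_sort` is an illustrative name rather than anything defined by the method.

```python
def tag_and_sort(videos, feature_words):
    """S6 (sketch): attach retrieval tags, then order videos by time.

    videos: list of dicts with 'path' and 'first_hit_seconds' keys (assumed layout).
    feature_words: the industry feature words matched for the current query.
    """
    # Copy each record so the caller's data is not modified in place.
    tagged = [dict(video, retrieval_tags=list(feature_words)) for video in videos]
    return sorted(tagged, key=lambda video: video["first_hit_seconds"])


videos = [
    {"path": "b.mp4", "first_hit_seconds": 42.0},
    {"path": "a.mp4", "first_hit_seconds": 3.5},
]
ordered = tag_and_sort(videos, ["steam drum"])
```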
More preferably, in the method for extracting professional words and labeling parts of speech, step S1 includes the following steps: S11: collecting industry feature words from industry safety regulations and operation manuals; S12: entering the industry feature words into the keyword tag database in a uniform format; S13: labeling each industry feature word with its corresponding keyword label; S14: labeling each industry feature word with a weight value label based on its frequency of occurrence, and sorting the industry feature words by weight value label.
By adopting this technical scheme, the vocabulary of general industry feature words is built from the safety regulations of enterprises across the industry, the operation manuals of equipment manufacturers, and similar sources, so feature extraction is flexible and accurate. Assigning weight values to the words and sorting them based on frequency statistics of the industry feature words further improves retrieval accuracy.
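The sub-steps S11 to S14 could look roughly like the sketch below. Several points are assumptions: the text does not say how the weight value is computed from the frequency, so relative frequency is used here; `build_keyword_tag_db`, the record fields, and the sample documents are hypothetical.

```python
from collections import Counter


def build_keyword_tag_db(documents, feature_to_keyword):
    """Build keyword tag records sorted by descending weight.

    documents: texts of safety regulations and operation manuals (S11 source).
    feature_to_keyword: hypothetical mapping of feature word -> keyword label.
    """
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        for word in feature_to_keyword:
            counts[word] += lowered.count(word.lower())
    total = sum(counts.values()) or 1  # avoid division by zero
    records = [
        {
            "feature_word": word.strip().lower(),  # S12: uniform formatting
            "keyword": keyword,                    # S13: keyword label
            "weight": counts[word] / total,        # S14: assumed relative frequency
        }
        for word, keyword in feature_to_keyword.items()
    ]
    records.sort(key=lambda record: record["weight"], reverse=True)  # S14: sort
    return records


docs = [
    "Check the steam drum level. The steam drum must be vented.",
    "Start the boiler feed pump before firing.",
]
db = build_keyword_tag_db(
    docs, {"boiler feed pump": "boiler", "steam drum": "boiler"}
)
```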
Further preferably, in the method for extracting professional words and labeling parts of speech, step S4 includes the following steps: S41: extracting the audio track file from a video file; S42: converting the audio track file obtained in step S41 into a text description file; S43: performing word segmentation on the text description file to split it into a plurality of words; S44: screening out the video files whose text description files contain an industry feature word among the segmented words.
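The screening part of step S4 can be sketched as below, assuming that audio track extraction and speech-to-text conversion have already been done by external tools, so that each video already has a transcript. The regex tokenizer is only a stand-in for word segmentation (Chinese text would need a real segmenter), and all function names are hypothetical.

```python
import re


def segment(text):
    """S43 stand-in: split a transcript into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))


def screen_videos(transcripts, feature_words):
    """S44: keep videos whose transcript contains at least one feature word.

    transcripts: dict of video path -> transcript text (assumed output of S42).
    Returns a dict of video path -> list of feature words found.
    """
    hits = {}
    for path, text in transcripts.items():
        tokens = segment(text)
        found = [w for w in feature_words if contains_phrase(tokens, segment(w))]
        if found:
            hits[path] = found
    return hits


transcripts = {
    "inspection.mp4": "Open the valve, then check the steam drum level.",
    "intro.mp4": "Welcome to the plant tour.",
}
result = screen_videos(transcripts, ["steam drum", "turbine trip"])
```

Matching on token sequences rather than raw substrings avoids false hits inside longer words, at the cost of depending on the quality of the segmentation.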
Still more preferably, in step S43 of the method, the word segmentation processing is implemented on a distributed crawler platform.
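The text does not describe the distributed platform itself. As a loose single-machine analogy only, the sketch below fans the segmentation of many transcripts out over a thread pool from the Python standard library; `segment_all` and the regex tokenizer are hypothetical stand-ins, and a real deployment would distribute the work across machines.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def segment(text):
    """Stand-in word segmentation: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def segment_all(transcripts, max_workers=4):
    """Segment every transcript in parallel and return path -> token list."""
    paths = list(transcripts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        token_lists = list(pool.map(segment, (transcripts[p] for p in paths)))
    return dict(zip(paths, token_lists))


segmented = segment_all(
    {"a.mp4": "Check the steam drum.", "b.mp4": "Turbine trip!"}
)
```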
Still more preferably, in step S5 of the method, the industry feature words in the video files obtained by screening are marked for display using color modulation display and/or brightness modulation display.
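One way to picture the marking of step S5 on a transcript is terminal highlighting with ANSI escape sequences, as sketched below. This is an assumption made for illustration: the text does not say how color or brightness modulation is rendered, and `mark_feature_words` is a hypothetical helper.

```python
import re

RED = "\033[31m"   # color modulation: red foreground
BOLD = "\033[1m"   # brightness modulation: bold/bright on most terminals
RESET = "\033[0m"  # restore default rendering


def mark_feature_words(text, feature_words, color=True, bright=True):
    """Wrap every occurrence of a feature word in the chosen ANSI styles."""
    prefix = (RED if color else "") + (BOLD if bright else "")
    if not feature_words or not prefix:
        return text
    # Longest first so that longer phrases win over their substrings.
    ordered = sorted(feature_words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: prefix + m.group(0) + RESET, text)


marked = mark_feature_words("Check the steam drum level.", ["steam drum"])
```

The `color` and `bright` flags mirror the "and/or" wording: either style can be applied alone, or both together.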
Compared with the prior art, the invention can extract industry feature words from video files and acquire the text information associated with the general industry feature words corresponding to the keywords. Retrieval accuracy is greatly improved, the workload is reduced, and working efficiency is improved.
Drawings
The invention is described in further detail in the following description of embodiments with reference to the accompanying drawings:
Fig. 1 is a schematic flow chart of embodiment 1 of the present invention.
Detailed Description
In order to illustrate the technical solution of the present invention more clearly, it is further described below with reference to an embodiment.
Fig. 1 shows embodiment 1 of the present invention:
A method for extracting professional words and labeling parts of speech comprises the following steps:
S11: collecting industry feature words from industry safety regulations and operation manuals;
S12: entering the industry feature words into the keyword tag database in a uniform format;
S13: labeling each industry feature word with its corresponding keyword label;
S14: labeling each industry feature word with a weight value label based on its frequency of occurrence, and sorting the industry feature words by weight value label;
S2: extracting keywords from a query instruction;
S3: matching, based on the keyword tag database established in step S1, the industry feature words corresponding to the keywords obtained in step S2;
S41: extracting the audio track file from a video file;
S42: converting the audio track file obtained in step S41 into a text description file;
S43: performing word segmentation on the text description file to split it into a plurality of words;
S44: screening out the video files whose text description files contain an industry feature word among the segmented words;
S5: displaying the industry feature words in the video files obtained by screening with color modulation and brightness modulation;
S6: adding retrieval tags to the video files obtained in step S4 and sorting the video files in time order.
In the above embodiment, the word segmentation processing of step S43 is implemented on a distributed crawler platform.
In the above technical scheme, the text description of each screened video clip can be compared for similarity with a natural-language description of a scene, and a set of key frames whose content matches the natural-language scene description is output; the objects in the key frame set are identified and extracted to generate an object set; finally, key frames are generated from the scene graph and the object set to produce the video. Compared with keyword-based retrieval, a video retrieval method based on natural language processing greatly reduces descriptive ambiguity, so the system can filter and find matching videos more efficiently. Meanwhile, dozens of important fields, such as the title and body text, are extracted from the centralized control regulations, maintenance regulations, user operation manuals, and product specifications used by companies in the industry, using a customized extraction service for special types of web pages. The extraction back end normalizes and structures the web page content, and a user can efficiently obtain rich structured information from a specified page simply by calling an extraction API. The method is flexible in feature extraction and highly accurate; the scheme does not require manual verification of accuracy, which speeds up information extraction.
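The similarity comparison between a clip's text description and a natural-language scene description is not specified further. As one possible reading, the sketch below ranks clips by bag-of-words cosine similarity; the choice of cosine similarity, the regex tokenizer, and the names `cosine_similarity` and `rank_clips` are all assumptions introduced for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens in a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def rank_clips(scene_description, clip_texts):
    """Return clip ids ordered from most to least similar to the description."""
    scored = [
        (cosine_similarity(scene_description, text), clip)
        for clip, text in clip_texts.items()
    ]
    return [clip for _, clip in sorted(scored, key=lambda s: s[0], reverse=True)]


ranking = rank_clips(
    "check steam drum",
    {"c1": "welcome to the tour", "c2": "check the steam drum level"},
)
```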
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. The protection scope of the present invention is subject to the protection scope of the claims.
Claims (6)
1. A method for extracting professional words and labeling parts of speech, characterized by comprising the following steps:
S1: establishing a keyword tag database that stores the correspondence between keywords and industry feature words;
S2: extracting keywords from a query instruction;
S3: matching, based on the keyword tag database established in step S1, the industry feature words corresponding to the keywords obtained in step S2;
S4: screening video files based on the industry feature words obtained in step S3;
S5: marking and displaying the industry feature words in the video files obtained by screening.
2. The method for extracting professional words and labeling parts of speech as claimed in claim 1, further comprising a step S6, wherein step S6 comprises: adding retrieval tags to the video files obtained in step S4 and sorting the video files in time order.
3. The method for extracting professional words and labeling parts of speech as claimed in claim 1, wherein step S1 comprises the following steps:
S11: collecting industry feature words from industry safety regulations and operation manuals;
S12: entering the industry feature words into the keyword tag database in a uniform format;
S13: labeling each industry feature word with its corresponding keyword label;
S14: labeling each industry feature word with a weight value label based on its frequency of occurrence, and sorting the industry feature words by weight value label.
4. The method for extracting professional words and labeling parts of speech as claimed in claim 1, wherein step S4 comprises the following steps:
S41: extracting the audio track file from a video file;
S42: converting the audio track file obtained in step S41 into a text description file;
S43: performing word segmentation on the text description file to split it into a plurality of words;
S44: screening out the video files whose text description files contain an industry feature word among the segmented words.
5. The method for extracting professional words and labeling parts of speech as claimed in claim 1, wherein the word segmentation processing of step S43 is implemented on a distributed crawler platform.
6. The method for extracting specialized words and labeling parts of speech according to claim 1, wherein the label display in step S5 is a color-modulation display and/or a brightness-modulation display.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910841201.7A CN110825913A (en) | 2019-09-03 | 2019-09-03 | Professional word extraction and part-of-speech tagging method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910841201.7A CN110825913A (en) | 2019-09-03 | 2019-09-03 | Professional word extraction and part-of-speech tagging method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110825913A true CN110825913A (en) | 2020-02-21 |
Family
ID=69547927
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910841201.7A Pending CN110825913A (en) | 2019-09-03 | 2019-09-03 | Professional word extraction and part-of-speech tagging method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110825913A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101826084A (en) * | 2009-03-05 | 2010-09-08 | 深圳市万泉河科技有限公司 | Fast searching method for files, fast searching method for mass talent hiring on Internet and system |
CN102043812A (en) * | 2009-10-13 | 2011-05-04 | 北京大学 | Method and system for retrieving medical information |
CN103425742A (en) * | 2013-07-16 | 2013-12-04 | 北京中科汇联信息技术有限公司 | Method and device for searching website |
CN103678694A (en) * | 2013-12-26 | 2014-03-26 | 乐视网信息技术(北京)股份有限公司 | Method and system for establishing reverse index file of video resources |
CN106874443A (en) * | 2017-02-09 | 2017-06-20 | 北京百家互联科技有限公司 | Based on information query method and device that video text message is extracted |
CN107203616A (en) * | 2017-05-24 | 2017-09-26 | 苏州百智通信息技术有限公司 | The mask method and device of video file |
CN108241856A (en) * | 2018-01-12 | 2018-07-03 | 新华智云科技有限公司 | Information generation method and equipment |
CN108388583A (en) * | 2018-01-26 | 2018-08-10 | 北京览科技有限公司 | A kind of video searching method and video searching apparatus based on video content |
CN109101558A (en) * | 2018-07-12 | 2018-12-28 | 北京猫眼文化传媒有限公司 | A kind of video retrieval method and device |
-
2019
- 2019-09-03 CN CN201910841201.7A patent/CN110825913A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107633005B (en) | Knowledge graph construction and comparison system and method based on classroom teaching content | |
CN111046656B (en) | Text processing method, text processing device, electronic equipment and readable storage medium | |
CN111581990B (en) | Cross-border transaction matching method and device | |
WO2021135469A1 (en) | Machine learning-based information extraction method, apparatus, computer device, and medium | |
CN104809142A (en) | Trademark inquiring system and method | |
CN111190920B (en) | Data interaction query method and system based on natural language | |
CN116108857B (en) | Information extraction method, device, electronic equipment and storage medium | |
CN110674378A (en) | Chinese semantic recognition method based on cosine similarity and minimum editing distance | |
CN111428503A (en) | Method and device for identifying and processing same-name person | |
Duarte et al. | Heterogeneous data sources for signed language analysis and synthesis: The signcom project | |
CN117076693A (en) | Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus | |
CN107688621B (en) | Method and system for optimizing file | |
CN111401044A (en) | Title generation method and device, terminal equipment and storage medium | |
CN111444720A (en) | Named entity recognition method for English text | |
CN114141384A (en) | Method, apparatus and medium for retrieving medical data | |
CN109766442A (en) | A kind of couple of user takes down notes the method and system classified | |
CN113609847A (en) | Information extraction method and device, electronic equipment and storage medium | |
CN110825913A (en) | Professional word extraction and part-of-speech tagging method | |
CN112417875A (en) | Configuration information updating method and device, computer equipment and medium | |
Chivadshetti et al. | Content based video retrieval using integrated feature extraction and personalization of results | |
CN111191413A (en) | Method, device and system for automatically marking event core content based on graph sequencing model | |
CN105224642B (en) | The abstracting method and device of entity tag | |
CN114996494A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN114637831A (en) | Data query method based on semantic analysis and related equipment thereof | |
CN114328946A (en) | Hidden danger processing method based on knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||