CN111753137A - Video searching method based on voice characteristics - Google Patents
- Publication number
- CN111753137A CN111753137A CN202010604506.9A CN202010604506A CN111753137A CN 111753137 A CN111753137 A CN 111753137A CN 202010604506 A CN202010604506 A CN 202010604506A CN 111753137 A CN111753137 A CN 111753137A
- Authority
- CN
- China
- Prior art keywords
- voice
- neural network
- method based
- searching method
- video searching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Library & Information Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a video search method based on voice features, which comprises the following steps: extracting the voice data to be searched into a plurality of multi-dimensional feature vectors; converting the voice data of the extracted multi-dimensional feature vectors into a plurality of texts; importing the obtained texts into a time-series neural network for training to obtain a high-dimensional feature vector; performing a regression operation with a convolutional neural network that takes the high-dimensional feature vector as input; comparing the feature vectors produced by the convolutional neural network with the corresponding feature vectors in the database, selecting the result with the highest similarity, and testing it to obtain the final accuracy; and outputting the selected highest-similarity result as video or audio, completing the final search, and feeding the result back to the user by voice. The invention addresses the technical problem in the prior art that speech information varies unpredictably and it is difficult to label enough data to guarantee the precision of a search algorithm, so that the accuracy of voice search is low.
Description
Technical Field
The invention relates to the field of artificial intelligence and computer vision processing, and in particular to a video search method based on voice features.
Background
The rapid development of the internet has driven traditional industries to become more economical and intelligent, and people have begun to generate more demands close to daily life. As one of the mainstream modes of human-computer interaction on intelligent terminals, voice interaction is used more and more frequently in practice, as is the use of intelligent terminals to search for multimedia resources. An existing intelligent terminal first converts speech into text and performs fuzzy matching between that text and the fields in a library file: if a matching field exists in the database, the corresponding semantics can be found; if not, they cannot. This method cannot search and recommend through the characteristics (voice, picture) of the multimedia resource itself, which degrades the experience of users with such needs. Very often, people want to search for media they care about but do not know the title of the source; in such cases, a device is needed that can deeply analyze the voice information to recommend and search for media.
For example, when a user searches for a movie of movie a, the current search method needs to know the name of the movie and yell "i want to watch movie a" with voice, the device receives the voice and then converts the voice recognition into characters, and after analyzing semantics, searches for the corresponding movie of the movie a; many times, however, the user does not know the title of the series and the user may only remember certain features, such as: movie a is the first television show of an actor 2016, and when(s) he shouts "i want to watch the first television show of the actor 2016", the device analyzes the speech by fields, and compares the fields in the database, and the corresponding video media cannot be searched because of the semantic meaning without matching the fields.
Disclosure of Invention
The invention aims to provide a video search method based on voice features, to solve the technical problem in the prior art that speech information varies unpredictably and it is difficult to label enough data to guarantee the precision of a search algorithm, so that the accuracy of voice search is low.
The invention solves the problems through the following technical scheme:
a video searching method based on voice characteristics comprises the following steps:
step 1) extracting voice data to be searched into a plurality of multi-dimensional characteristic vectors;
step 2) converting the extracted voice data of the multi-dimensional feature vectors into a plurality of texts;
step 3) importing the plurality of texts obtained in step 2) into a time-series neural network for training to obtain a high-dimensional feature vector;
step 4) performing a regression operation with a convolutional neural network that takes the high-dimensional feature vector as input;
step 5) comparing the feature vectors after the convolutional neural network's regression operation with the corresponding feature vectors in the database, selecting the result with the highest similarity, and testing that result to obtain the final accuracy;
step 6) outputting the highest-similarity result selected in step 5) as video or audio, completing the final search, and feeding the result back to the user by voice.
Preferably, the training set of the time-series neural network employs a CA8 training set.
Preferably, the samples of the CA8 training set are randomly split into a training set, a validation set, and a test set at a ratio of 8:1:1.
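The 8:1:1 random split described above can be sketched as follows; the helper name and the `seed` parameter are illustrative assumptions, not part of the patent:

```python
import random

def split_8_1_1(samples, seed=0):
    """Randomly split samples into train/validation/test at an 8:1:1 ratio."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_8_1_1(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```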
Preferably, the time-series neural network comprises 5 gating neural units which are connected in series.
Preferably, the interior of the gated neural unit includes a reset gate and an update gate.
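A gated neural unit with a reset gate and an update gate is the standard GRU cell; a minimal NumPy sketch is below. This is an illustration of the mechanism, not the patent's exact implementation, and all weight names and sizes are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One gated-unit step: the reset gate r manages the incoming data,
    the update gate z decides how much is passed on to the next unit."""
    Wr, Ur, br = params["r"]   # reset-gate weights (illustrative names)
    Wz, Uz, bz = params["z"]   # update-gate weights
    Wh, Uh, bh = params["h"]   # candidate-state weights
    r = sigmoid(x @ Wr + h_prev @ Ur + br)               # reset gate
    z = sigmoid(x @ Wz + h_prev @ Uz + bz)               # update gate
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh + bh)   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde              # new hidden state

rng = np.random.default_rng(0)
d_in, d_h = 16, 32
params = {k: (rng.normal(size=(d_in, d_h)) * 0.1,
              rng.normal(size=(d_h, d_h)) * 0.1,
              np.zeros(d_h)) for k in ("r", "z", "h")}
h = np.zeros(d_h)
for t in range(5):   # five serially connected units, as in the patent
    h = gru_cell(rng.normal(size=d_in), h, params)
print(h.shape)  # (32,)
```

Because the new state is a convex combination of the previous state and a tanh candidate, every component of `h` stays strictly inside (-1, 1).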
Preferably, the convolutional neural network is arranged in a U-shaped structure, i.e., a decoder-encoder structure.
Preferably, the regression operation flow of the convolutional neural network is as follows: the decoder splits the acquired high-dimensional feature vector into a plurality of linearly independent sub-dimension vectors and obtains an intermediate vector of these sub-dimension vectors through convolution layers of different dimensions; the encoder takes the intermediate vector as input, and the convolutional neural network is trained to perform convolution combination to restore the original new feature map.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention extracts the voice characteristics of multimedia in multiple dimensions through the time sequence neural network, and then searches the video by using the method of comparing the semantic acquired by the convolutional network combination with the user requirement, thereby improving the accuracy of voice search. The technical problems that the existing voice multimedia search system does not perform deep search according to voice information and users have more actual requirements to perform related search aiming at the characteristics in the voice are solved. The deep learning-based system is urgently needed, can complete multimedia resource recommendation in most of actual life, better optimizes user experience and improves applicability.
Drawings
FIG. 1 shows the internal structure of a single gated neural unit of the present invention.
Fig. 2 shows a decoder-encoder structure according to the present invention.
FIG. 3 is an exemplary diagram of audio feature vectors according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
referring to fig. 1 to 3, a video search method based on voice features includes the following steps:
Step 1) cutting the voice information to be searched into a plurality of small segments and extracting each segment of voice information into a feature vector, where each feature vector is multi-dimensional, as shown in fig. 3.
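Step 1) can be sketched as a minimal NumPy front end that cuts a 1-D signal into fixed-length segments and maps each to a multi-dimensional feature vector. The log-band-energy features, segment length, and band count below are illustrative assumptions, since the patent does not specify the feature extractor:

```python
import numpy as np

def extract_segment_features(signal, seg_len=400, n_bands=13):
    """Cut a 1-D speech signal into fixed-length segments and turn each
    segment into a multi-dimensional feature vector (here: log band energies
    of the magnitude spectrum, a stand-in for an MFCC-style front end)."""
    n_segs = len(signal) // seg_len
    feats = []
    for i in range(n_segs):
        seg = signal[i * seg_len:(i + 1) * seg_len]
        spectrum = np.abs(np.fft.rfft(seg))           # magnitude spectrum
        bands = np.array_split(spectrum, n_bands)     # coarse frequency bands
        feats.append(np.log1p([b.sum() for b in bands]))
    return np.array(feats)   # shape: (n_segments, n_bands)

audio = np.random.default_rng(1).normal(size=4000)   # dummy 4000-sample signal
f = extract_segment_features(audio)
print(f.shape)  # (10, 13)
```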
And 2) converting all the extracted multi-dimensional feature vectors into texts.
Step 3) setting the training set of the time-series neural network to the CA8 training set, whose samples are randomly split into a training set, a validation set, and a test set at a ratio of 8:1:1. First, the parameters of the reset gate and the update gate of the time-series neural network are initialized; then the text converted in step 2) is imported into the time-series neural network. Inside each gated neural unit, the reset gate manages and trains the incoming data; after the internal update is complete, the data to be passed on is transferred to the next component through the update gate. Five serially connected gated neural units extract the relevant text features of the text imported in step 2), finally yielding a high-dimensional feature vector. The structure of each gated neural unit is shown in fig. 1.
Step 4) designing the convolutional neural network as a U-shaped structure, as shown in fig. 2, i.e. a decoder-encoder structure. The decoder has several convolution layers of different dimensions: 64 × 64, 32 × 32, 16 × 16, and 8 × 8. The decoder splits the high-dimensional feature vector obtained in step 3) into a plurality of linearly independent sub-dimension vectors and, through these convolution layers of different dimensions, finally obtains an intermediate vector of the sub-dimension vectors. The encoder takes the intermediate vector obtained by the decoder as input, restores the original new feature map through its own convolution layers of different dimensions according to the modulus and direction of the intermediate vector, and performs feature-map fusion using an upsampling layer to obtain the final feature map.
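The decoder's chain of shrinking layer sizes (64 × 64 → 32 × 32 → 16 × 16 → 8 × 8) can be illustrated with 2 × 2 average pooling standing in for stride-2 convolutions; this is an assumption for illustration only, as the patent does not give the layer internals:

```python
import numpy as np

def halve(fmap):
    """2x2 average pooling -- a stand-in for a stride-2 convolution layer."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

fmap = np.random.default_rng(2).normal(size=(64, 64))
sizes = [fmap.shape]
for _ in range(3):   # 64x64 -> 32x32 -> 16x16 -> 8x8
    fmap = halve(fmap)
    sizes.append(fmap.shape)
print(sizes)  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```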
Step 5) comparing the final feature map with the feature maps in the specific database for similarity, selecting the result with the highest similarity, verifying that result, and testing the final accuracy.
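The similarity comparison in step 5) might look like the following nearest-neighbour search over flattened feature maps; cosine similarity is an assumed metric, as the patent does not name one:

```python
import numpy as np

def best_match(query, database):
    """Return the index of the database feature map most similar to the
    query, using cosine similarity on the flattened feature maps."""
    q = query.ravel()
    q = q / np.linalg.norm(q)
    best_i, best_s = -1, -np.inf
    for i, entry in enumerate(database):
        e = entry.ravel()
        s = float(q @ (e / np.linalg.norm(e)))   # cosine similarity
        if s > best_s:
            best_i, best_s = i, s
    return best_i, best_s

rng = np.random.default_rng(3)
db = [rng.normal(size=(8, 8)) for _ in range(5)]
query = db[2] + 0.01 * rng.normal(size=(8, 8))   # nearly identical to entry 2
print(best_match(query, db)[0])  # 2
```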
Step 6) outputting the highest-similarity result selected by the convolutional neural network as video or audio, completing the final search, and feeding the result back to the user by voice.
The invention extracts the voice features of multimedia in multiple dimensions through a time-series neural network, and then searches for videos by comparing the semantics obtained by the convolutional network with the user's requirement, thereby improving the accuracy of voice search. It solves the technical problems that existing voice multimedia search systems do not perform deep search based on the voice information, while users increasingly need related searches targeted at the features in the speech. Such a deep-learning-based system is urgently needed: it can complete multimedia resource recommendation in most real-life scenarios, better optimizes the user experience, and improves applicability.
Although the present invention has been described herein with reference to the illustrated embodiments, which are intended to be preferred embodiments, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure.
Claims (7)
1. A video searching method based on voice characteristics is characterized in that: the method comprises the following steps:
step 1) extracting voice data to be searched into a plurality of multi-dimensional characteristic vectors;
step 2) converting the extracted voice data of the multi-dimensional feature vectors into a plurality of texts;
step 3) importing the plurality of texts obtained in step 2) into a time-series neural network for training to obtain a high-dimensional feature vector;
step 4) performing a regression operation with a convolutional neural network that takes the high-dimensional feature vector as input;
step 5) comparing the feature vectors after the convolutional neural network's regression operation with the corresponding feature vectors in the database, selecting the result with the highest similarity, and testing that result to obtain the final accuracy;
step 6) outputting the highest-similarity result selected in step 5) as video or audio, completing the final search, and feeding the result back to the user by voice.
2. The video searching method based on the voice feature of claim 1, wherein: the training set of the sequential neural network employs a CA8 training set.
3. The video searching method based on the voice feature of claim 2, wherein: the samples of the CA8 training set are randomly split into a training set, a validation set, and a test set at a ratio of 8:1:1.
4. The video searching method based on the voice feature of claim 1, wherein: the time sequence neural network comprises 5 gate control neural units which are connected in series.
5. The video searching method based on the voice feature of claim 4, wherein: the interior of the gated neural unit includes a reset gate and an update gate.
6. The video searching method based on the voice feature of claim 1, wherein: the convolutional neural network is arranged in a U-shaped structure, namely a decoder-encoder structure.
7. The video searching method based on the voice feature of claim 1, wherein: the regression operation flow of the convolutional neural network comprises the following steps: the decoder splits the acquired high-dimensional feature vector into a plurality of linearly independent sub-dimension vectors and obtains an intermediate vector of these sub-dimension vectors through convolution layers of different dimensions; the encoder takes the intermediate vector as input, and the convolutional neural network performs convolution combination to restore the original new feature map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010604506.9A CN111753137B (en) | 2020-06-29 | 2020-06-29 | Video searching method based on voice characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010604506.9A CN111753137B (en) | 2020-06-29 | 2020-06-29 | Video searching method based on voice characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111753137A true CN111753137A (en) | 2020-10-09 |
CN111753137B CN111753137B (en) | 2022-05-03 |
Family
ID=72676733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010604506.9A Active CN111753137B (en) | 2020-06-29 | 2020-06-29 | Video searching method based on voice characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111753137B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107193983A (en) * | 2017-05-27 | 2017-09-22 | 北京小米移动软件有限公司 | Image search method and device |
CN107577985A (en) * | 2017-07-18 | 2018-01-12 | 南京邮电大学 | The implementation method of the face head portrait cartooning of confrontation network is generated based on circulation |
CN109446332A (en) * | 2018-12-25 | 2019-03-08 | 银江股份有限公司 | A kind of people's mediation case classification system and method based on feature migration and adaptive learning |
US10261849B1 (en) * | 2017-08-11 | 2019-04-16 | Electronic Arts Inc. | Preventative remediation of services |
US20190156159A1 (en) * | 2017-11-20 | 2019-05-23 | Kavya Venkata Kota Sai KOPPARAPU | System and method for automatic assessment of cancer |
CN109979429A (en) * | 2019-05-29 | 2019-07-05 | 南京硅基智能科技有限公司 | A kind of method and system of TTS |
CN110059201A (en) * | 2019-04-19 | 2019-07-26 | 杭州联汇科技股份有限公司 | A kind of across media program feature extracting method based on deep learning |
CN110990597A (en) * | 2019-12-19 | 2020-04-10 | 中国电子科技集团公司信息科学研究院 | Cross-modal data retrieval system based on text semantic mapping and retrieval method thereof |
CN111078831A (en) * | 2019-11-06 | 2020-04-28 | 广州荔支网络技术有限公司 | Optimization method for converting audio content into text in text reading |
CN111126059A (en) * | 2019-12-24 | 2020-05-08 | 上海风秩科技有限公司 | Method and device for generating short text and readable storage medium |
CN111191075A (en) * | 2019-12-31 | 2020-05-22 | 华南师范大学 | Cross-modal retrieval method, system and storage medium based on dual coding and association |
CN111241996A (en) * | 2020-01-09 | 2020-06-05 | 桂林电子科技大学 | Method for identifying human motion in video |
CN111312228A (en) * | 2019-12-09 | 2020-06-19 | 中国南方电网有限责任公司 | End-to-end-based voice navigation method applied to electric power enterprise customer service |
CN111488871A (en) * | 2019-01-25 | 2020-08-04 | 斯特拉德视觉公司 | Method and apparatus for switchable mode R-CNN based monitoring |
- 2020-06-29: CN202010604506.9A filed in China; granted as CN111753137B (status: Active)
Non-Patent Citations (1)
Title |
---|
LU Hongyu et al.: "Convolutional neural network feature importance analysis and enhanced feature selection model", Journal of Software (《软件学报》) * |
Also Published As
Publication number | Publication date |
---|---|
CN111753137B (en) | 2022-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111191078B (en) | Video information processing method and device based on video information processing model | |
CN108920497B (en) | Man-machine interaction method and device | |
CN108959627B (en) | Question-answer interaction method and system based on intelligent robot | |
US11151191B2 (en) | Video content segmentation and search | |
CN112989212B (en) | Media content recommendation method, device and equipment and computer storage medium | |
CN111831789A (en) | Question-answer text matching method based on multilayer semantic feature extraction structure | |
CN111552799A (en) | Information processing method, information processing device, electronic equipment and storage medium | |
CN114298121A (en) | Multi-mode-based text generation method, model training method and device | |
CN115495555A (en) | Document retrieval method and system based on deep learning | |
CN113672708A (en) | Language model training method, question and answer pair generation method, device and equipment | |
CN113806588B (en) | Method and device for searching video | |
CN112650842A (en) | Human-computer interaction based customer service robot intention recognition method and related equipment | |
CN113392265A (en) | Multimedia processing method, device and equipment | |
CN115293348A (en) | Pre-training method and device for multi-mode feature extraction network | |
CN116306603A (en) | Training method of title generation model, title generation method, device and medium | |
CN112085120A (en) | Multimedia data processing method and device, electronic equipment and storage medium | |
CN116662565A (en) | Heterogeneous information network keyword generation method based on contrast learning pre-training | |
CN114120166A (en) | Video question and answer method and device, electronic equipment and storage medium | |
CN116977701A (en) | Video classification model training method, video classification method and device | |
CN111753137B (en) | Video searching method based on voice characteristics | |
CN110941958A (en) | Text category labeling method and device, electronic equipment and storage medium | |
CN112861580A (en) | Video information processing method and device based on video information processing model | |
CN113553844B (en) | Domain identification method based on prefix tree features and convolutional neural network | |
CN115062123A (en) | Knowledge base question-answer pair generation method of conversation generation system | |
CN114565804A (en) | NLP model training and recognizing system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |