Class-dependent and cross-modal memory network considering sentimental features for video-based captioning
Blog Article
The video-based commonsense captioning task augments video captions with multiple commonsense descriptions to better understand video content. This paper focuses on the importance of cross-modal mapping. We propose a combined framework, the Class-dependent and Cross-modal Memory Network considering SENtimental features (CCMN-SEN), for video-based captioning to enhance commonsense caption generation.
First, we develop a class-dependent memory that records the alignment between video features and text; it permits cross-modal interactions and generation only on cross-modal matrices that share the same labels. Then, to capture the sentiments conveyed in videos and generate accurate captions, we incorporate sentiment features to facilitate commonsense caption generation.
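The label-restricted cross-modal interaction described above can be sketched as attention masked by class-label agreement. The function name, feature shapes, and numpy implementation below are illustrative assumptions for intuition, not the paper's actual implementation:

```python
import numpy as np

def label_masked_cross_modal_attention(video_feats, text_feats,
                                       video_labels, text_labels):
    """Cross-modal attention in which a video feature may only attend to
    text features that share its class label (a sketch of the idea of
    restricting interactions to same-label cross-modal pairs)."""
    # Similarity matrix between video and text features: (Nv, Nt).
    scores = video_feats @ text_feats.T
    # Boolean mask: True where the video and text labels agree.
    mask = video_labels[:, None] == text_labels[None, :]
    # Suppress disallowed pairs with a large negative score.
    scores = np.where(mask, scores, -1e9)
    # Row-wise softmax over the allowed entries.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    exp = np.where(mask, exp, 0.0)
    denom = exp.sum(axis=1, keepdims=True)
    # Rows with no same-label partner receive all-zero weights.
    weights = np.divide(exp, denom, out=np.zeros_like(exp), where=denom > 0)
    # Aggregate text features into the video stream.
    return weights @ text_feats
```

A video feature with label 0 thus ignores all text features labeled 1, so memory read-out never mixes information across classes.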
Experimental results demonstrate that the proposed CCMN-SEN significantly outperforms state-of-the-art methods. These results have practical significance for better understanding of video content.