CLOSED: Vision and Language Integration Meets Multimedia Fusion – Call for Papers


Guest Editors
Marie-Francine Moens, KU Leuven
Katerina Pastra, Cognitive Systems Research Institute
Kate Saenko, Boston University
Tinne Tuytelaars, KU Leuven

Submission deadline: CLOSED
Publication: April–June 2018

Multimodal information fusion, at both the signal and semantics levels, is a core part of most multimedia applications, such as indexing, retrieval, and summarization. Early or late fusion of modality-specific processing results has been addressed in prototypes since the early days of multimedia through various methodologies, including rule-based approaches, information-theoretic models, and machine learning. Vision and language are two of the predominant modalities being fused, and they have attracted special attention in international challenges with a long history of results (such as TRECVid and ImageCLEF).

During the last decade, vision-language semantic integration has attracted attention from research communities that have traditionally worked separately, such as computer vision and natural language processing, because one modality can greatly assist the processing of another by providing cues for disambiguation, complementary information, and noise/error filtering. The latest boom of deep-learning methods has opened up new directions in joint modeling of visual and co-occurring verbal information in multimedia discourse. This evolution provides an opportunity for studying the fusion of language and visual data in a deep-learning framework. Such fusion might also require machine-learning adaptations or novel approaches to deal with highly diverse and unstructured language and visual data that can be complementary, redundant, noisy, of differing certainty, and even contradictory.

This special issue of IEEE MultiMedia will explore multimedia vision and language fusion in theory and application. We solicit original research, reviews, and opinions on vision-language integration models, methodologies, and challenges. Topics of interest include (but are not limited to)

• models at the signal or semantic level;
• methodologies such as deep-learning methods, cognitive modeling, and distributional semantics;
• fusion in multimedia documents, including audiovisual documents and captioned image archives; and
• models in diverse domains, such as cultural heritage, e-commerce, education, health, and social media.

We welcome papers on model integration within the following applications:

• multimedia search, retrieval, and question answering
• multimedia annotation and indexing
• multimedia recommendation and summarization
• multimodal translation between language and vision
• multisensory human-computer and human-robot interaction

Submission Guidelines

See the author guidelines for submission requirements. Submissions should not exceed 6,500 words, including all text, the abstract, keywords, bibliography, biographies, and table text. Each table and figure counts for 200 words. Submit electronically through ScholarOne Manuscripts, selecting this special-issue option.


Contact the guest editors at