Question Generation (QG) is the task of generating questions from an input context. It can be approached in several ways, ranging from conventional rule-based systems to more recent sequence-to-sequence models. Most QG systems, however, are restricted in the input they accept, which is typically text alone. Multimodal QG, in contrast, covers several input types, such as text, images, tables, video, and even audio. In this paper, we present a method for the Multimodal Question Generation task that attaches a Multimodal Adaptation Gate (MAG) to a BERT-based model. Experimental results show that the proposed method successfully performs Multimodal Question Generation: the generated questions achieve a BLEU-4 score of 16.05 and a ROUGE-L score of 28.27, and human evaluation of the model's output yields 55% fluency and 53% relevance.
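For illustration, the following is a minimal PyTorch sketch of how a Multimodal Adaptation Gate can fuse a nonverbal feature into BERT token embeddings, following the general MAG design of Rahman et al. (2020): a gating vector computed from the concatenated text and visual features scales a projected visual displacement, which is then added to the token embedding with a norm-bounded weight. The visual-only setup, dimensions, and hyperparameters (`beta`, `dropout`) are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class MultimodalAdaptationGate(nn.Module):
    """Sketch of a Multimodal Adaptation Gate (MAG) for a single
    nonverbal (visual) modality; dimensions are assumptions."""

    def __init__(self, text_dim=768, visual_dim=2048, beta=1.0, dropout=0.1):
        super().__init__()
        self.gate = nn.Linear(text_dim + visual_dim, text_dim)  # g = ReLU(W_g [z; v])
        self.proj = nn.Linear(visual_dim, text_dim)             # W_v v
        self.beta = beta
        self.norm = nn.LayerNorm(text_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, z, v):
        # z: (batch, seq_len, text_dim)   BERT token embeddings
        # v: (batch, seq_len, visual_dim) token-aligned visual features
        g = torch.relu(self.gate(torch.cat([z, v], dim=-1)))  # gating vector
        h = g * self.proj(v)                                   # gated displacement
        # Scale the shift so it never dominates the text embedding.
        alpha = torch.clamp(
            self.beta * z.norm(dim=-1, keepdim=True)
            / (h.norm(dim=-1, keepdim=True) + 1e-6),
            max=1.0,
        )
        return self.dropout(self.norm(z + alpha * h))
```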