1. Ask Your Neurons
    We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose Neural-Image-QA, an end-to-end formulation of this problem for which all parts are trained jointly. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language input (image and question). Our approach Neural-Image-QA doubles the performance of the previous best approach on this problem. We provide additional insights into the problem by analyzing how much information is contained only in the language part, for which we provide a new human baseline. To study human consensus, which is related to the ambiguities inherent in this challenging task, we propose two novel metrics and collect additional answers, which extend the original DAQUAR dataset to DAQUAR-Consensus.

  2. Networks with attention
    Attention plays an important role in human life, but how can we enrich machines with such a capability? The two main motivations behind an attention mechanism are higher network capacity and speed. This talk discusses a few publications, grouped into three categories, that attempt to couple attention with deep architectures.

  3. Semantic parsing via paraphrasing
    The holy grail of NLP is language understanding by machines. But how should meaning be represented? Semantic parsers represent it in a pre-defined formal language that machines can easily execute (e.g. SQL), and then learn to map textual questions to formulas in that language in order to retrieve the answer from a knowledge base. Although an expensive corpus of textual questions paired with their corresponding formulas was originally required, this talk discusses a very recent approach that trains a semantic parser solely from textual question-answer pairs.

  4. Visual Turing Test - Beginning
    My first slides on a Visual Turing Test. I start with the question of whether machines can answer questions about images, then create a suitable dataset, search for reasonable evaluation metrics, and draw inspiration from previous work on semantic parsing and grounding.