Microsoft Word - assignment.docx


Project Aim Statement:
Goal - My project, which works towards a long-term goal in artificial intelligence, is to build a "multimodal" neural
network, an active research area in AI. A multimodal system is an AI system that learns concepts across
multiple modalities, primarily text and images, in order to better understand the world.
Objective - To develop hypotheses and working concepts for multimodal inference in scenarios where
questions about images or videos are posed as text and the model can answer them adequately.
The hypothesis will be based on Visual Question Answering (VQA).

Visual Question Answering (VQA) is a task that combines computer vision, natural language processing, and
deep learning.
VQA is the task of answering free-form natural-language questions about visual (image/video) content.
Answering these questions requires a wide range of skills, including accurate localization and
recognition of objects and people, understanding of their activities, and common-sense reasoning.

Task - Given an image, a visual question-answering algorithm allows the machine to answer free-form,
open-ended, natural-language questions about the image.
Approach towards Task:

Baseline Model – Bag of words (BOW) is a natural language processing (NLP) strategy for converting a text
document into numbers that a computer program can use. A BOW model of the question (BOW Q) is therefore a
suitable baseline.

BOW Q (Bag of words)
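
To make the baseline concrete, the following is a minimal sketch of a bag-of-words encoding for questions, written in plain Python. The function names (`tokenize`, `build_vocab`, `bow_vector`) and the two example questions are illustrative assumptions, not part of the original brief.

```python
import re
from collections import Counter

def tokenize(question):
    """Lowercase the question and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", question.lower())

def build_vocab(questions):
    """Map each distinct word in the given questions to a vector index."""
    words = sorted({w for q in questions for w in tokenize(q)})
    return {w: i for i, w in enumerate(words)}

def bow_vector(question, vocab):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(tokenize(question))
    vec = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:
            vec[vocab[word]] = n
    return vec

# Hypothetical training questions, used only to build a tiny vocabulary.
questions = ["What colour is the cat?", "Is the cat on the mat?"]
vocab = build_vocab(questions)  # cat, colour, is, mat, on, the, what
vec = bow_vector("Is the cat on the mat?", vocab)
print(vec)  # [1, 0, 1, 1, 1, 2, 0]
```

Word order is discarded in this representation, which is why it serves as a baseline rather than as the final language encoder.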

Visual Features - Convolutional Neural Networks (CNNs) are commonly used for image classification. The CNNs
listed below have been chosen for visual representation.

§ ResNet
§ Inception
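
A CNN backbone such as ResNet or Inception ends in a stack of feature maps, and a common way to turn them into one fixed-length image vector is global average pooling. The sketch below shows only that pooling step on a toy feature map in plain Python. The function name `global_average_pool` and the toy values are illustrative assumptions; a real pipeline would take the feature maps from a pretrained network.

```python
def global_average_pool(feature_maps):
    """Average each channel's H x W grid into a single number.

    feature_maps: list of channels, each a list of rows of floats (C x H x W).
    Returns a list of C floats, one per channel.
    """
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Toy 2-channel, 2x2 feature map (hypothetical values, not real CNN output).
toy_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
image_vector = global_average_pool(toy_maps)
print(image_vector)  # [4.0, 1.0]
```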

Language Features – A Recurrent Neural Network (RNN) is a deep learning architecture widely used for
modeling sequential information.
§ LSTM Q
§ Bi-LSTM Q
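
As a rough illustration of the two language encoders, the sketch below implements one LSTM step in plain Python and uses it to encode a question in one direction (LSTM) and in both directions (Bi-LSTM). The weights are random and untrained, and the helper names, toy word embeddings, and sizes are assumptions made for the example only.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_lstm(input_size, hidden_size, seed):
    """Small random weights for the input (i), forget (f), output (o) and candidate (g) gates."""
    rng = random.Random(seed)
    cols = input_size + hidden_size
    return {
        gate: {
            "W": [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(hidden_size)],
            "b": [0.0] * hidden_size,
        }
        for gate in "ifog"
    }

def affine(layer, z):
    """Compute W z + b for one gate."""
    return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(layer["W"], layer["b"])]

def lstm_step(params, x, h, c):
    """One LSTM time step: update the cell state c and the hidden state h."""
    z = x + h  # list concatenation of the input and the previous hidden state
    i = [sigmoid(v) for v in affine(params["i"], z)]
    f = [sigmoid(v) for v in affine(params["f"], z)]
    o = [sigmoid(v) for v in affine(params["o"], z)]
    g = [math.tanh(v) for v in affine(params["g"], z)]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new

def encode(params, sequence, hidden_size):
    """Run the LSTM over the sequence and return the final hidden state."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    return h

def bilstm_encode(fwd, bwd, sequence, hidden_size):
    """Concatenate the final states of a forward pass and a backward pass."""
    return encode(fwd, sequence, hidden_size) + encode(bwd, sequence[::-1], hidden_size)

# Hypothetical 3-dimensional word embeddings for a three-word question.
embeddings = {"what": [0.1, 0.3, -0.2], "is": [0.0, -0.1, 0.4], "this": [0.5, 0.2, 0.1]}
sequence = [embeddings[w] for w in ["what", "is", "this"]]
hidden_size = 4
fwd = init_lstm(3, hidden_size, seed=0)
bwd = init_lstm(3, hidden_size, seed=1)
lstm_q = encode(fwd, sequence, hidden_size)               # 4 values
bilstm_q = bilstm_encode(fwd, bwd, sequence, hidden_size)  # 8 values
```

The Bi-LSTM vector is twice as long because it keeps one summary read left to right and one read right to left.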

Fusion Model - The information from the text and image encoders is fused into a combined representation
to perform the downstream task. Late fusion (decision-level fusion) can be implemented because the features
(for example, shape and colour) are computed separately and then concatenated.

§ Fuse B + MLP
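
The fusion step can be sketched as concatenating an image feature vector with a question feature vector and passing the result through a small multilayer perceptron (MLP) that scores candidate answers. This is a minimal sketch under stated assumptions: the layer sizes, the answer list, the feature values, and the helper names are all hypothetical, and the weights are random rather than trained, so the predicted answer is not meaningful.

```python
import math
import random

def linear(weights, bias, x):
    """Compute W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    """Turn scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_mlp(in_size, hidden_size, num_answers, seed=0):
    """Random, untrained weights for a one-hidden-layer MLP."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden_size, in_size), "b1": [0.0] * hidden_size,
        "W2": matrix(num_answers, hidden_size), "b2": [0.0] * num_answers,
    }

def fuse_and_classify(image_feat, question_feat, mlp):
    """Concatenate the two modality vectors, then classify over the answer set."""
    fused = image_feat + question_feat  # list concatenation
    hidden = relu(linear(mlp["W1"], mlp["b1"], fused))
    return softmax(linear(mlp["W2"], mlp["b2"], hidden))

answers = ["yes", "no", "cat"]        # hypothetical answer vocabulary
image_feat = [0.2, 0.7, 0.1, 0.5]     # stand-in for CNN features
question_feat = [1.0, 0.0, 1.0]       # stand-in for BOW or LSTM features
mlp = init_mlp(in_size=7, hidden_size=5, num_answers=len(answers))
probs = fuse_and_classify(image_feat, question_feat, mlp)
predicted = answers[probs.index(max(probs))]
```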

Staggered Aim:

As my interest in this project grew, I planned to implement the approach described above to
achieve the goal of the project. However, I would also like to propose Deep Modular
Co-Attention Networks (MCAN) for Visual Question Answering as an alternative in case bias issues arise while
progressing. To prevent bias, I will monitor for it closely and, if it appears, immediately
stop and switch to the alternative plan or a combination of other approaches before making the final decision.


§ The expected outcome is that the multimodal reasoning model outperforms the baseline in terms of
accuracy, without any bias.
§ The staggered aim outlined above is a state-of-the-art approach that performed better than the
previously introduced approaches (e.g., prior SOTA VQA models) and was proposed with better accuracy. This
approach is expected to fulfill the requirement of the Objective, as it is deep and dense,
built on the 6-layer MCAN with an encoder-decoder strategy.
§ The model will be evaluated on validation performance, the best-performing approach, and accuracy without any
bias, to fulfill the requirement of the task.
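
The accuracy comparison against the baseline can be sketched as follows. The first function is plain exact-match accuracy. The second is the consensus-style score associated with the VQA benchmark in its commonly quoted simplified form, min(1, matches / 3); using it here is an assumption, since the text above only says "accuracy". All predictions and answers below are made-up examples, not results.

```python
def normalize(answer):
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return answer.strip().lower()

def accuracy(predictions, ground_truths):
    """Fraction of predictions that exactly match the ground-truth answer."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

def consensus_accuracy(prediction, human_answers):
    """Score one prediction as min(1, number of matching human answers / 3)."""
    matches = sum(normalize(prediction) == normalize(a) for a in human_answers)
    return min(1.0, matches / 3)

# Hypothetical validation answers and model outputs, for illustration only.
ground_truths = ["cat", "2", "yes", "red"]
baseline_preds = ["dog", "2", "yes", "blue"]
model_preds = ["cat", "2", "yes", "blue"]

baseline_acc = accuracy(baseline_preds, ground_truths)  # 0.5
model_acc = accuracy(model_preds, ground_truths)        # 0.75
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f}")
```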



Data link: - The data can be chosen from any available source, such as Kaggle, Hugging Face, VQA, or COCO.
1. Suitable computer vision task(s) (object detection, object classification, face detection) to create a representation
using various facets (multifaceted image representation), along with a question representation, followed by an answer
classification network.
2. A co-attention model to generate question attention and image attention; one attention model guides the other.
Questions can be encoded at various levels, such as word level, phrase level, or question level.
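
The idea that one attention guides the other can be sketched with simple dot-product attention: a summary of the question first attends over image regions, and the attended image vector then attends over the question words. This is a simplified illustration under stated assumptions, not the MCAN architecture; the function names, the mean-pooled question summary, and the 2-dimensional toy features are all hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def attend(query, items):
    """Weight each item by its dot-product similarity to the query."""
    weights = softmax([dot(query, item) for item in items])
    context = [sum(w * item[d] for w, item in zip(weights, items)) for d in range(len(query))]
    return context, weights

def co_attention(word_feats, region_feats):
    """Question guides image attention, then the attended image guides question attention."""
    question_summary = mean_vector(word_feats)
    image_context, image_weights = attend(question_summary, region_feats)
    question_context, word_weights = attend(image_context, word_feats)
    return question_context, image_context, word_weights, image_weights

# Toy word-level and region-level features sharing one 2-d space (made-up values).
word_feats = [[1.0, 0.0], [0.2, 0.1]]
region_feats = [[2.0, 0.0], [0.0, 2.0]]
q_ctx, v_ctx, word_weights, image_weights = co_attention(word_feats, region_feats)
```

In this toy input the first region and the first word receive the larger weights, because they point in the same direction in the shared feature space. The same `attend` function could be applied to phrase-level or question-level features instead of word-level ones.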
Outcome: - The expected outcome is a working model that takes a question about an image/video as text input. It will
respond to the question by performing natural language processing and object detection/classification as
necessary. An analysis of the model (including accuracy) against the baseline is also expected.




Expected Outcomes | Deliverables:
Answered 1 days After Aug 29, 2022

Solution

Aditi answered on Aug 30 2022