Comparing Two Powerful Multimodal AI Models

Blog

Comparing Two Powerful Multimodal AI Models

Introduction

With the release of GPT-4o, this model is getting huge attention for its multimodal capabilities. GPT-4o is known for its advanced language processing skills and has been enhanced to interpret and generate visual content. However, we shouldn’t overlook Gemini, a model that has been highly praised for its multimodal abilities long before GPT-4o arrived. Gemini excels at combining image recognition with strong language understanding, making it a worthy competitor to GPT-4o.

In this article, we’ll compare GPT-4o vs Gemini by examining how well they perform in different tasks. By looking at their performance, we aim to determine which model is superior. This comparison is crucial because the ability to work with both text and images is valuable for many applications, such as automatic content creation and data analysis.

GPT -4o vs Gemini

Let’s compare GPT-4o and Gemini to see which one performs better.

Calculate Sum of Numbers

A very basic thing for a multimodal LLM would be to correctly identify the text/numbers written in a given image. So here we’ve provided an image with certain text and have asked our 2 competitors, GPT-4o and Gemini, to calculate the sum of the given numbers in the image. Let’s see who’s going to win this round.

GPT -4o

Here you can see, GPT-4o provided the correct output. Easy peasy.

Gemini

Not sure what Gemini understood with the given prompt. Although it was pretty simple, Gemini couldn’t understand the context.

Result

GPT-4o won!

Code Game Provided in the Image Attached in Python

In this round, we’ve provided an image of a tic-tac-toe game but we’ve not mentioned specifically in the prompt. The task for our models is to first identify the game and then write a python code to implement it.

GPT-4o

def print_board(board):
for row in board:
     print(" | ".join(row))
     print("-" * 9)
def check_winner(board):
# Check rows
for row in board:
     if row[0] == row[1] == row[2] != " ":
         return row[0]
# Check columns
for col in range(3):
     if board[0][col] == board[1][col] == board[2][col] != " ":
         return board[0][col]
# Check diagonals
if board[0][0] == board[1][1] == board[2][2] != " ":
     return board[0][0]
if board[0][2] == board[1][1] == board[2][0] != " ":
     return board[0][2]
return None
def is_board_full(board):
for row in board:
     if " " in row:
         return False
return True
def main():
board = [
     [" ", " ", "O"],
     ["O", "X", " "],
     [" ", " ", " "]
]
]
current_player = "X"  # X's turn to play
print("Initial Board:")
print_board(board)
while True:
     row = int(input(f"Player {current_player}, enter the row (0, 1, 2): "))
     col = int(input(f"Player {current_player}, enter the column (0, 1, 2): "))
     if board[row][col] == " ":
         board[row][col] = current_player
     else:
         print("Cell already taken! Try again.")
         continue
     print_board(board)
     winner = check_winner(board)
     if winner:
         print(f"Player {winner} wins!")
         break
     if is_board_full(board):
         print("It's a draw!")
         break
     current_player = "X" if current_player == "O" else "O"
if __name__ == "__main__":
main()

Output:

The code provided by GPT-4o gave this above output. Great job in identifying the game correctly and creating a grid as well. Although one of the “o” is misplaced in the given output but otherwise we’ve received a fully functional tic-tac-toe game.

Gemini

def display_board(board):
  """Displays the current board state"""
  for row in board:
for cell in row:
   print(cell, end=" ")
print()
def is_valid_move(board, row, col)
  """Checks if a move is valid (i.e. the cell is empty)"""
  return board[row][col] == " "
def make_move(board, player, row, col):
  """Places a player's mark on the board"""
  board[row][col] = player
def is_winner(board, player):
  """Checks if a player has won"""
  # Check rows
  for row in board:
if all(cell == player for cell in row):
   return True
  # Check columns
  for col in range(len(board[0])):
if all(board[row][col] == player for row in range(len(board))):
   return True
  # Check diagonals
  if all(board[row][row] == player for row in range(len(board))):
return True
  if all(board[row][len(board) - row - 1] == player for row in range(len(board))):
return True
  return False
def is_board_full(board):
  """Checks if the board is full"""
  for row in board:
for cell in row:
   if cell == " ":
     return False
  return True
def main():
  """Main game loop"""
  board = [[" " for _ in range(3)] for _ in range(3)]
  current_player = "X"
  while True:
display_board(board)
# Get player move
while True:
   row = int(input("Player {} enter row (1-3): ".format(current_player))) - 1
   col = int(input("Player {} enter column (1-3): ".format(current_player))) - 1
   if is_valid_move(board, row, col):
     break
   else:
     print("Invalid move. Please try again.")
# Make move and check for winner
make_move(board, current_player, row, col)
if is_winner(board, current_player):
   display_board(board)
   print("Player {} wins!".format(current_player))
   break
# Check for tie
if is_board_full(board):
   display_board(board)
   print("It's a tie!")
   break
# Switch player
current_player = "O" if current_player == "X" else "X"
if __name__ == "__main__":
  main()

Output:

Now, when we run the code provided by Gemini, then it identified the game clearly but as you can see no we did not get any grid. This made it difficult to play the game. Below you can check how the game looks without the grid.

Result

GPT-4o won!

Generate Python Code to Recreate Bar Chart using Matplotlib

In this round we’ve given an image of a bar chart to both our models. They need to analyze the chart and generate the Python code required to recreate it using Matplotlib. Additionally, they should ensure that the code produces the same bar chart when run. Let’s see which model wins this round.

GPT-4o

import matplotlib.pyplot as plt
import numpy as np
# Data
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
sales_A = [10, 15, 17, 12]
sales_B = [10, 15, 20, 15]
# Plotting
fig, ax = plt.subplots()
bar_width = 0.35
index = np.arange(len(quarters))
bar1 = ax.bar(index, sales_A, bar_width, label="A")
bar2 = ax.bar(index, sales_B, bar_width, bottom=sales_A, label="B")
ax.set_xlabel('Quarter')
ax.set_ylabel('Sales')
ax.set_title('Sales by Product & Quarter')
ax.set_xticks(index)
ax.set_xticklabels(quarters)
ax.legend()
plt.show()

Output:

Gemini

import matplotlib.pyplot as plt
# Data
products = ["A", "B"]
sales_q1 = [30, 25]
sales_q2 = [40, 35]
sales_q3 = [20, 30]
sales_q4 = [10, 15]
# Create a bar chart
fig, ax = plt.subplots()
width = 0.35  # the width of the bars
index = range(len(products))  # the x-axis positions for the bars
bar1 = plt.bar(index, sales_q1, width, label="Q1")
bar2 = plt.bar([p + width for p in index], sales_q2, width, label="Q2")
bar3 = plt.bar([p + width * 2 for p in index], sales_q3, width, label="Q3")
bar4 = plt.bar([p + width * 3 for p in index], sales_q4, width, label="Q4")
# Add labels and title
plt.xlabel('Products')
plt.ylabel('Sales')
plt.title('Sales by Product & Quarter')
# X-axis labels
plt.xticks([p + 1.5 * width for p in index], products)
# Legend
plt.legend()
plt.show()

Output:

Result

GPT-4o won!

Explain Code and Provide the Output

Here we’ve given an image input to both our models and they have to understand the code written in the provided screenshot and additionally provide the output for the same. Let’s see how they perform on this test.

GPT-4o

Provided a very long summary but here’s the summary and output:

Gemini

Got the below explanation but no output for the code.

Point goes to GPT-4o for properly understanding the prompt and providing correct output as well.

Result

GPT-4o won!

Identify Buttons and Input Fields in the Given Design

In this prompt the models were asked for a detailed analysis of a user interface (UI) design to locate and describe interactive elements such as buttons and input fields. The goal is to specify what each element is, its purpose, and any associated labels or features.

GPT-4o

Impressive how accurately GPT-4o can identify items in a design with a clear understanding of each button, checkbox and textbox.

Gemini

Gemini got the input fields correct but there was some uncertainty in the submit button which was square in shape.

Result

GPT-4o won!

GPT -4o vs Gemini: Final Verdict

Tasks	Winner
Calculating the sum of numbers from an image.	GPT-4o
Writing Python code for a tic-tac-toe game based on an image.	GPT-4o
Creating Python code to recreate a bar chart from an image.	GPT-4o
Explaining code from a screenshot and providing the output.	GPT-4o
Identifying buttons and input fields in a user interface design.	GPT-4o

In this head-to-head comparison, GPT-4o clearly outperformed Gemini. GPT-4o consistently provided accurate and detailed results across all tasks, from calculating sums and coding games to generating bar charts and analyzing UI designs. It showed a strong ability to understand and process both text and images effectively.

Gemini, on the other hand, struggled in several areas. While it performed adequately in some tasks, it often failed to provide detailed explanations or accurate coding. Its performance was inconsistent, highlighting its limitations compared to GPT-4o.

Overall, GPT-4o proved to be the more reliable and versatile model. Its superior performance across multiple tasks makes it the clear winner in this comparison. If you need a model that can handle both text and images with high accuracy, GPT-4o is the better choice. In this article we explored GPT-4o vs Gemini.

I am a data lover and I love to extract and understand the hidden patterns in the data. I want to learn and grow in the field of Machine Learning and Data Science.

Source link

Blog