Every time I mention running AI models locally in a workshop, I see the same mix of curiosity and scepticism. People assume it requires a server farm, a computer science degree, or at least a weekend of fiddly Linux configuration. The reality in 2026 is very different. With tools like Ollama and LM Studio, you can have Google's Gemma 4 running on your own machine in under ten minutes -- no cloud subscription, no data leaving your network, and no usage caps.
I started running models locally about two years ago, initially out of curiosity. Now it is a core part of how I work. For sensitive client data, for offline train journeys, for rapid prototyping without worrying about API costs -- local inference has become genuinely practical. This guide walks you through the entire setup process from scratch, whether you prefer a command-line workflow or a graphical interface.
Why Run AI Locally?
Before diving into the how, it is worth understanding the why. Cloud-based AI tools like ChatGPT, Claude, and Gemini are excellent, and I use them daily. But they come with trade-offs that matter in specific situations:
- Privacy and data control. When you send a prompt to a cloud AI service, your data travels to their servers. For most casual use, this is fine. But if you are working with client financial data, patient records, proprietary research, or any information governed by compliance requirements, keeping everything on your own hardware eliminates an entire category of risk.
- Zero ongoing cost. Cloud AI services either charge per API call or limit free-tier usage. Running Gemma 4 locally costs nothing after the initial setup. There are no token limits, no monthly caps, and no surprise bills. If you make heavy use of AI throughout the day, the savings add up quickly.
- Offline availability. Local models work without an internet connection. I have written entire sections of training materials on flights and train rides using Gemma running on my laptop. Cloud services are useless in those situations.
- Customisation and integration. Running locally gives you full control over the model's behaviour. You can adjust parameters like temperature and context length, create custom system prompts, and integrate the model into your own scripts and applications through a local API.
- No rate limiting. Cloud free tiers throttle usage during peak times. Locally, the only bottleneck is your hardware, and it is consistently available.
The trade-off is that local models are generally slower than cloud-hosted ones (unless you have high-end hardware), and the largest models may not fit on consumer machines. The key is choosing the right model size for your hardware. For a detailed comparison of how Gemma 4 stacks up against cloud services, see my Gemma 4 vs ChatGPT vs Claude vs Copilot comparison.
Hardware Requirements
Gemma 4 comes in several sizes, and the hardware you need depends entirely on which variant you choose. Here is a practical breakdown based on my testing across different machines:
| Gemma 4 Variant | Parameters | Minimum RAM | Recommended GPU VRAM | Disk Space | Suitable For |
|---|---|---|---|---|---|
| Gemma 4 2B | 2 billion | 8 GB | None required (CPU-only works) | ~2 GB | Quick tasks, older laptops, testing |
| Gemma 4 9B | 9 billion | 16 GB | 6 GB+ (recommended) | ~6 GB | General use, good quality-speed balance |
| Gemma 4 12B | 12 billion | 16 GB | 8 GB+ | ~8 GB | Strong reasoning, Excel formulas, writing |
| Gemma 4 27B | 27 billion | 32 GB | 16-24 GB | ~18 GB | Best quality, complex analysis, coding |
A note on Apple Silicon Macs: If you have an M1, M2, M3, or M4 Mac, you are in a particularly good position. Apple Silicon uses unified memory, meaning the GPU and CPU share the same RAM pool. A MacBook Pro with 32 GB of unified memory can comfortably run the 27B model. A MacBook Air with 16 GB handles the 12B model well. The Metal framework acceleration in both Ollama and LM Studio is mature and performs excellently on these machines.
For Windows and Linux users: The most important component is your GPU. An NVIDIA RTX 3060 (12 GB VRAM) or better handles the 12B model comfortably. An RTX 4070 or 4080 handles the 27B model. AMD GPUs work through ROCm support in Ollama, though NVIDIA CUDA remains the more stable option. Running on CPU only is possible for smaller models but will be noticeably slower.
Method 1: Ollama -- The Command-Line Approach
Ollama is the tool I recommend for most power users. It is fast, lightweight, and stays out of your way. It runs as a background service and provides both a CLI and a local REST API, making it easy to integrate into scripts, applications, and automation workflows.
Step 1: Install Ollama
Download Ollama from ollama.com. Installers are available for macOS, Windows, and Linux.
On macOS: Download the .dmg file, drag Ollama to Applications, and launch it. It will install itself and run as a menu bar application.
On Windows: Download the installer and run it. Ollama will install as a system service that starts automatically.
On Linux: The simplest installation is through the install script:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
After installation, verify it is working by opening a terminal and running:
```shell
ollama --version
```
You should see a version number. If you see a "command not found" error on Windows, close and reopen your terminal to pick up the PATH changes.
Step 2: Pull the Gemma 4 Model
With Ollama installed, downloading Gemma 4 is a single command. Choose the variant that matches your hardware:
```shell
# For the 2B model (fastest, lowest requirements)
ollama pull gemma4:2b

# For the 12B model (good balance of quality and speed)
ollama pull gemma4:12b

# For the 27B model (best quality, needs strong hardware)
ollama pull gemma4:27b
```
Download sizes roughly match the disk-space column in the table above -- about 2 GB for the 2B model up to about 18 GB for the 27B model. On a typical broadband connection, the 12B model takes 5-10 minutes to download. Ollama handles quantisation automatically -- the default quantisation (Q4_K_M) provides an excellent balance between quality and performance.
Step 3: Run Your First Prompt
Once the model is downloaded, start an interactive chat session:
```shell
ollama run gemma4:12b
```
This drops you into a chat interface in your terminal. Type a prompt and press Enter:
```
>>> Write an Excel SUMIFS formula that totals column D where column B is "North" and column C is "Electronics"

=SUMIFS(D:D, B:B, "North", C:C, "Electronics")

This formula uses SUMIFS to sum values in column D based on two criteria:
- Column B must equal "North"
- Column C must equal "Electronics"
...
```
To exit the chat, type /bye and press Enter.
Step 4: Use the Ollama API
One of Ollama's best features is its built-in REST API, which runs on localhost:11434 by default. This lets you integrate Gemma 4 into scripts, applications, and workflows. Here is a simple example using curl:
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:12b",
  "prompt": "Write a nested IF formula that classifies exam scores: 90+ is Distinction, 75-89 is Merit, 60-74 is Pass, below 60 is Fail.",
  "stream": false
}'
```
The API is compatible with many tools and libraries. If you use Python, the ollama Python package provides a clean interface:
```python
import ollama

response = ollama.generate(
    model='gemma4:12b',
    prompt='Write an INDEX-MATCH formula with two criteria columns.'
)
print(response['response'])
```
This API compatibility makes Ollama particularly useful for building custom tools -- for example, a batch formula generator that processes a list of Excel formula requests from a text file.
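As a sketch of that batch idea, here is a minimal runner using only the Python standard library. The endpoint and the `gemma4:12b` tag come from the examples above; the helper names and the `requests.txt` file are illustrative, not part of Ollama itself:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_payload(prompt, model="gemma4:12b"):
    """Build the JSON body for a non-streaming /api/generate request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False})


def generate(prompt, model="gemma4:12b"):
    """POST one prompt to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def batch_generate(prompts, ask=generate):
    """Run each formula request through the model in turn."""
    return [ask(p) for p in prompts]


# Usage (requires a running Ollama server):
# formulas = batch_generate(open("requests.txt").read().splitlines())
```

Factoring the network call into the `ask` parameter keeps the batching logic testable without a server running.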
Step 5: Customise With a Modelfile
For recurring use cases, you can create a custom Modelfile that sets a system prompt, adjusts parameters, and gives your configuration a name. Create a file called ExcelHelper with the following content:
```
FROM gemma4:12b
SYSTEM "You are an expert Excel formula assistant. Always provide the formula first, then explain each component. Mention any version requirements (e.g., Microsoft 365 only). Flag potential issues with blank cells or errors. Use British English spelling."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
```
Then create the custom model:
```shell
ollama create excel-helper -f ExcelHelper
```
Now you can use it with:
```shell
ollama run excel-helper
```
The lower temperature (0.3) makes the model more deterministic, which is exactly what you want for formula generation where creativity is less valuable than precision.
Method 2: LM Studio -- The GUI Approach
If you prefer a visual interface over the command line, LM Studio is an excellent alternative. It provides a ChatGPT-style chat interface, a model browser, and a built-in server -- all in a desktop application.
Step 1: Install LM Studio
Download LM Studio from lmstudio.ai. It is available for macOS, Windows, and Linux. The installation is a standard desktop application installer -- no terminal required.
Step 2: Download Gemma 4
Open LM Studio and navigate to the Discover tab (the magnifying glass icon). Search for "Gemma 4" and you will see available variants. Click the download button next to the variant you want. LM Studio shows you the file size, quantisation level, and estimated RAM usage before you download, which helps you choose the right model for your hardware.
I recommend starting with the 12B Q4_K_M quantisation. It offers the best quality-to-performance ratio for most consumer hardware. If your machine has less than 16 GB of RAM, go with the 9B or 2B variant instead.
Step 3: Start a Chat
Switch to the Chat tab (the speech bubble icon). Select your downloaded Gemma 4 model from the dropdown at the top. You will see a familiar chat interface where you can type prompts and receive responses, exactly like ChatGPT but running entirely on your machine.
The first response after loading a model takes a few seconds longer as the model loads into memory. Subsequent responses are faster. You can adjust settings like temperature, max tokens, and system prompt from the right-hand panel.
Step 4: Use the Built-in Server
LM Studio also includes a local server that exposes an OpenAI-compatible API. Navigate to the Server tab (the arrows icon), select your model, and click Start Server. This creates an API endpoint at http://localhost:1234/v1 that is compatible with any tool or library designed for the OpenAI API.
This means you can use LM Studio as a drop-in replacement for OpenAI in many applications by simply changing the base URL to http://localhost:1234/v1 and using any string as the API key.
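A stdlib-only sketch of that drop-in usage, without the OpenAI client library, might look like this. The request shape follows the OpenAI chat completions format; the model name `gemma-4-12b` is a placeholder -- LM Studio shows the exact identifier for each model you load:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address


def chat_payload(prompt, model="gemma-4-12b", temperature=0.3):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def chat(prompt, **kwargs):
    """Send a prompt to LM Studio's local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(prompt, **kwargs)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lm-studio",  # any string works as the key
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Swapping `BASE_URL` back to a cloud provider's endpoint (and using a real key) is the only change needed, which is the point of the OpenAI-compatible server.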
Performance Comparison: Local Gemma 4 vs Cloud AI
Here is what you can realistically expect in terms of performance, based on my testing across different hardware configurations:
| Metric | Gemma 4 12B (Local, RTX 4070) | Gemma 4 27B (Local, RTX 4080) | ChatGPT (GPT-4o, Cloud) | Claude (Cloud) |
|---|---|---|---|---|
| First Token Latency | ~1.5 seconds | ~3 seconds | ~0.5 seconds | ~0.8 seconds |
| Generation Speed | ~20-30 tokens/sec | ~10-15 tokens/sec | ~50-80 tokens/sec | ~40-60 tokens/sec |
| Excel Formula Quality | Very good | Excellent | Excellent | Excellent |
| Cost Per Query | Electricity only | Electricity only | Free tier limited | Paid |
| Rate Limits | None | None | Yes (free tier) | Yes |
| Data Privacy | Full -- nothing leaves your machine | Full -- nothing leaves your machine | Data sent to OpenAI | Data sent to Anthropic |
| Offline Use | Yes | Yes | No | No |
The speed difference is noticeable but not deal-breaking for most workflows. When I use Gemma 4 locally for Excel formula generation, a typical response takes 3-5 seconds including a full explanation. That is slower than ChatGPT's near-instant replies, but fast enough that it does not disrupt my workflow. For a detailed accuracy comparison of Gemma 4 against GPT-4o and Llama 4 specifically for Excel formulas, see my Gemma 4 vs GPT-4o vs Llama 4 for Excel comparison.
When Local Makes Sense vs When Cloud Is Better
Running AI locally is not always the right choice. Here is how I think about the decision:
Use Local Gemma 4 When:
- You are working with sensitive or confidential data that cannot leave your network
- You need consistent, unlimited access without worrying about rate limits or subscription costs
- You are working offline or in environments with unreliable internet
- You want to build custom tools, scripts, or integrations using the local API
- You are doing repetitive batch processing (for example, generating formulas for 50 different Excel scenarios)
- You want full control over model parameters and system prompts
Use Cloud AI (ChatGPT, Claude, Gemini) When:
- You need the fastest possible response times and your data is not sensitive
- You need access to the absolute largest models (GPT-4o, Claude Opus, Gemini Ultra) which are too large for consumer hardware
- You want multimodal capabilities like image analysis or file uploads
- You are working on a machine without a capable GPU (for example, a Chromebook or thin client)
- You need features like web browsing, plugin ecosystems, or built-in tools that cloud platforms provide
In practice, most power users end up using both. I use Gemma 4 locally for formula generation, code assistance, and data analysis prompts. I use cloud services when I need image understanding, very long context windows, or the latest model capabilities. For a deeper dive into where Gemma 4 outperforms even paid cloud models, see 5 real-world tasks where Gemma 4 beats paid AI models.
Practical Tips for Getting the Best Output Quality Locally
Running a model locally gives you more control, but it also means you are responsible for the configuration. Here are the adjustments that make the biggest difference in my experience:
Choose the Right Quantisation
Quantisation reduces the model's memory footprint by using lower-precision numbers. The default Q4_K_M quantisation is the best starting point for most users -- it reduces memory usage by roughly 60% with minimal quality loss. If you have plenty of VRAM, try Q5_K_M or Q6_K for slightly better quality. Avoid Q2 or Q3 quantisations unless you are severely hardware-constrained, as the quality drop becomes noticeable.
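A quick back-of-envelope calculation shows why this matters. A model file's size is roughly parameter count times bits per weight; the ~4.5 bits-per-weight figure for Q4_K_M is a common rule of thumb rather than an exact specification, and the estimate ignores file overhead and mixed-precision layers:

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters x bits per weight, converted to GB."""
    # 1e9 parameters and 1e9 bytes per GB cancel out
    return params_billion * bits_per_weight / 8


# A 12B model at ~4.5 bits/weight (Q4_K_M) vs full 16-bit precision:
q4_size = approx_size_gb(12, 4.5)    # ~6.75 GB
fp16_size = approx_size_gb(12, 16)   # 24.0 GB
```

The roughly 3.5x reduction is what lets a 12B model fit on an 8 GB GPU at all.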
Set an Appropriate Context Length
The context length determines how much conversation history the model can consider. For Excel formula work, 4096 tokens is usually sufficient. For longer conversations or document analysis, increase it to 8192 or beyond. Note that longer context lengths require more memory. In Ollama, set this with the num_ctx parameter. In LM Studio, adjust it in the model settings panel.
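If you call the model through the Ollama API rather than the CLI, the same setting travels in the request's `options` object. A sketch of a request body (the `options.num_ctx` field is documented by Ollama; the prompt is illustrative):

```json
{
  "model": "gemma4:12b",
  "prompt": "Summarise the assumptions in this report...",
  "options": { "num_ctx": 8192 },
  "stream": false
}
```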
Use System Prompts Effectively
A well-crafted system prompt dramatically improves output consistency. For Excel work, I use a system prompt that instructs the model to provide the formula first, explain each component, mention version requirements, and flag potential issues. This eliminates the need to ask follow-up questions for every response.
Lower the Temperature for Technical Tasks
Temperature controls randomness in the model's output. For creative writing, higher temperature (0.7-0.9) adds variety. For technical tasks like formula generation, code writing, or data analysis, lower temperature (0.1-0.4) produces more consistent, deterministic output. I use 0.3 for Excel formula work and 0.2 for code generation.
Keep the Model Loaded
Both Ollama and LM Studio keep the model in memory after the first prompt, making subsequent responses faster. Avoid stopping and restarting the model between prompts. In Ollama, the model stays loaded for 5 minutes of inactivity by default. You can extend this with the OLLAMA_KEEP_ALIVE environment variable.
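For example, to keep models resident for an hour, set the variable in whatever environment launches the Ollama service, then restart it (the `1h` duration format follows Ollama's convention):

```shell
# Keep loaded models in memory for one hour instead of the default five minutes
export OLLAMA_KEEP_ALIVE=1h
```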
Understanding the Relationship Between Gemma and Gemini
One of the most common questions I get is about the difference between Gemma and Gemini. Both are from Google, but they serve very different purposes. Gemma is the open-weight model family designed for local deployment and customisation. Gemini is Google's proprietary cloud model integrated into Google Workspace, Bard, and the Gemini API. For a thorough comparison of capabilities, use cases, and when to choose which, see my dedicated Gemma 4 vs Gemini guide.
Frequently Asked Questions
What hardware do I need to run Gemma 4 locally?
The minimum hardware depends on the model size. The 2B parameter variant runs on 8 GB of RAM with no dedicated GPU. The 12B variant needs 16 GB of RAM and runs best with a GPU that has at least 8 GB of VRAM. The 27B variant requires 32 GB of RAM and a GPU with 16-24 GB of VRAM for comfortable performance. Apple Silicon Macs with unified memory handle these models particularly well.
Is Ollama or LM Studio better for running Gemma 4?
Ollama is better if you prefer a command-line workflow, want to use the API for automation, or need to integrate Gemma 4 into scripts and applications. LM Studio is better if you want a visual chat interface, prefer downloading and managing models through a GUI, or are new to running local AI models. Both are free and both produce identical output quality for the same model.
How does local Gemma 4 compare to cloud ChatGPT in speed?
Cloud ChatGPT is faster for most users because OpenAI runs it on enterprise-grade hardware. Local Gemma 4 on a mid-range GPU (RTX 4070 or similar) generates roughly 15-30 tokens per second with the 12B model, which feels responsive for conversational use. The 27B model is slower, at 8-15 tokens per second on comparable hardware. The trade-off is that local runs have no rate limits and complete privacy.
Can I use Gemma 4 locally for Excel formula generation?
Yes, and it works very well. The 12B and 27B variants of Gemma 4 produce accurate Excel formulas including SUMIFS, INDEX-MATCH, nested IFs, and array formulas. You can prompt it through the Ollama CLI, through LM Studio's chat interface, or by building a simple script that calls the Ollama API. See my companion guide on Gemma 4 vs GPT-4o vs Llama 4 for Excel for detailed accuracy comparisons.
Sources & Further Reading
- Google Gemma -- Official Documentation
- Ollama -- Official Site and Downloads
- LM Studio -- Official Site and Downloads
- Google Models on Hugging Face
Related Posts
- Gemma 4 vs Gemini: What's the Difference and When to Use Which
- Gemma 4 vs ChatGPT vs Claude vs Copilot: Best AI Model Comparison in 2026
- Gemma 4 vs GPT-4o vs Llama 4: Which Free AI Model Is Best for Excel Formulas?
- 5 Real-World Tasks Where Gemma 4 Beats Paid AI Models
- Gemma 4 for Data Analysis: Can It Replace ChatGPT for Spreadsheet Work?
Want to use AI tools more effectively?
My courses cover practical AI workflows, from spreadsheet formulas to app development, with real projects and honest tool comparisons.
Browse all courses