How to Use Gemma 4 for Local AI Workflows

By Sagnik Bhattacharya 15 Mar 2026 5 min read


Gemma 4 running locally means your data never leaves your machine. That makes it practical for workflows involving sensitive files, proprietary code, or data you simply do not want to upload to a cloud API.


This guide covers specific local AI workflows you can build with Gemma 4 — not just chatting with the model, but automating real tasks on your own files.

Quick answer

Run Gemma 4 locally with Ollama or LM Studio, then build scripts or pipelines that send your files and prompts to the local model. Common workflows include code review, document summarisation, data extraction, and batch text processing.

Use this approach when:
You need AI processing but cannot send data to external APIs.
You want to automate repetitive tasks on local files.
You are building internal tools that need AI without ongoing API costs.

Setting up Gemma 4 for workflow use

Install Ollama or LM Studio and pull the Gemma 4 model. For workflow automation, Ollama is the right choice because it exposes a plain HTTP API on localhost:11434 that you can call from any language — no SDK lock-in, no cloud relay.
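To make the "plain HTTP API" point concrete, here is a minimal sketch that calls the local server with nothing but the Python standard library. It assumes Ollama's /api/generate endpoint with streaming turned off. The names build_request and generate_text and the 300-second timeout are illustrative choices of mine, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(prompt: str, model: str = "gemma4:12b") -> urllib.request.Request:
    """Build a POST request for Ollama's generate endpoint (non-streaming)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_text(prompt: str, model: str = "gemma4:12b") -> str:
    """Send the prompt to the local server and return the generated text."""
    # The timeout is an assumed value; long inputs on slow hardware may need more.
    with urllib.request.urlopen(build_request(prompt, model), timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate_text("Say hello in five words.") should return the model's reply as a string. The same JSON body works from curl or any other HTTP client.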

Choose the right model size for your hardware. Gemma 4 12B works on most machines with 16GB RAM and handles extraction, summarisation, and code review cleanly. Drop to 4B on 8GB machines. Step up to 27B only if you have 32GB+ RAM and quality matters more than throughput — the 27B pass is roughly three times slower per request.

Three-minute setup check

Install Ollama from ollama.com/download, then pull the model: ollama pull gemma4:12b.
Verify the server is running: curl http://localhost:11434/api/tags — you should see gemma4:12b in the JSON response.
Install the Python client once per project: pip install ollama. That is the only dependency you need for every pattern below.
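If you would rather run the second check from a script than from curl, something like the sketch below should do. It assumes the /api/tags response is a JSON object with a "models" list whose entries carry a "name" field. The helper names has_model and fetch_tags are hypothetical.

```python
import json
import urllib.request


def has_model(tags: dict, wanted: str) -> bool:
    """Return True if the /api/tags payload lists the wanted model tag."""
    return any(m.get("name") == wanted for m in tags.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from a running server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.loads(resp.read())
```

Call has_model(fetch_tags(), "gemma4:12b") at the top of a batch script. If fetch_tags raises a connection error, the server is not running. If has_model returns False, the tag has not been pulled.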

If you need a deeper walkthrough of the install, see how to run Gemma 4 locally. The rest of this tutorial assumes the three-step check above passes.

File processing workflows

The most immediately useful local AI workflow is batch file processing — sending multiple files through the model for analysis, extraction, or transformation.

Write a simple script that reads each file, sends it to the local model with a specific prompt, and saves the result.

Code review: analyse files for bugs, style issues, or security concerns
Document summarisation: condense long documents into key points
Data extraction: pull structured data from unstructured text files
Content tagging: classify or tag files based on their content

Four ready-to-run pipeline patterns

Most reliable local-AI batch workflows boil down to four shapes. Each one is a handful of lines of Python you can lift straight into a script. All four assume you ran the pip install ollama step above.

The shared helper used in every pattern below — keep it in one place and call it from each script:

from ollama import generate

MODEL = "gemma4:12b"

def ask_gemma(prompt: str, context: str = "") -> str:
    full_prompt = f"{context}\n\n{prompt}" if context else prompt
    return generate(model=MODEL, prompt=full_prompt)["response"]

Pattern 1 — Map: one prompt per file, results to disk

The workhorse pattern. Use it for summarisation, classification, code review, anything where each file stands alone. Writing the result to disk per iteration matters — if the batch crashes at file 73, you still have files 1 through 72 on disk.

from pathlib import Path

OUT = Path("summaries")
OUT.mkdir(exist_ok=True)

for path in Path("docs").glob("*.txt"):
    out_file = OUT / f"{path.stem}.md"
    if out_file.exists():
        continue  # resume-safe: skip files already done
    summary = ask_gemma(
        "Summarise in 3 bullet points. Return markdown only.",
        path.read_text(encoding="utf-8"),
    )
    out_file.write_text(summary, encoding="utf-8")
    print(f"✓ {path.name}")

Pattern 2 — Filter: route files by their content

Ask the model a yes/no question per file and move the file to a folder based on the answer. Useful for triaging log files, sorting support tickets, or flagging code that touches sensitive modules.

import shutil
from pathlib import Path

FLAGGED = Path("flagged"); FLAGGED.mkdir(exist_ok=True)
CLEAN = Path("clean"); CLEAN.mkdir(exist_ok=True)

for path in Path("inbox").glob("*.log"):
    verdict = ask_gemma(
        "Reply with exactly one word: FLAG or CLEAN. "
        "Reply FLAG if the log shows a stack trace, auth failure, or "
        "data-loss error. Otherwise CLEAN.",
        path.read_text(encoding="utf-8")[:6000],
    ).strip().upper()
    dest = FLAGGED if verdict.startswith("FLAG") else CLEAN
    shutil.move(str(path), dest / path.name)

Two tricks here: forcing single-word output keeps parsing trivial, and trimming to the first 6,000 characters prevents context-window overflow on huge log files.
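One caveat with the routing line above: anything that does not start with FLAG lands in the clean folder, including a rambling or empty reply. A stricter variant could return None for unexpected replies so the file stays in the inbox for a second look. This is a sketch only, and parse_verdict is a hypothetical helper.

```python
from typing import Optional


def parse_verdict(raw: str) -> Optional[str]:
    """Map a model reply to "FLAG" or "CLEAN", or None if it is neither."""
    words = raw.strip().upper().split()
    if not words:
        return None
    # Drop punctuation or markdown emphasis the model may add around the word.
    first = words[0].strip(".,:;!\"'*`")
    return first if first in {"FLAG", "CLEAN"} else None
```

In the loop, move the file only when parse_verdict returns a value, and print the filename otherwise.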

Pattern 3 — Extract: unstructured text to structured JSON

Ask for JSON, parse it, and use it. This turns Gemma 4 into a schema-aware extraction tool that runs entirely offline — no third-party OCR-to-JSON service touching your contracts, CVs, or invoices.

import json
from pathlib import Path

rows = []
for path in Path("invoices").glob("*.txt"):
    raw = ask_gemma(
        "Extract this invoice as JSON with keys: vendor, invoice_number, "
        "date_iso, total_gbp. Return only the JSON object, no prose.",
        path.read_text(encoding="utf-8"),
    )
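    try:
        # Strip a markdown fence if present: the backticks and the "json" language tag.
        # str.removeprefix needs Python 3.9 or newer.
        cleaned = raw.strip().strip("`").removeprefix("json").strip()
        rows.append(json.loads(cleaned))
    except json.JSONDecodeError:
        print(f"⚠ parse failed: {path.name}")

Path("invoices.json").write_text(json.dumps(rows, indent=2), encoding="utf-8")

Two habits that save hours: log the filename when JSON parsing fails (so you know which file to inspect), and strip the backticks and the leading json tag in case the model wraps output in a markdown code fence. Stripping backticks alone is not enough, because the language tag left behind still breaks json.loads.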
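Valid JSON is not the same as a usable row. The model can drop a key or return a date in the wrong format. A small validation pass, sketched here against the four keys named in the prompt above, catches that before the data goes anywhere else. The function name invoice_problems and the exact checks are my assumptions.

```python
from datetime import date

REQUIRED_KEYS = {"vendor", "invoice_number", "date_iso", "total_gbp"}


def invoice_problems(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row looks usable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(row))]
    if "date_iso" in row:
        try:
            date.fromisoformat(str(row["date_iso"]))  # expects YYYY-MM-DD
        except ValueError:
            problems.append(f"bad date_iso: {row['date_iso']!r}")
    if "total_gbp" in row:
        try:
            float(row["total_gbp"])  # accepts numbers and numeric strings
        except (TypeError, ValueError):
            problems.append(f"bad total_gbp: {row['total_gbp']!r}")
    return problems
```

Append a row only when invoice_problems returns an empty list, and log the filename with the problems otherwise.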

Pattern 4 — Map-reduce: documents larger than the context window

When a file exceeds Gemma 4's context window, split it into chunks, summarise each chunk, then summarise the summaries. This keeps the first pass cheap and lets the second pass see the whole document compressed into its key points.

from pathlib import Path

def chunk_by_paragraphs(text: str, size: int = 20) -> list[str]:
    paras = [p for p in text.split("\n\n") if p.strip()]
    return ["\n\n".join(paras[i:i+size]) for i in range(0, len(paras), size)]

long_doc = Path("report.md").read_text(encoding="utf-8")
partials = [
    ask_gemma("Summarise in 5 bullet points:", chunk)
    for chunk in chunk_by_paragraphs(long_doc)
]
final = ask_gemma(
    "Merge these section summaries into one 10-bullet executive summary.",
    "\n\n---\n\n".join(partials),
)
Path("report-summary.md").write_text(final, encoding="utf-8")
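On very long documents the joined section summaries can themselves outgrow the context window. One way to handle that, sketched below, is to merge summaries in small groups until the joined text fits a character budget, then run the final merge. The function reduce_summaries, the 12,000-character budget, and the group size of 5 are assumptions for illustration.

```python
from typing import Callable

SEPARATOR = "\n\n---\n\n"


def reduce_summaries(
    partials: list[str],
    summarise: Callable[[str], str],
    max_chars: int = 12000,
    group_size: int = 5,  # must be 2 or more so each round shrinks the list
) -> str:
    """Merge summaries in groups until the joined text fits, then merge once more."""
    joined = SEPARATOR.join(partials)
    while len(joined) > max_chars and len(partials) > 1:
        # One reduction round: each group of summaries becomes a single summary.
        partials = [
            summarise(SEPARATOR.join(partials[i:i + group_size]))
            for i in range(0, len(partials), group_size)
        ]
        joined = SEPARATOR.join(partials)
    return summarise(joined)
```

Here summarise is any function that takes text and returns a summary, for example a lambda that passes the text to ask_gemma with the merge prompt.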

Performance and context limits

Local models are slower than cloud APIs. Plan for 10-30 seconds per request depending on input length and hardware. For batch jobs, this adds up — 100 files at 20 seconds each is over 30 minutes.
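The arithmetic is worth running before you start a batch. A tiny sketch, assuming requests run one after another and using the 10 to 30 second range quoted above:

```python
def batch_minutes(n_files: int, seconds_per_request: float) -> float:
    """Estimated wall-clock minutes for a sequential batch."""
    return n_files * seconds_per_request / 60


for secs in (10, 20, 30):
    print(f"100 files at {secs}s each: about {batch_minutes(100, secs):.0f} minutes")
```

Time a handful of real files first and plug in your own average, since the per-request figure depends heavily on input length and hardware.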

Watch context window limits. Gemma 4 handles 8K-128K tokens depending on the variant. If your files are large, split them or summarise sections separately.
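A cheap pre-flight check can tell you which files need splitting. The sketch below uses a rough rule of thumb of about four characters per token, which is an assumption and not a real tokenizer count. It also reserves some room for the prompt and the reply. Both function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the 4-characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def fits_context(text: str, context_tokens: int, reserve_tokens: int = 1024) -> bool:
    """True if the text likely fits, leaving reserve_tokens for the prompt and the reply."""
    return estimate_tokens(text) <= context_tokens - reserve_tokens
```

Treat a False result as a signal to chunk the file, and leave a generous margin because the estimate can be off for code or non-English text.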

When to stay local vs. use an API

Use local Gemma 4 when privacy matters, when you are processing many files (avoiding per-token API costs), or when you need to work offline. Use a cloud API when you need the fastest possible response, the highest quality model, or when the data is not sensitive.

You can also build hybrid workflows — use local Gemma for initial processing and a cloud API for final quality checks on the most important outputs.

Worked example: local code documentation generator

You point a script at your project directory. It sends each source file to local Gemma 4, which generates a one-paragraph summary of what the file does. The script collects all summaries into a project overview document. Total cost: zero API fees. Total data leaked: none.
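A minimal sketch of that generator might look like the following. It takes the model call as a parameter so it works with the shared ask_gemma helper (prompt first, context second). The function name build_overview, the file suffixes, and the 12,000-character cap per file are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable


def build_overview(
    root: Path,
    ask: Callable[[str, str], str],
    suffixes: tuple[str, ...] = (".py", ".js"),  # assumed; adjust to your project
    max_chars: int = 12000,  # assumed cap to keep each file inside the context window
) -> str:
    """Summarise each source file under root and return one overview document."""
    sections = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        source = path.read_text(encoding="utf-8", errors="replace")[:max_chars]
        summary = ask("Describe in one paragraph what this file does.", source)
        sections.append(f"{path.relative_to(root).as_posix()}\n{summary.strip()}")
    return "\n\n".join(sections)
```

To use it, pass your project directory and ask_gemma, then write the returned string to a file such as OVERVIEW.md. For a large project, combine it with the write-to-disk habit from Pattern 1 so a crash does not lose finished summaries.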

Common mistakes

Trying to process files larger than the model's context window without splitting — use Pattern 4 (map-reduce) above.
Using a model variant too large for your hardware (causes swapping and extreme slowdown) — verify GPU offload with ollama ps.
Building complex multi-step prompts instead of simple pipelines — three focused steps beat one ambitious prompt every time.
Accumulating results in memory instead of writing each one to disk — one crash and the whole batch is gone.
Restarting the Ollama server between requests — it reloads the model from disk every time, costing 10-40 seconds per call.

Troubleshooting batch workflows

Five errors account for almost every failed overnight batch. Check them in this order before reaching for a bigger debug tool.

"connection refused" on port 11434. The Ollama server is not running. On Windows, open the Ollama app; on macOS/Linux run ollama serve in a spare terminal. Confirm with curl http://localhost:11434/api/tags — it should return JSON, not an error.
"model not found: gemma4". You pulled a different tag than the script expects. Run ollama list to see what is actually on disk, then either pull the tag the script uses (ollama pull gemma4:12b) or change MODEL in the helper to match what you have.
Batch speed collapses after the first few files. The model is being unloaded between requests. Set OLLAMA_KEEP_ALIVE=24h in your environment before launching Ollama, so the model stays resident in memory for the whole batch.
Random JSON parse failures on Pattern 3. Gemma 4 occasionally wraps its output in ```json … ``` fences. Strip both the backticks and the leading json tag before json.loads (stripping backticks alone leaves the tag behind), and tighten the prompt: "Return only the JSON object. Do not wrap in markdown."
GPU offload silently failed and the model is running on CPU. Check ollama ps — if the PROCESSOR column shows 100% CPU, the model did not fit in VRAM. Drop to gemma4:4b, or lower num_gpu layers via a custom Modelfile so part of the model runs on GPU and part on CPU.
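Two of the fixes above can also be applied per request instead of through the environment or a Modelfile. As far as I know, the generate endpoint accepts a keep_alive field and an options object that can carry num_gpu. The sketch below only builds the JSON body. The function name generate_payload is hypothetical, and any layer count you pass is something to tune for your own GPU.

```python
from typing import Optional


def generate_payload(
    prompt: str,
    model: str = "gemma4:12b",
    keep_alive: str = "24h",
    gpu_layers: Optional[int] = None,
) -> dict:
    """Build a JSON body for POST /api/generate with per-request tuning."""
    payload = {"model": model, "prompt": prompt, "stream": False, "keep_alive": keep_alive}
    if gpu_layers is not None:
        # num_gpu is the number of model layers to offload to the GPU.
        payload["options"] = {"num_gpu": gpu_layers}
    return payload
```

Send the resulting dict as the request body, then check ollama ps again to confirm the split between CPU and GPU looks the way you expect.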

FAQ

Which Gemma 4 variant should I use for batch file processing?

Gemma 4 12B is the best starting point on a 16GB machine — large enough to follow extraction instructions cleanly, small enough to finish a 100-file batch in reasonable time. Drop to 4B on 8GB machines, or step up to 27B if you have 32GB+ RAM and quality matters more than throughput.

How do I call Ollama from Python for a batch workflow?

Install the ollama Python package (pip install ollama) and call ollama.generate(model='gemma4:12b', prompt=...). That hits the local API on port 11434 and returns a response you can index like a dict (for example result['response']), with no external network round-trip and no API keys. For batch work, wrap the call in a loop over pathlib.Path.glob() and write each result to disk so a crash mid-batch does not lose finished work.

Why is my batch slow even on a powerful machine?

Three usual culprits: model reloading between requests (keep the Ollama server running and send sequential calls, and do not restart it), context window bloat (trim each file down to what the prompt actually needs), and CPU fallback (verify GPU offload with ollama ps: if the PROCESSOR column shows 100% CPU, the model did not fit in VRAM).

How do I split files that exceed the context window?

Chunk by semantic boundary, not by character count. For code, split per function or per class. For prose, split per H2 or per 10-paragraph block. Summarise each chunk with the same prompt, then run a second pass that summarises the summaries. Pattern 4 above implements exactly this map-reduce.
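As a sketch of what semantic splitting can look like, the snippet below splits Python source into top-level functions and classes with the standard ast module, and splits markdown prose at H2 headings. Both helper names are hypothetical, and other languages would need their own parser.

```python
import ast
import re


def split_python_source(source: str) -> list[str]:
    """Return the source text of each top-level function and class."""
    tree = ast.parse(source)
    # Note: the returned segments start at the def/class line, so decorators are not included.
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]


def split_markdown_by_h2(text: str) -> list[str]:
    """Split markdown prose into sections, each starting at an H2 heading."""
    parts = re.split(r"(?m)^(?=## )", text)
    return [p.strip() for p in parts if p.strip()]
```

Each returned chunk can go through the same summarise prompt. Anything before the first H2 comes back as its own leading section.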

Can I run a Gemma 4 batch overnight without supervision?

Yes, but add three guardrails: write each result to disk immediately (do not accumulate in memory), wrap each call in a try/except that logs the filename and continues, and keep a simple processed.txt ledger so a re-run skips files already done. Pair that with OLLAMA_KEEP_ALIVE=24h so the model stays warm overnight and you do not pay a reload cost on every request.
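Those three guardrails fit in one small function. The sketch below takes the model call as a parameter, writes each result straight to disk, logs failures and carries on, and appends finished filenames to a processed.txt ledger. The function name run_batch and the choice to key the ledger by filename alone are simplifying assumptions.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_batch(
    files: Iterable[Path],
    ask: Callable[[str], str],
    out_dir: Path,
    ledger: Path = Path("processed.txt"),
) -> list[str]:
    """Process files one by one; return the names of the files that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = set(ledger.read_text(encoding="utf-8").splitlines()) if ledger.exists() else set()
    failed = []
    for path in files:
        if path.name in done:
            continue  # already handled in an earlier run
        try:
            result = ask(path.read_text(encoding="utf-8"))
            (out_dir / f"{path.stem}.md").write_text(result, encoding="utf-8")
        except Exception as exc:  # log and keep going; one bad file must not stop the batch
            print(f"failed: {path.name}: {exc}")
            failed.append(path.name)
            continue
        # Record success only after the result is safely on disk.
        with ledger.open("a", encoding="utf-8") as fh:
            fh.write(path.name + "\n")
    return failed
```

A re-run with the same ledger skips finished files and retries only the ones that failed. If your input folders contain duplicate filenames, store relative paths in the ledger instead.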

When to use something else

To set up Gemma 4 locally, see running Gemma 4 on your own machine. For using Gemma 4 in your code editor, see Gemma 4 in VS Code.

