I recently purchased a new laptop: Lenovo LOQ 15ARP9

CPU: AMD Ryzen 7 7435HS
GPU: NVIDIA RTX 4060, 8GB VRAM
RAM: 24GB

My first thought was, “Finally, I can build something great.” But my primary intention was gaming, LOL.

Back when I was working as a software engineer in a corporate setup, my work machine was too limited to run local models effectively — even though Ollama was already gaining popularity at the time. I couldn’t integrate them into AI applications or n8n automation workflows the way I wanted.

I assumed local models would run and behave just like ChatGPT and Claude out of the box, automatically handling tasks such as web search, file execution, and so on.

Those assumptions were shattered when I started building an application around an LLM. Out of the box, a model is purely query-based: it only answers the questions you ask, and once the session ends, it remembers nothing.
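That statelessness can be sketched in a few lines of Python. Nothing here calls a real model: `fake_model` is a made-up stand-in for a chat endpoint (such as the one Ollama exposes), just to show that any "memory" is really the conversation history you resend on each turn.

```python
# A bare LLM is stateless: it sees only the messages you send it. "Memory"
# within a session exists only because the client resends the whole history.
# fake_model is a toy stand-in for a real chat endpoint.

def fake_model(messages):
    """Pretend model: answers based only on the messages it receives."""
    names = [m["content"].split()[-1] for m in messages
             if m["role"] == "user" and "name is" in m["content"]]
    if "what is my name" in messages[-1]["content"].lower():
        return f"Your name is {names[-1]}." if names else "I don't know."
    return "Nice to meet you!"

# Session 1: history accumulates, so the model appears to "remember".
history = [{"role": "user", "content": "Hi, my name is Arun"}]
history.append({"role": "assistant", "content": fake_model(history)})
history.append({"role": "user", "content": "What is my name?"})
print(fake_model(history))   # -> Your name is Arun.

# Session 2: fresh history, nothing carries over from session 1.
fresh = [{"role": "user", "content": "What is my name?"}]
print(fake_model(fresh))     # -> I don't know.
```

The name `Arun` and the stub's logic are invented for illustration; the point is only that the second "session" fails because the history was not resent.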

That’s when I understood that services like ChatGPT and Claude rely on access to tools to enhance their output. My initial idea was to build a simple RAG system that answers questions based on current-affairs data. I used the Groq API for the initial testing phase, but it offers a limited set of models, and I hit the rate limits far faster than expected. That’s when I decided to go local.
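The RAG idea itself is small enough to sketch. This toy version (all document snippets below are made-up placeholders) swaps embeddings for plain word overlap, but the shape is the same: retrieve the most relevant snippet, then stuff it into the prompt before asking the model.

```python
# Toy sketch of retrieval-augmented generation (RAG). A real system would
# use embeddings and a vector store; here, word overlap stands in for
# retrieval, and the documents are invented placeholders.

DOCS = [
    "The 2024 Olympics were hosted in Paris, France.",
    "Ollama lets you run large language models locally.",
    "The Groq API serves a limited set of hosted models.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query: str) -> str:
    """Prepend the retrieved context so the model answers from it."""
    context = retrieve(query, DOCS)
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Where were the 2024 Olympics hosted?"))
```

The resulting prompt would then be sent to whichever model you are using, local or hosted; only the retrieval step changes when you move to real embeddings.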

Ollama offered multiple options; after careful consideration, I downloaded gemma4:e4b, which supports both text and images. At the time I was working only with text-based responses, and the image support was purely future-proofing. What I didn’t realize then was that the model also supports a configurable thinking mode: it can reason through a problem step by step before answering.

I ran the model locally in the terminal and started asking questions. Thinking mode kicked in, and it started thinking… for 5 minutes. I was shocked, because I expected GPU computation to be much faster than that. Then I realized that, without the proper drivers, the GPU was not being detected on my machine, and the model was loading on the CPU instead.
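In hindsight, diagnosing this is straightforward: `nvidia-smi` shows whether the driver sees the GPU at all, and `ollama ps` reports in its PROCESSOR column whether a loaded model is running on CPU or GPU. A small sketch that reads that column from `ollama ps`-style output (the sample line below is invented for illustration):

```python
# The PROCESSOR column of `ollama ps` shows "100% CPU" when the model fell
# back to CPU (e.g. missing GPU drivers) and "100% GPU" when it is fully
# offloaded. SAMPLE is made-up output in that table format.

SAMPLE = """NAME          ID            SIZE    PROCESSOR    UNTIL
gemma4:e4b    abc123def456  7.5 GB  100% CPU     4 minutes from now"""

def gpu_offload(ps_output: str) -> dict[str, str]:
    """Map each loaded model name to its PROCESSOR field."""
    result = {}
    for line in ps_output.splitlines()[1:]:   # skip the header row
        parts = line.split()
        # columns: NAME, ID, SIZE (2 tokens), PROCESSOR (2 tokens), UNTIL
        result[parts[0]] = " ".join(parts[4:6])
    return result

print(gpu_offload(SAMPLE))   # -> {'gemma4:e4b': '100% CPU'}
```

In practice you would pipe the real command output in; seeing "100% CPU" next to a model you expected on the GPU is exactly the symptom I hit.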

Luckily, I had my companion, Claude Code: installing and loading the necessary GPU drivers was a matter of seconds. Then I tested the session again. The difference was night and day. The same query, thinking plus answering, took only 20 seconds.

Those 20 seconds changed my perspective on building with AI. The model was fast, it could reason, and it ran entirely on my machine. My gaming beast had become a research lab. Now the real question was: what to build with it?