Making Sense of AI

There’s a ton of competition in the AI space, as you might have already noticed. Every few months, a new model drops doing something wild. One day it’s a leap in capabilities, the next it’s a new hardware accelerator that makes inference blazing fast.

But here’s the thing: with each big announcement, the signal often gets drowned out by the noise, especially when you’re trying to understand what the tech actually does.

It reminds me of learning a new programming language. I usually pick a tool, stick with it, learn how to extract value from it—and only move on if I’ve done my homework and know why I’m switching. AI is no different. Instead of chasing every shiny model, I ask:

  1. What problem can I solve using AI?
  2. Can that solution scale long-term?

Also worth remembering: as soon as you commit to a tool or platform, you’re laying the groundwork for its future maintenance. It will become legacy eventually. If you’re using it in something mission-critical, make sure it’s a decision you’re willing to stand by. So it’s never a bad idea to stop and think carefully about the type of coupling you’re introducing.

So how do you even choose the right piece of tech? I’ve stopped thinking of it as a bet. These days, I just stick with what works well enough for me. That mostly means keeping an eye on how Microsoft is evolving its AI portfolio on Azure—maybe not the flashiest approach, but it’s been consistent in all the right ways. Still, I try not to lock myself into one ecosystem. I like to keep a pulse on what others are doing too.

✅ Disclaimer

I’m sharing this blog as a bit of a departure from my usual style because, honestly, trying to capture everything I’ve learned about AI in one go feels almost impossible. Instead, this is more of a snapshot of where I am in my learning journey. If it adds value, great. If not, well, there’s always a chance I’m just adding more noise to the mix.

Pre-ramble: my AI learning spiral

Over the last few years, I’ve found myself going through phases with AI. Like anything new, the start was rough: just a mess of possibilities and no clear way to begin. For every explanation I found, there was another that either pulled me deeper down the rabbit hole or made me question whether the source was credible at all. But I’ve learned not to fight that part; the confusion is just part of the process. The only way I’ve been able to move forward is by narrowing my focus and taking it one byte at a time. And honestly, that’s been enough.

I don’t think I’m alone in this spiral—maybe you’ve hit some of these too:

The “what even is this?” phase:

The “realizing there is more” phase:

  • Wait, “Prompt engineering” is a real job now?
  • People are talking about “foundation models,” whatever those are

    • They’re versioned dependencies in your AI workload, so they have a life cycle you need to keep in mind.
  • LLMs? Oh, large language models. Sure. That clears everything up.
  • And there are actual model hubs where people just post these?
  • Apparently a model goes through two main stages:
    • Training – where it learns how to do stuff by figuring out relationships, encoded as weight values in the model.
    • Inferencing – where it tries to apply that learning to (potentially unknown) real-world inputs, in real-time.
  • What are embeddings, and what exactly are they embedding?
  • Okay, vector databases store those. Makes sense.
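
In hindsight, those last two questions would have clicked much faster with a ten-line experiment. A rough sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model purely as an example: an embedding is just a vector of numbers, and a vector database is what stores and searches those vectors at scale.

from sentence_transformers import SentenceTransformer
import numpy as np

# Example model only; any embedding model turns text into fixed-length vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How do I renew my passport?", "Passport renewal steps", "Best pizza in town"]
vectors = model.encode(sentences)  # shape (3, 384) for this particular model

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A vector database does this at scale: store the vectors, return the nearest ones.
print(cosine(vectors[0], vectors[1]))  # related sentences score high
print(cosine(vectors[0], vectors[2]))  # unrelated sentences score lower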

The “Falling deeper into the rabbit hole” phase:

For context: I have zero formal background in Machine Learning or AI. I’m not a data scientist or mathematician. I’m a dev/ops/security/architect hybrid—generalist by nature, specialist when needed. Diving into AI meant facing how much I had to learn. Worse, I knew I’d need to learn even more just to figure out what else I didn’t know.

So I started wherever it made sense and I pivoted often. I aimed to grasp fundamentals but also to have fun. I played with different models and approaches, mostly around text and vision.

AI on Azure: A quick anecdote

Back in the day, I’d worked with Azure Cognitive Services before it evolved into Azure AI Services. I still remember this quote from one of the early blog posts:

📖 Quote:

Azure Cognitive Services brings artificial intelligence (AI) within reach of every developer without requiring machine learning expertise. All it takes is an API call to embed the ability to see, hear, speak, understand, and accelerate decision-making into your apps.

That line has stuck with me. And honestly, it still holds up—maybe even more now than when it was written in 2019.

You don’t need to go through the full AI learning spiral I described earlier to get value out of Azure’s AI offerings. Cognitive Services abstracts away most of the complexity, letting you hook your apps in with just a handful of API calls. Getting started on Azure has only gotten easier over time. (Though, yes, in early 2024 you still had to fill out a form just to get access—thankfully, that’s no longer the case.)

A while back, I helped a team build a proof-of-concept using a RAG (Retrieval Augmented Generation) pipeline to identify potential patent infringements in draft documents. The setup used Azure OpenAI (GPT-3.5 Turbo) for the language model, with Azure Databricks handling the pipeline logic; we provisioned the Azure OpenAI instance and made it available to our Databricks engineers. It didn’t take much to get things working—and that’s a credit to the folks at Microsoft who clearly put thought into making it all painless.

This is all it takes to provision an Azure OpenAI instance via Terraform:

resource "azurerm_resource_group" "example" {
  name     = "example-resources"
  location = "West Europe"
}

resource "azurerm_cognitive_account" "example" {
  name                = "example-ca"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  kind                = "OpenAI"
  sku_name            = "S0"
}

resource "azurerm_cognitive_deployment" "example" {
  name                 = "example-cd"
  cognitive_account_id = azurerm_cognitive_account.example.id

  model {
    format  = "OpenAI"
    name    = "gpt-35-turbo"
    version = "0125"
  }

  sku {
    name = "Standard"
  }
}
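
With those resources in place, calling the deployment from code is just as short. Here’s a rough sketch using the openai Python package’s AzureOpenAI client; the endpoint, key, and deployment name are placeholders matching the Terraform above, and in production you’d likely prefer Entra ID authentication over raw keys.

from openai import AzureOpenAI

# Placeholders only: use your own endpoint, credentials, and deployment name.
client = AzureOpenAI(
    azure_endpoint="https://example-ca.openai.azure.com",
    api_key="<your-api-key>",          # or switch to Entra ID token authentication
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="example-cd",  # the deployment name from the Terraform example
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain RAG in one sentence."},
    ],
)
print(response.choices[0].message.content)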

That really is all it takes: plug your code into the endpoint, as the sketch above shows, and you’re off to the races. This is pretty simple to get started with, and already adds a lot of value. But the best part is that you can also opt in to a bunch of enterprise-grade features:

If you place Azure API Management in front of your newly deployed Azure OpenAI instance, you’ve effectively built a “Generative AI gateway.” This proven architectural addition is ideal for scaling, as it enables features like policy enforcement, request/response transformation, metrics collection, and enhanced reliability — all without needing to modify your application.
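
In practice, that mostly means the application swaps its base URL and credential while the gateway enforces everything else. A rough sketch with the requests library; the gateway URL, API path, and the Ocp-Apim-Subscription-Key header all depend on how you imported and configured the API in API Management, so treat every value below as a placeholder.

import requests

# Placeholder values: your APIM gateway URL, API suffix, deployment, and key will differ.
gateway = "https://example-apim.azure-api.net/openai"
deployment = "example-cd"
url = f"{gateway}/deployments/{deployment}/chat/completions?api-version=2024-02-01"

response = requests.post(
    url,
    headers={"Ocp-Apim-Subscription-Key": "<apim-subscription-key>"},
    json={"messages": [{"role": "user", "content": "Hello through the gateway!"}]},
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

Rate limiting, caching, and metrics collection then happen inside the gateway, which is exactly why the application code stays untouched.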

💡 Note

Microsoft offers plenty of great architecture resources for AI/ML workloads, including a detailed guide on the Generative AI gateway, which you can explore here.

Self-hosting LLMs

While Azure’s OpenAI service offers a lot of great features and lets you get started quickly, there might be cases where you want to take a different route, like self-hosting your own LLM inferencing platform that behaves similarly to a PaaS offering.

There are valid reasons for either approach:

  • With Azure OpenAI, once the instance is provisioned, it’s basically hands-off. Depending on your usage patterns, it might be cheaper than spinning up your own VM and dedicating full-time employees (FTEs) to manage and maintain it.
  • But, as with any PaaS that runs on a consumption model, you’re exposed to unbounded consumption attacks—where an attacker might flood your endpoint just to rack up charges and drain your budget.
    • There are various mechanisms designed to prevent these kinds of attacks from causing widespread disruption, including rate limiting, tighter access controls, and caching.

I really want to highlight that this convenience should not be underestimated. Not having to maintain your own LLM inferencing platform is a huge operational win.

💡 Note

If you’re curious about the vulnerabilities involved in self-hosting LLMs, definitely check out the OWASP Generative AI Security Project. It’s a solid resource for anyone building, shaping, or securing generative AI systems.

Self-hosting an LLM inferencing platform is no small feat; it demands quite a bit of knowledge across a pretty broad spectrum, from infrastructure and security to model serving and day-to-day operations.

Luckily, Azure offers a lot of tools to support this kind of setup—whether you use Azure Kubernetes Service (AKS) Automatic with built-in features like node auto-upgrades and identity integration, or lean on services like Automatic Guest Patching for Azure Virtual Machines and Scale Sets, or Linux VMs with Entra ID + OpenSSH login. So, at least on the infrastructure side, Azure offers a range of features that can make the self-hosting journey not only more manageable but potentially even a great experience.

Now, I know it sounds like I’m nudging you toward always using Azure OpenAI, but I’m really not. I’m just pointing out how nuanced and complex it is to host LLMs in a production-ready, secure, and scalable way. If you and your team are confident in managing that infrastructure, then by all means… Go for it.
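
And if you do go for it, one nice property of popular inference engines is that many of them can expose an OpenAI-compatible HTTP API, so the client code barely changes between the managed service and your own platform. A rough sketch, assuming a self-hosted server (vLLM’s API server, for example) listening on localhost:8000 and a model name of your choosing:

from openai import OpenAI

# Assumes a self-hosted, OpenAI-compatible inference server is reachable here;
# the URL, the (dummy) key, and the model name are all placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # whatever model your server loaded
    messages=[{"role": "user", "content": "Who maintains you, and on whose hardware?"}],
)
print(response.choices[0].message.content)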

Hardware still shapes the outcome

Ever wonder what really happens when you send a prompt to an AI model? Imagine you’re running a containerized app in Kubernetes that calls a GPT service—what hardware wakes up, and in what order?

Let’s unpack a few things first:

  • GPUs aren’t the only game in town. They excel at massively parallel math, but they’re not the whole story.
  • CPUs can hold their own, especially with smaller models—performance varies by core count, memory, and model size.
  • Inference is a team sport. Even a GPU-driven pipeline will rely on the CPU for orchestration, batching, I/O, and pre/post-processing.
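
To make that “team sport” point concrete, here’s a toy PyTorch sketch (made-up shapes, not a real model) of how work bounces between the CPU and whatever accelerator is available:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-processing (tokenization, batching, I/O) typically happens on the CPU...
token_ids = torch.randint(0, 50_000, (1, 128))

# ...the heavy parallel math runs on whatever accelerator is available...
embedding_table = torch.randn(50_000, 512, device=device)
hidden = torch.nn.functional.embedding(token_ids.to(device), embedding_table)

# ...and post-processing usually lands back on the CPU.
summary = hidden.mean(dim=1).cpu()
print(f"Ran on {device}, output shape {tuple(summary.shape)}")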

Hardware is just another piece of the puzzle—your inference engine (vLLM, llama.cpp, etc.) decides how that hardware is used. Engines differ in:

  • Operator scheduling
  • Memory management
  • Quantization support
  • A bunch of other things that are probably pretty specific to the engine you’re working with.

Whether you’re running on a GPU or a CPU, there are a handful of things that will have a direct impact on how your model behaves—both in terms of output quality and responsiveness:

  • Model choice: Different models come with different architectures and trade-offs. A lightweight 7B model won’t behave the same as a 65B heavyweight. That shows up in both the output and the system resources you’ll need to keep it running smoothly.
  • Parameter count: More parameters typically give you better generalization and more coherent responses. But they also demand more memory, more compute, and more patience.
  • System prompt: Prompts matter—a lot. A well-thought-out system prompt, combined with relevant context, can seriously improve the relevance and quality of your model’s output. (Not strictly hardware related, I admit, but still important.)
  • Context window: Kind of similar to a process’s working memory. The size of the model’s context window limits how many input tokens (including prior conversation history, documents, etc.) you can feed into a prompt. Bigger windows allow for more complex queries and better continuity, but they also require more compute to process efficiently (a quick way to see how fast tokens add up is sketched after this list). The longer a conversation runs, the more likely the model is to lose track of earlier details.
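
On that context window point, it helps to actually count tokens rather than guess. A quick sketch using the tiktoken package with the cl100k_base encoding as an example; your model may well use a different tokenizer:

import tiktoken

# cl100k_base is one common encoding; pick whichever matches your model.
encoding = tiktoken.get_encoding("cl100k_base")

history = [
    "System: You are a patient assistant.",
    "User: Summarize our patent draft and flag any risky claims.",
]
tokens_used = sum(len(encoding.encode(message)) for message in history)
print(f"{tokens_used} tokens of the context window already spent on history")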

It goes without saying that all the other tunables your inferencing engine (vLLM, for example) exposes will also impact performance; a small sketch of what those knobs look like follows below. The trick is to find that sweet spot: enough performance to meet your needs, without blowing up your compute bill or overengineering the whole thing.
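
For instance, a few of the knobs vLLM exposes through its Python API look roughly like this. The parameter names come from vLLM’s LLM and SamplingParams classes, the model name is just an example, and whether each option is available depends on your vLLM version and hardware:

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model, swap in your own
    dtype="float16",              # numeric precision for weights and activations
    max_model_len=4096,           # context window the engine allocates memory for
    gpu_memory_utilization=0.85,  # fraction of VRAM vLLM is allowed to claim
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Why do bigger context windows cost more memory?"], params)
print(outputs[0].outputs[0].text)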

💡 Note

Wait, you mentioned CPUs. Shouldn’t we use GPUs for running AI workloads?

If your budget allows for it, sure! Look, if you’re aiming for near real-time text generation, then yes—you want an Azure Virtual Machine SKU with a supported NVIDIA GPU and enough VRAM to load both the model and its context. But while GPUs are excellent for performance, they’re not strictly required to host an LLM.

You can absolutely run smaller models on x64 or arm64 CPUs—just make sure your inferencing engine supports your use case.

  • x64 architecture: you’ll want a CPU that supports AVX instruction sets (Advanced Vector Extensions). AVX2 and AVX-512 are common in modern silicon and offer solid acceleration for inference workloads, with newer extensions like AVX10 and APX on the way.
  • arm64 architecture: check which SIMD or vector extensions the CPU supports. Most have NEON (128-bit fixed-width vectors), which works well for lightweight inference tasks—great for microcontrollers or low-intensity workloads. For heavier jobs, newer Armv9-A-based chips may support SVE2 and SME2, which offer scalable vector operations and noticeably better performance for larger models.
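
Not sure what a given VM actually supports? On Linux, a quick (and admittedly crude) look at /proc/cpuinfo shows which extensions the CPU advertises; x64 lists them under “flags”, arm64 under “Features” (where NEON appears as asimd):

from pathlib import Path

# Crude substring check against /proc/cpuinfo (Linux only, purely illustrative).
cpuinfo = Path("/proc/cpuinfo").read_text().lower()
for extension in ("avx", "avx2", "avx512f", "asimd", "sve", "sve2"):
    print(f"{extension:>8}: {'yes' if extension in cpuinfo else 'no'}")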

The Importance of Quantization

So we’ve established that you can absolutely run AI model inference on a CPU, without needing a GPU at all. For fast prototyping or lighter workloads, one approach is to use a quantized model. This is basically a version of the model that’s been compressed down to use smaller numerical types.

I remember asking myself early on: “just what even is quantization?” After a bit of digging, I found out it’s all about reducing the computational and memory overhead of a model by lowering the precision of the numbers used in its weights and activations.

Think of quantization like adjusting the graphics settings in a video game.

  • At Ultra, you get crystal-clear visuals—but your hardware might struggle.
    • At the very least, you end up setting aside a lot of compute resources for the game to use.
  • Dial it down to High, Medium, or Low, and the performance improves
    • But you lose a little (or a lot) of visual detail.

Same idea here: full precision gives you max accuracy, but it comes at a cost.

Most pre-trained models you find online store their weights in FP32 (32-bit floating point). But with quantization, you can convert those weights to more compact formats like FP16, INT8, or even INT4—and that unlocks a huge performance boost in many scenarios.
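
You don’t even need a model to see why that matters; the storage math alone tells the story. A tiny PyTorch sketch (note: this is a straight dtype cast to show sizes, not real quantization, which also involves scales and zero-points):

import torch

# One 4096 x 4096 weight matrix at different precisions. Storage size only.
weights = torch.randn(4096, 4096)  # FP32 by default
for dtype in (torch.float32, torch.float16, torch.int8):
    cast = weights.to(dtype)
    mib = cast.element_size() * cast.nelement() / 2**20
    print(f"{str(dtype):>15}: ~{mib:.0f} MiB")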

📖 Quote

PyTorch’s docs on numerical accuracy note that floating-point operations often don’t produce bitwise identical results—even if they’re mathematically equivalent. This holds across platforms, versions, and even between CPU and GPU. So don’t be surprised if your results vary slightly depending on where (and how) your code runs.

To put it simply:

  • Higher precision (FP32, FP64) → better accuracy, but slower and heavier on memory.
  • Lower precision (FP16, BF16, INT8, INT4) → faster and lighter, but potentially a slight hit on accuracy.

And that slight hit? Give it a test drive; you might find it completely acceptable—especially in real-time applications where speed matters more than perfect accuracy. The real challenge is tuning for the right trade-off based on your use case.
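
One cheap way to take that test drive is PyTorch’s dynamic quantization, which swaps Linear layers for INT8 versions at load time. A toy sketch with a stand-in model (not an LLM), just to show the shape of the experiment:

import torch
import torch.nn as nn

# A stand-in network; the point is the before/after comparison, not the architecture.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 512)
with torch.no_grad():
    drift = (model(x) - quantized(x)).abs().max().item()
print(f"Largest output drift after INT8 dynamic quantization: {drift:.5f}")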

Azure SKUs and Quantization

At first, I figured I could just pick any hardware—CPU or GPU—and it would support whatever data types I needed. But that assumption turned out to be wrong.

It reminded me of the old-school days of computing, where something like an int or a long might have a different bit width depending on the system architecture. You’d move code from one machine to another and suddenly things would behave differently—or break entirely.

Same deal here. You’ve got to make sure your hardware actually supports the quantization method your inference engine expects. Otherwise, the model either won’t run at all, or it’ll run poorly and waste your time in the process.

Take vLLM as an example: it doesn’t support INT8 quantized models on NVIDIA GPUs based on the Volta architecture—like the ones in Azure’s NCv3 or NDv2 series.
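
If you’re not sure where a given GPU stands, its CUDA compute capability is a quick first clue; Volta reports 7.0. A small sketch (the 7.5 threshold below simply mirrors the Volta example above; always check your inference engine’s documentation for the real support matrix):

import torch

# Volta (e.g. the V100s in NCv3/NDv2) reports compute capability 7.0.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
    if (major, minor) < (7, 5):
        print("Heads-up: some INT8 kernels may be unavailable on this architecture.")
else:
    print("No CUDA device visible; CPU-only inference from here.")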

💡 Note

Double-check that your hardware matches the quantization type required by the model — and confirm that the inference engine supports it. It might save you hours of head-scratching over why nothing’s working. 🙂

TL;DR: Thanks to quantization, you don’t necessarily need a monster GPU to run AI workloads. Depending on what you’re building, running models on a CPU can be just fine. But you’ve got to understand what your model expects—and what your VM is actually capable of delivering.

Hidden Stack Booster Modules

It’s not just the CPU or GPU that matters. There are performance optimizations happening across the entire stack—often invisible to end users because they’re neatly abstracted away. But they’re there, quietly doing heavy lifting.

If you’ve watched infrastructure deep-dives from Build or Ignite over the past few years, you might’ve noticed that Azure engineers aren’t just wiring things together—they’re designing their own silicon. Custom silicon like the Maia 100 AI accelerator and the Cobalt CPU is purpose-built for Azure’s cloud and AI workloads, with Maia aimed squarely at accelerating inference.

Why? Because while training is expensive, inference has the potential to dwarf that cost. Every time you run a model—whether it’s generating text, recognizing an image, or answering a query—there’s a cost, and not just in cloud credits. It’s measured in kilowatt-hours. Multiply that across billions of interactions, and suddenly optimizing inference efficiency becomes a major architectural priority.

That’s why these hidden stack boosters matter, too. Even if you don’t see them, they’re just another piece of the puzzle, quietly making sure things run as smoothly as they can.

Retro

Looking back on this little AI journey, I can honestly say: I’m still figuring it out. And I probably will be for a long time. That’s just how it goes with tech that moves this fast — by the time you think you’ve “caught up,” there’s a good chance the next wave is already rolling in. Boom — there goes another month’s worth of evenings, just figuring stuff out.

But here’s the surprising part: that’s totally fine. You don’t have to know everything. You don’t have to chase every new framework, every headline model, every hype cycle. The important part is staying curious and finding the bits that matter to you — the pieces that make you want to poke around, experiment, or just see what happens.

So yeah, I’d say I’m still spiraling — but it’s a spiral that’s moving forward, not just looping in circles. 😁 And honestly? That’s about as good as it needs to be.