Designing app features with an on-device LLM

20 June 2025

One of the really cool announcements from WWDC got overshadowed by everything else, but I think it’s a big deal and deserves attention. Apple released Foundation Models — a framework providing a local, on-device LLM. It’s available to all apps on iOS, iPadOS and macOS on devices running the latest developer betas with Apple Intelligence enabled.

The cool thing about any local LLM is that its usage is private, free, and unlimited, and there’s no network latency penalty because you don’t have to send data back and forth to a third party. Designing and developing with Apple’s Foundation Models also requires no additional setup beyond downloading Xcode — other models require something like Ollama to run locally, you have to decide which model to download, and then figure out how to make it available to your software.

All of this makes designing LLM-enabled features for Apple devices much more accessible. I wanted to explore what I could do with this tech on one of my personal projects.

I’m working on a link bookmarking app for iOS and macOS. I’ll show it to you some other time, but for now let’s just say I wanted to improve the search experience:

I already have basic search implemented. Typing in the search field brings up any bookmarks with matches in the title, the URL, or the user-added tags. But sometimes those fields don’t represent the contents of the page very well. Here’s an example.

A screenshot of an article titled Experimenting is above all a process on 205.tf

Title: Experimenting is above all a process
URL: https://www.205.tf/articles/experimenting-is-above-all-a-process

This is an article about typography, but you wouldn’t know it from looking at its title or the URL. Let’s say I didn’t add any tags, so searching for “typography” wouldn’t work. Well, not unless there was a way to search through a summary of topics covered in the article.

So I tested passing the webpage’s content to the model, and asking it to generate a list of keywords representing the content. Setting up the LLM to do this was easy, but it took me a while to find a set of instructions that resulted in sensible keywords. One of the issues I’ve seen a lot with LLM summarisation is focusing on irrelevant aspects of the text, and discarding useful detail. Steering the LLM to focus on creating as many unique tags as possible seems to have worked okay here.
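Roughly, the setup looks like this. This is a sketch using the Foundation Models API as shown at WWDC; the instruction wording here is illustrative, not the prompt I ended up with:

```swift
import FoundationModels

// Sketch of the keyword-generation step. The instruction text is
// illustrative: finding wording that produced sensible keywords took
// a lot more iteration than this.
func keywords(for pageText: String) async throws -> String {
    let session = LanguageModelSession(instructions: """
        You extract topics from web pages. Given the text of a page, \
        list as many unique, specific keywords as possible, separated \
        by spaces. Output only the keywords.
        """)
    let response = try await session.respond(to: pageText)
    return response.content
}
```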

A list of topics generated by the LLM for the article: typefaces experimentation designers typography fonts
typeface innovation geometric design variability digital tools graphic design modular fonts variable fonts letterforms
typographic classifications type foundry

This task takes the LLM about 7 seconds. The feature is a nice-to-have enhancement, so that’s acceptable: I can run the process in the background every time a new bookmark is added, without causing a visible delay for the user.

But I wondered whether there are more established Natural Language Processing techniques which could summarise text in this way. A quick search later I found an algorithm called TextRank, designed for exactly this task, with open source libraries readily available for iOS. I’ll try experimenting with it, mainly because the Foundation Models framework depends on the availability of Apple Intelligence, which limits the feature to more recent devices.
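For a sense of what TextRank does, here’s a minimal sketch in plain Swift. Real libraries add part-of-speech filtering, stop-word lists, and phrase merging; the core is just a word co-occurrence graph scored with PageRank:

```swift
import Foundation

// A minimal TextRank-style keyword scorer, for illustration only.
func textRankKeywords(_ text: String, window: Int = 4, top: Int = 5) -> [String] {
    let words = text.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { $0.count > 3 }   // crude stand-in for a stop-word list

    // Link words that co-occur within `window` tokens of each other.
    var neighbours: [String: Set<String>] = [:]
    for (i, word) in words.enumerated() {
        for j in (i + 1)..<min(i + window, words.count) {
            neighbours[word, default: []].insert(words[j])
            neighbours[words[j], default: []].insert(word)
        }
    }

    // PageRank iterations with the usual 0.85 damping factor.
    var score = Dictionary(uniqueKeysWithValues: neighbours.keys.map { ($0, 1.0) })
    for _ in 0..<30 {
        var next: [String: Double] = [:]
        for (word, links) in neighbours {
            let incoming = links.reduce(0.0) { sum, n in
                sum + (score[n] ?? 0) / Double(neighbours[n]?.count ?? 1)
            }
            next[word] = 0.15 + 0.85 * incoming
        }
        score = next
    }
    return score.sorted { $0.value > $1.value }.prefix(top).map(\.key)
}
```

Because it’s deterministic and runs in milliseconds, it also sidesteps the testing problems described later in this post.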

Testing out a way to apply search filters using natural language

I have a UI for filtering search results to the bookmarks saved today, in the last week, last month, or last year. I want to be able to type into the search field something about roughly when the bookmark was created—“two weeks ago typography article”— and for the app to select the closest time filter.

I set up the LLM to take in the search query, split it into the time description and the actual search term, and use the time description to select the closest time filter. I assumed this would be something the LLM would do really well, but I struggled to get correct results. Separating the query from the time description was the easy bit, but I couldn’t get the LLM to reliably pick the right filter from the ones I supplied.

A screenshot showing the results of ten attempts at matching the closest filter against the typed-in query

Out of the available filters (all time, yesterday, last week, last month, last year) I would pick “last month” here, but I struggled to get the LLM to do the same

If I wanted to make it work with this tech, I would try splitting this task into two steps:

  1. Separate search query and time descriptions
  2. Use the time description to pick one of the time-based filters
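Step 1 maps naturally onto the framework’s guided generation, where the model fills in a typed value rather than producing free text. A sketch, with illustrative property names and prompt wording:

```swift
import FoundationModels

// Sketch of step 1 using guided generation. The struct and its
// descriptions are illustrative, not my actual implementation.
@Generable
struct ParsedQuery {
    @Guide(description: "The search terms, with any time description removed")
    var searchTerm: String
    @Guide(description: "The time description, e.g. 'two weeks ago', or empty")
    var timeDescription: String
}

func split(_ query: String) async throws -> ParsedQuery {
    let session = LanguageModelSession(instructions: """
        Split the user's search query into the search term and the part \
        describing when the bookmark was saved.
        """)
    let response = try await session.respond(to: query, generating: ParsedQuery.self)
    return response.content
}
```

Step 2 could then run as a separate, much more constrained request that only sees the time description and the list of filters.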

I have a hunch that splitting things into smaller steps makes it easier to keep the LLM on track.

In the end I decided to use a different approach. On average the LLM took about 0.76 seconds to generate a result. Inside a chat UI that delay would be fine, but in a regular search UI it feels broken, even with a loading indicator. So the LLM was clearly not a great choice for this feature.

I decided instead to try parsing the natural language using a library designed specifically for this kind of task. I knew this was a solved problem, because Fantastical, a calendar app which came out in the early 2010s, had a feature for creating events using natural language, like “breakfast with Rob on Tuesday at 8”. And it was always quick.

I found a library called SwiftyChrono which does exactly this. It works really well, and I can select the right time-based search filter quickly and correctly every time. Done.
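The resulting approach looks roughly like this. The filter mapping is my own sketch; SwiftyChrono only supplies the parsed date, and the day thresholds here are illustrative:

```swift
import Foundation
import SwiftyChrono

// My app's time-based search filters (the enum is mine, not SwiftyChrono's).
enum TimeFilter { case allTime, yesterday, lastWeek, lastMonth, lastYear }

// Parse a date phrase out of the query, then snap it to the nearest filter.
func filter(for query: String, now: Date = Date()) -> TimeFilter {
    guard let date = Chrono().parseDate(text: query) else { return .allTime }
    let days = Calendar.current.dateComponents([.day], from: date, to: now).day ?? 0
    switch days {
    case ..<2:  return .yesterday
    case ..<8:  return .lastWeek
    case ..<32: return .lastMonth
    default:    return .lastYear
    }
}
```

With this, “two weeks ago” parses to a date roughly 14 days back, which snaps to “last month”, the answer I wanted from the LLM all along.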

Checking whether the LLM is producing the outputs I want is very hard

LLMs won’t always give the same output in response to the same input (unless you configure them to, which has other trade-offs), and if you’re designing something with the outputs you want to know what you’re working with.
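In the Foundation Models framework, that configuration lives in GenerationOptions. Greedy sampling makes the output deterministic for a given prompt, at the cost of variety; a sketch:

```swift
import FoundationModels

// Greedy sampling always picks the most likely next token, so the same
// prompt yields the same output. That is the trade-off mentioned above:
// repeatability instead of varied (and occasionally better) results.
let options = GenerationOptions(sampling: .greedy)
let session = LanguageModelSession()
// let response = try await session.respond(to: prompt, options: options)
```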

In particular you might want to know: can this prompt produce reliably good output with the inputs I am providing? What does it mean for it to be reliable? When is it good?

The only way I knew to answer those kinds of questions was to build a test app where I could run the same LLM task multiple times with the same prompt. I could then see if the results made sense. Even then, I would sometimes run the test once and the outputs looked good, only to run it again and get a much worse result.
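The core of that test app is essentially a loop. A sketch, creating a fresh session per run so earlier responses don’t linger in the context and skew later ones:

```swift
import FoundationModels

// Run the same prompt several times and collect the outputs for
// side-by-side comparison in the test UI.
func sampleOutputs(instructions: String, prompt: String,
                   runs: Int = 10) async throws -> [String] {
    var outputs: [String] = []
    for _ in 0..<runs {
        // Fresh session each time, so runs are independent.
        let session = LanguageModelSession(instructions: instructions)
        let response = try await session.respond(to: prompt)
        outputs.append(response.content)
    }
    return outputs
}
```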

An animation showing different results generated by my testing rig in response to the same prompt

A test tool I made to check what outputs I was getting

This is a new way of testing stuff out for me. With more typical programming tasks I can quickly interact with the interface I built, see how it behaves, confirm it does what I want, and then never again worry about it (until I break it later). With an LLM I might get a good output, but that doesn’t guarantee that the next time I run it I’ll get another good output. Some kind of testing rig is essential to being able to understand if I got my prompt right, and how the LLM behaves with different types of data.

It’s hard to understand the impact of every prompt change

Every small change to the prompt can have a big impact on the LLM’s behaviour, and there’s no real way to peek at what’s happening under the hood. Without a testing rig it would be impossible to tell if a tweak I made massively improved things, or whether I just chanced upon one decent result without changing the overall output quality very much.

Lots of very smart people have been working out techniques for coaxing LLMs to behave in specific ways, so by now there’s a lot of folk knowledge about prompting techniques for dealing with certain problems. But those techniques can sometimes be model-specific and not guaranteed to work. I don’t know if there’s a more structured way to craft a reliable prompt other than doing lots of experiments. I found keeping track of all of the experiments very complicated, and wasn’t always sure if my changes represented progress.

I am hoping for tools that will make it easier to design with real data

Having a local, private, and unlimited model means it’s a lot easier to explore enhancements that an LLM could bring to an app experience. In my opinion a lot of potential enhancements would simply not be worth the loss of privacy that would come with using hosted models—not to mention the other externalities.

But for this exploration potential to be realised I think there need to be easier ways to understand what the LLM can produce, not just once, but enough times to build confidence that a good result isn’t just a fluke. Obviously we’re in the early days of having local models available to your software, and I hope that tools will emerge for testing LLM outputs with real data so it’s easier to explore their design potential.