Siri, Use My Phone For Me

I'm not asking for Siri to gain the ability to generate a convincing image of a bullfrog wearing a three-piece suit as it has a picnic on a lotus leaf. I don't need it to hallucinate the life story of a new emperor as I converse with it about Ancient Rome. I don't need Siri to help me generate ten clickbait-y headlines for this piece. I like the one I came up with.

All I want is for Siri to be able to use my phone as well as I can. At least. I want to direct Siri to press a button, scroll a list, find a label—actions performed countless times a day as we all use our phones. It's amazing my phone's intelligent assistant can't do that too.

Before the AI hype cycle reached its current fever pitch, I would have continued to dream an empty dream, but evidence is mounting that Apple may deliver something like that. Mark Gurman, longtime Apple reporter and leaker, dropped the following nuggets a few days ago.

Apple Inc. is planning to overhaul its Siri virtual assistant with more advanced artificial intelligence, a move that will let users control individual app functions with their voice… The new system will allow Siri to take command of all the features within apps for the first time.

The new system will allow the assistant to control and navigate an iPhone or iPad with more precision. That includes being able to open individual documents, moving a note to another folder, sending or deleting an email, opening a particular publication in Apple News, emailing a web link, or even asking the device for a summary of an article.

The new system will go further, using AI to analyze what people are doing on their devices and automatically enable Siri-controlled features. It will be limited to Apple's own apps at the beginning, with the company planning to support hundreds of different commands.

Perfect.

As a rule I preach moderation with leaks, but I'm confident there's truth in this one. For a month and a half I've been meaning to write an article about a scientific paper Apple published on a technology called Ferret-UI. The working title in my notes is "Apple's Ferret-UI is the most exciting AI-related feature I've read about them working on," and that paper is why I believe Gurman's leak.

So, what is Ferret-UI?

Apple's Ferret-UI is able to understand what a user is looking at on screen and explain that screen back to them in plain language. Examples from the paper show Ferret-UI answering simple questions about UI elements as well as complex queries, like how to interact with what's on screen to achieve a specific goal.

[Screenshot from the Ferret-UI paper: an iPhone displaying the Reminders App Store page, flanked by example conversations in which the software answers questions and provides context based on what's on screen.]

The paper states that being able to interact with your screen in such a way "is also a valuable building block for accessibility, multi-step UI navigation, app testing, usability studies, and many others."

Additional examples in the paper show Ferret-UI helping users understand which elements on screen are interactive, which button to press to follow a podcast, where to navigate in an app to find requested information, and more.

[Screenshots from the Ferret-UI paper: the Podcasts app's New & Noteworthy screen, an episode list for Apple Events in Podcasts, the Accessories section of the Apple Store app, an open share sheet, and the New Event screen in Calendar, each accompanied by questions and answers about what is displayed.]
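
To make the shape of that interaction concrete, here's a purely hypothetical Swift sketch of what an interface to a screen-understanding model like this might look like. None of these types are real Apple APIs; it's just an illustration of the loop the paper describes: hand the model a screenshot and a question, get a plain-language answer back, optionally with the on-screen regions it's talking about.

```swift
import UIKit

// Hypothetical types — not real Apple APIs — sketching the screenshot-in,
// plain-language-answer-out loop the Ferret-UI paper describes.
struct ScreenQuery {
    let screenshot: UIImage   // what the user is currently looking at
    let question: String      // e.g. "How do I follow this podcast?"
}

struct ScreenAnswer {
    let text: String                  // e.g. "Tap the '+ Follow' button near the top."
    let referencedElements: [CGRect]  // regions of the screenshot the answer refers to
}

protocol ScreenUnderstanding {
    func answer(_ query: ScreenQuery) async throws -> ScreenAnswer
}

// Usage, assuming some model conforms to ScreenUnderstanding:
func howDoI(_ question: String, on screenshot: UIImage,
            using model: ScreenUnderstanding) async throws -> String {
    let answer = try await model.answer(ScreenQuery(screenshot: screenshot, question: question))
    return answer.text
}
```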

Why go through the effort of teaching software to understand user interfaces the way humans do without using it to improve how we interact with those interfaces? User experience is one of Apple's core value propositions. You can disagree with how they execute against that, but granting their devices the ability to be more accessible and functional couldn't be any more in line with that ethos.

The use cases are endless. At the surface are the examples I opened with, which would make your phone useful in moments where you can't touch it. Related to that, imagine how much more useful CarPlay could become if using your phone hands-free wasn't confined to a handful of strict incantations. How about asking your phone how to complete a task you've never done before? If you're a family or friend group's go-to tech support person, you, too, might be facing a layoff soon. And for folks who aren't as physically able to use their devices, this technology could enable them in ways current accessibility features can't. I won't even get into how useful this would be as part of an automated testing suite for developers using Xcode.
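
Okay, I'll get into it a little. Below is a rough, hypothetical sketch of what that testing story could look like: a standard XCTest UI test where a Ferret-UI-style model (stubbed out here, since no such API exists today) resolves a plain-language instruction into the accessibility identifier of the element to tap. The XCUIApplication calls are real; the ScreenAssistant type is entirely made up.

```swift
import XCTest

// Hypothetical stand-in for a Ferret-UI-style model. A real implementation
// would send the screenshot and instruction to an on-device model and return
// the identifier of the matching element; this stub just hard-codes one so
// the sketch compiles.
enum ScreenAssistant {
    static func resolve(instruction: String, from screenshot: Data) -> String {
        return "Follow"
    }
}

final class HandsFreeNavigationTests: XCTestCase {
    func testFollowPodcastFromPlainLanguage() throws {
        let app = XCUIApplication()
        app.launch()

        // Ask the (hypothetical) model which element satisfies the instruction.
        let identifier = ScreenAssistant.resolve(
            instruction: "Follow this podcast",
            from: XCUIScreen.main.screenshot().pngRepresentation
        )

        // Drive the UI with ordinary XCUITest calls.
        let button = app.buttons[identifier]
        XCTAssertTrue(button.waitForExistence(timeout: 5))
        button.tap()
    }
}
```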

ChatGPT and its competitors are already impressive pieces of technology. Despite their ability to understand what we say to them, they are still rather limited. They can deliver useful information, but they leave the responsibility of what to do with it on us. While Rabbit's R1 is struggling to deliver on the Large Action Model it promised, the core concept is the kind of transformative relationship with our devices that the term "AI" sparks in our imagination.

Smart tech should be able to do things so we don't have to. It should make hard things easier. It should give us time back. If the future of Siri is one that can deliver that, then the excitement is rational; it means a new way to interact with our devices that we haven't been able to take proper advantage of. We were promised it already; Siri was unveiled 13 years ago. Perhaps that's how long technology needed to catch up to that vision.

I can't wait. I just want to be able to tell Siri the specific dumb YouTube video to play next while I'm busy chopping ingredients in the kitchen on meal prep day.