Case Study

Privacy-First Wearable AI: On-Device Voice Capture

By Nic Fouhy · 15 min read

Every wearable AI product on the market right now ships your voice to someone else's server. Limitless, Rewind, Plaud, Bee. They record your conversations, upload the audio to a cloud endpoint, and run inference on hardware you will never see, in a jurisdiction you did not choose. The pitch is always compelling: ambient capture, perfect memory, searchable transcripts. The trade-off is always the same: your most private data, streamed to infrastructure you do not control.

We wanted to know whether that trade-off is actually necessary. This is not a client project or a product launch. It is a research exploration into whether viable, useful voice capture can happen entirely on-device, with no cloud dependency, no app lock-in, and no data leaving the hardware in your pocket. We investigated the hardware platforms, on-device speech models, and architectural decisions required to build a privacy-first wearable AI capture device in New Zealand.

Why should New Zealand professionals worry about cloud-based AI wearables?

Privacy-first wearable AI matters because current cloud-dependent devices transmit intimate personal conversations, meeting content, and ambient audio to offshore servers where data sovereignty, retention policies, and access controls are governed by foreign jurisdictions rather than the user. For New Zealand professionals handling sensitive client information, this creates compliance risk and a fundamental loss of control.

The appeal of wearable AI capture is real. A device that passively records your day, transcribes conversations, extracts action items, and builds a searchable archive of everything you said and heard. For professionals juggling client meetings, site visits, phone calls, and the thousand small decisions that fill a working day, the promise of perfect recall is genuinely valuable. Lawyers. Property managers. Consultants. Tradespeople quoting on-site. Anyone whose work involves spoken information that needs to become written records.

The problem is not the concept. The problem is the implementation.

Every commercial wearable AI device we examined follows the same architecture: record locally, upload to cloud, process remotely, return results. The audio leaves your device within seconds or minutes. Once it does, you are trusting the vendor's privacy policy, their security practices, their data retention decisions, and the legal framework of whatever country their servers sit in. For most of these products, that means US jurisdiction and US data law.

For a New Zealand business owner, this is not an abstract concern. If you are a property manager recording tenant interactions, a lawyer capturing client instructions, or a consultant documenting discovery sessions, that audio contains information governed by the Privacy Act 2020. Sending it to a US cloud endpoint creates a cross-border data transfer that most small business owners have not thought through, let alone obtained consent for.

Then there is the lock-in problem. Your transcripts, your summaries, your searchable memory archive: all of it lives inside the vendor's ecosystem. If they raise prices, change terms, get acquired, or shut down, your data goes with them. You built a dependency on a service that owns the most personal dataset you have ever created.

We see this pattern across our consulting work: businesses adopting tools without understanding the data architecture underneath. The convenience is immediate. The risk compounds silently. By the time someone asks "where does my data actually go?", the answer is already uncomfortable.

The weekend cost here is different from our trades automation projects. It is not about hours lost to paperwork. It is about a creeping erosion of privacy that accelerates every time you clip on a device and forget it is streaming your life to a server farm in Virginia.

How does EmbedAI approach fully offline voice AI on a wearable device?

EmbedAI's research into on-device wearable AI focused on three technical pillars: selecting hardware platforms capable of running speech models locally, evaluating on-device speech-to-text models that operate without network connectivity, and designing a processing architecture where voice data never leaves the physical device unless the user explicitly exports it.

We worked backward from a single design constraint: the audio never leaves the device. Not temporarily, not for processing, not for backup. If the device cannot transcribe, summarise, and organise voice data entirely on its own silicon, the architecture fails. This constraint eliminated every cloud-first approach immediately and forced us into the edge AI processing space, where the hardware is the entire compute environment.

What hardware platforms support on-device speech processing?

ESP32 and nRF microcontroller families emerged as the two most viable hardware platforms for a privacy-first wearable AI capture device, offering sufficient processing power for on-device speech-to-text inference while maintaining the low power consumption and compact form factor required for all-day wearable use.

We evaluated hardware across three axes: processing capability, power budget, and form factor. A wearable capture device needs to run for a full working day on a single charge, fit comfortably in a pocket or clip to clothing, and have enough compute to run speech models in real time.

The ESP32 family, particularly the ESP32-S3, offers dual-core processing at 240 MHz with vector instructions that accelerate neural network inference. It is cheap, well-documented, and has a mature development ecosystem. The power draw is manageable for battery-powered applications, and there are existing reference designs for audio capture with onboard MEMS microphones.

The nRF5340 from Nordic Semiconductor takes a different approach: dual-core ARM Cortex-M33 with a dedicated application processor and a network processor. The network processor is irrelevant for our offline use case, but the application core is powerful enough for lightweight inference tasks. Nordic's platform excels at ultra-low-power operation, which directly translates to battery life.
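The battery-life comparison between the two platforms comes down to simple arithmetic: cell capacity divided by average draw. A back-of-envelope sketch, where the 500 mAh cell and the average-draw profiles are illustrative assumptions rather than measurements from our evaluation:

```python
# Back-of-envelope wearable runtime: capacity (mAh) / average draw (mA),
# with a derating factor for conversion losses and cell ageing.
# All figures below are illustrative assumptions, not measured values.

def runtime_hours(capacity_mah: float, avg_draw_ma: float, derating: float = 0.8) -> float:
    """Estimated runtime on a single charge."""
    return capacity_mah * derating / avg_draw_ma

# A 500 mAh pouch cell against two assumed average-draw profiles:
esp32_continuous = runtime_hours(500, 80)   # continuous capture + inference
nrf_duty_cycled = runtime_hours(500, 15)    # aggressive sleep between utterances

print(f"ESP32-S3, continuous: {esp32_continuous:.1f} h")
print(f"nRF5340, duty-cycled: {nrf_duty_cycled:.1f} h")
```

The shape of the result, not the exact numbers, is the point: continuous inference struggles to cover a working day on a small cell, while duty-cycled operation on an ultra-low-power part clears it comfortably.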

Neither platform is a phone. Neither has a GPU. The compute budget is measured in milliwatts, not watts. This constraint shaped every subsequent decision about model selection and processing architecture.

Which on-device speech models work without cloud connectivity?

Sherpa ONNX emerged as the most promising framework for on-device speech-to-text on microcontroller hardware, providing optimised inference for small speech recognition models that run entirely offline without network access, cloud API calls, or external dependencies.

We evaluated on-device speech models against four criteria: model size (must fit in flash memory), inference speed (must keep up with real-time speech), accuracy (must produce usable transcripts), and framework maturity (must have a viable path to deployment on embedded hardware).

Sherpa ONNX, from the Next-gen Kaldi project, stood out for several reasons. It provides pre-trained speech recognition models in multiple sizes, from tiny models suitable for microcontrollers to larger models for more capable hardware. The ONNX runtime handles inference optimisation across different hardware targets. And critically, the entire pipeline runs offline. No network calls. No API keys. No telemetry.

The models we tested ranged from streaming models that transcribe speech in near real-time to non-streaming models that process complete utterances for higher accuracy. For a wearable capture device, the streaming approach is more natural. You speak, and text appears. There is no "upload and wait" step because there is nothing to upload.
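The difference between the two decoding modes can be sketched with a toy stand-in. The "recognizer" here just joins pre-split words; a real engine such as Sherpa ONNX would consume audio frames, but the latency trade-off is the same:

```python
# Toy contrast: streaming decoding emits a growing partial transcript after
# every chunk; batch decoding waits for the whole utterance and decodes once.
# The word lists stand in for audio chunks purely for illustration.

from typing import Iterator, List

def streaming_decode(chunks: List[str]) -> Iterator[str]:
    """Emit a partial transcript as each chunk arrives (low latency)."""
    partial: List[str] = []
    for chunk in chunks:
        partial.append(chunk)       # decode this chunk immediately
        yield " ".join(partial)     # caller sees text while you are still talking

def batch_decode(chunks: List[str]) -> str:
    """Decode the complete utterance in one pass (higher accuracy, more latency)."""
    return " ".join(chunks)

chunks = ["send", "the", "quote", "today"]
for partial in streaming_decode(chunks):
    print(partial)
print(batch_decode(chunks))
```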

Accuracy is the honest trade-off. A Sherpa ONNX model running on an ESP32 will not match the transcription quality of OpenAI's Whisper running on cloud GPUs. New Zealand accents, industry-specific terminology, and noisy environments all challenge smaller models more than larger ones. But the question we were testing was whether the output is useful, not whether it is perfect. A transcript that captures 85% of words accurately is still a searchable, reviewable record of a conversation. It is infinitely more useful than a perfect transcript that lives on someone else's server.
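Figures like "85% of words" are conventionally reported as 1 minus the word error rate (WER), where WER counts substitutions, deletions, and insertions against a reference transcript. A minimal dynamic-programming sketch of that calculation (the example sentences are invented):

```python
# Word accuracy = 1 - WER. WER is word-level edit distance divided by the
# number of reference words, computed here with standard dynamic programming.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

ref = "send the revised quote to the client before friday"
hyp = "send the revise quote to client before friday"   # one sub, one deletion
accuracy = 1 - word_error_rate(ref, hyp)
print(f"word accuracy: {accuracy:.0%}")
```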

How does the processing architecture keep data on-device?

The on-device processing architecture treats the wearable as a self-contained compute unit where audio capture, speech-to-text inference, and structured data storage all happen on the same hardware, with data export controlled entirely by the user through a physical connection or local wireless transfer.

The architecture has three layers, each running on the device itself.

Capture layer. A MEMS microphone feeds audio into a circular buffer. The buffer holds the most recent audio segment, overwriting older data continuously. This means the device is not accumulating hours of raw audio in storage. It processes as it captures.
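The capture buffer above is a standard ring buffer: bounded storage where the newest samples always overwrite the oldest. A minimal sketch in Python for clarity; on the device itself this would be a fixed array in RAM fed by the microphone's DMA interrupt:

```python
# Minimal ring (circular) buffer for audio samples. Storage stays bounded no
# matter how long the device runs, because new writes wrap around and
# overwrite the oldest data.

class RingBuffer:
    def __init__(self, capacity: int):
        self.buf = [0] * capacity
        self.capacity = capacity
        self.write_pos = 0   # next slot to overwrite
        self.filled = 0      # how many valid samples we hold (<= capacity)

    def write(self, samples):
        for s in samples:
            self.buf[self.write_pos] = s
            self.write_pos = (self.write_pos + 1) % self.capacity
            self.filled = min(self.filled + 1, self.capacity)

    def snapshot(self):
        """Return buffered samples oldest-first (what the decoder consumes)."""
        if self.filled < self.capacity:
            return self.buf[:self.filled]
        return self.buf[self.write_pos:] + self.buf[:self.write_pos]

rb = RingBuffer(capacity=4)
rb.write([1, 2, 3, 4, 5, 6])   # samples 5 and 6 overwrite 1 and 2
print(rb.snapshot())
```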

Inference layer. The Sherpa ONNX runtime consumes audio from the buffer and produces text transcripts. On ESP32 hardware, this runs on one core while the other handles audio capture and system management. The dual-core architecture of the ESP32-S3 was a key factor in platform selection: it allows capture and inference to run in parallel without blocking each other.
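The core split can be sketched as a producer-consumer pair. Python threads and a queue stand in here for what would be FreeRTOS tasks pinned to the two cores on the ESP32-S3 (the exact firmware task layout is an assumption for illustration, not a detail from the research):

```python
# Dual-core split as producer/consumer: one "core" captures audio chunks,
# the other runs inference, decoupled by a bounded queue so neither blocks
# the other. The fake decoder below stands in for the speech model.

import queue
import threading

audio_q = queue.Queue(maxsize=8)   # bounded, like a shared ring of buffers
transcripts = []

def capture_task(chunks):
    for chunk in chunks:
        audio_q.put(chunk)   # blocks only if inference falls badly behind
    audio_q.put(None)        # sentinel: capture finished

def inference_task():
    while True:
        chunk = audio_q.get()
        if chunk is None:
            break
        transcripts.append(f"decoded {len(chunk)} samples")  # model stand-in

fake_chunks = [[0] * 160 for _ in range(3)]   # three 10 ms chunks at 16 kHz
t1 = threading.Thread(target=capture_task, args=(fake_chunks,))
t2 = threading.Thread(target=inference_task)
t1.start(); t2.start(); t1.join(); t2.join()
print(transcripts)
```

The bounded queue is the important design choice: if inference ever falls behind, capture back-pressures rather than exhausting memory.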

Storage layer. Transcripts are written to local flash storage or an SD card in plain text or structured JSON. The user owns this storage physically. They can read it, delete it, export it via USB, or sync it to their own computer over a local connection. At no point does the device initiate a network connection to transmit data elsewhere.
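The structured-JSON option can be as simple as an append-only JSON Lines file: one self-describing record per utterance, readable in any text editor and trivially greppable. A sketch (the field names and path are illustrative; on the device the destination would be flash or the SD card mount):

```python
# Append-only JSON Lines transcript log: one record per utterance, written
# to user-owned local storage. No network, no sync, no remote copy.

import json
import pathlib
import tempfile
import time

def append_transcript(path: pathlib.Path, text: str) -> None:
    record = {
        "ts": int(time.time()),   # capture timestamp from the device clock
        "text": text,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")   # one line per utterance

log = pathlib.Path(tempfile.mkdtemp()) / "transcripts.jsonl"
append_transcript(log, "send the revised quote to the client")
append_transcript(log, "book the site visit for tuesday")

records = [json.loads(line) for line in log.read_text().splitlines()]
print([r["text"] for r in records])
```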

This architecture means the device works in a workshop, on a building site, in a basement with no cell coverage, or on a rural property where the nearest cell tower is a suggestion rather than a guarantee. For New Zealand businesses outside the main centres, offline capability is not a luxury feature. It is a basic requirement.

What did the proof-of-concept demonstrate about offline voice capture?

The proof-of-concept demonstrated that viable, useful voice capture and transcription can run entirely on microcontroller hardware without cloud connectivity, producing searchable text records of spoken content while keeping all data physically on the device under the user's sole control.

The research validated the core thesis: you do not need a cloud to build a useful voice capture device. The trade-offs are real. Accuracy is lower than cloud alternatives. Processing power limits the sophistication of post-transcription analysis. The user interface is necessarily simpler when you cannot lean on a cloud-hosted app with unlimited compute behind it.

But the thing that matters most works. Speech goes in. Text comes out. The text stays on the device. Nobody else gets a copy.

The research also surfaced insights that inform our broader AI consulting practice. On-device AI is advancing rapidly. Models that required server hardware two years ago now run on chips that cost less than a coffee. The gap between cloud and edge capability is closing, not because edge devices are getting dramatically more powerful, but because model architectures are getting dramatically more efficient. Quantisation techniques, knowledge distillation, and purpose-built inference runtimes like Sherpa ONNX are making useful AI possible on hardware that fits in your pocket.
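Why quantisation moves the needle on flash-limited hardware is plain arithmetic: dropping from 32-bit floats to 8-bit integers shrinks a model roughly fourfold. A sketch with an assumed 20-million-parameter speech model, chosen purely for scale:

```python
# Model footprint by weight precision. The parameter count is an assumed
# figure for illustration, not a measurement of any specific model we tested.

PARAMS = 20_000_000  # assumed small speech model

def model_size_mb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1_000_000

print(f"float32: {model_size_mb(4):.0f} MB")   # unquantised weights
print(f"int8:    {model_size_mb(1):.0f} MB")   # 8-bit quantised, ~4x smaller
```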

For New Zealand specifically, the offline angle matters more than it does in markets with ubiquitous connectivity. Rural properties, construction sites, workshops, and any environment where Wi-Fi is not a given. We encounter this reality constantly in our trades and SME work. Solutions that depend on a stable internet connection are solutions that fail in the places where New Zealand businesses actually operate.

This remains a research project. We have not productised it. We have not built a consumer device. What we have is validated evidence that the privacy-first approach to wearable AI is technically viable on current hardware, and that the accuracy trade-off is narrowing with each generation of on-device models. The next step, when we take it, will be informed by which specific NZ use case demands it most urgently.

The wearable AI market is moving fast. Limitless raised significant funding. Plaud sells hardware at volume. The assumption baked into every one of these products is that cloud processing is necessary. Our research suggests it is convenient, not necessary. And for a growing number of professionals who handle sensitive information daily, convenience is not worth the privacy cost.

What is the technical stack for on-device wearable AI?

ESP32-S3 — Primary microcontroller platform evaluated for on-device voice capture. Dual-core 240 MHz processor with vector instructions for neural network inference, MEMS microphone integration, and battery-powered operation.

nRF5340 — Alternative ultra-low-power microcontroller platform from Nordic Semiconductor. Dual-core ARM Cortex-M33 evaluated for extended battery life applications where power budget is the primary constraint.

Sherpa ONNX — On-device speech-to-text inference framework from the Next-gen Kaldi project. Provides optimised models in multiple sizes for offline transcription without cloud dependency or network access.

ONNX Runtime — Cross-platform inference engine that optimises model execution across different hardware targets. Handles the translation between trained speech models and the specific instruction sets of microcontroller processors.

MEMS Microphone — Micro-electro-mechanical system microphone for audio capture. Selected for low power consumption, small form factor, and direct digital (PDM/I2S) output that connects to the microcontroller's audio peripherals without a separate ADC stage.

Local Flash / SD Storage — On-device storage for transcripts and structured data. User-owned, physically removable, and accessible only through direct connection. No cloud sync, no remote access.

FAQ

Can wearable AI devices work without an internet connection in New Zealand?

Yes. On-device speech-to-text models like Sherpa ONNX run entirely on local hardware without network connectivity. This is particularly relevant for New Zealand professionals working on rural properties, construction sites, or any environment where reliable internet is not available. The accuracy is lower than cloud alternatives but improving rapidly with each model generation.

What are the privacy risks of cloud-based AI wearable devices?

Cloud-based wearables transmit recorded audio to remote servers for processing, creating cross-border data transfers that may conflict with New Zealand's Privacy Act 2020. Users lose control over data retention, access policies, and jurisdiction. If the vendor is acquired, changes terms, or suffers a breach, your most personal recordings are exposed. On-device processing eliminates these risks entirely.

How accurate is offline speech-to-text compared to cloud AI transcription?

On-device models running on microcontroller hardware typically achieve 80-90% word accuracy depending on conditions, compared to 95%+ for cloud services like OpenAI Whisper. The gap is narrowing as model compression techniques improve. For most use cases, an 85% accurate transcript stored securely on your own device is more valuable than a perfect transcript stored on someone else's server.

Does EmbedAI sell a privacy-first wearable AI device?

Not currently. This was a research and exploration project investigating the technical viability of fully offline voice capture on wearable hardware. The research validated that the approach works with current technology. We have not productised it yet but the findings inform our broader AI consulting practice and our understanding of edge AI capabilities for New Zealand businesses.

What is edge AI processing and why does it matter for New Zealand businesses?

Edge AI processing means running artificial intelligence models directly on local hardware rather than sending data to cloud servers. For New Zealand businesses, this matters because many operate in environments with limited connectivity, handle sensitive client data governed by the Privacy Act, or simply prefer to keep their operational data under their own control. Edge AI eliminates cloud dependency, latency, and ongoing API costs.

Want a result like this for your business?

Describe your process. I'll tell you where AI fits and where it doesn't.
