Apple researchers built an LLM that taught itself to produce good user interface code - despite having almost no Swift examples to learn from


  • Apple started with almost no Swift examples and achieved surprising results
  • StarChat-Beta had to generate SwiftUI code with almost no Swift in its training data to guide it
  • Nearly one million working SwiftUI programs emerged after repeated iterations

Apple researchers recently revealed an experiment in which an AI model was trained to generate user interface code in SwiftUI, even though almost no SwiftUI examples were present in the original data.

The study began with StarChat-Beta, an open source model designed for coding. Its training sources, including The Stack and other collections, contained almost no Swift code.

This absence meant the model had no existing examples to guide its output, which makes the stronger system that eventually emerged all the more surprising.

Creating a loop of self-improvement

The team’s solution was to create a feedback cycle. They gave StarChat-Beta a set of interface descriptions and asked it to generate SwiftUI programs from those prompts.
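To make the task concrete, a "SwiftUI program" here is a small, self-contained view. The example below is a hypothetical illustration (not taken from the study) of what a description like "a login screen with a username field, a password field, and a sign-in button" might yield:

```swift
import SwiftUI

// Hypothetical illustration of the kind of program the model was asked to produce
// from a natural-language interface description. Not an actual sample from the study.
struct LoginView: View {
    @State private var username = ""
    @State private var password = ""

    var body: some View {
        VStack(spacing: 16) {
            TextField("Username", text: $username)
                .textFieldStyle(.roundedBorder)
            SecureField("Password", text: $password)
                .textFieldStyle(.roundedBorder)
            Button("Sign In") {
                // The sign-in action would go here.
            }
            .buttonStyle(.borderedProminent)
        }
        .padding()
    }
}
```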

Each generated program was compiled to ensure it actually ran. Interfaces that worked were then compared with the original descriptions using another model, GPT-4V, which judged whether the output matched the request.

Only those that passed both stages remained in the dataset. This cycle was repeated five times, and with every round, the cleaner dataset was fed back into the next model.
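In outline, the pipeline amounts to a filter-and-retrain loop. The Swift sketch below is a loose conceptual illustration, not Apple's actual implementation: generateProgram, compiles, and matchesDescription are hypothetical stand-ins for the model's sampling step, the compiler check, and the GPT-4V comparison against the prompt.

```swift
// Conceptual sketch of the generate -> compile -> judge -> retrain cycle described above.
// The three helpers are placeholders, not real APIs from the study.

func generateProgram(from description: String) -> String {
    // Placeholder: in the study, the current model samples a SwiftUI program here.
    return """
    import SwiftUI
    struct ContentView: View {
        var body: some View { Text("\(description)") }
    }
    """
}

func compiles(_ source: String) -> Bool {
    // Placeholder: the real pipeline compiles each program and discards any that fail to build.
    return !source.isEmpty
}

func matchesDescription(_ source: String, _ description: String) -> Bool {
    // Placeholder: the real pipeline asks a vision-language model (GPT-4V)
    // whether the rendered interface matches the original description.
    return true
}

let descriptions = [
    "A login form with two text fields and a sign-in button",
    "A settings screen with a list of toggles"
]

var dataset: [(description: String, program: String)] = []

// Five rounds of generate, filter, and retrain, as described in the article.
for round in 1...5 {
    dataset.removeAll()
    for description in descriptions {
        let program = generateProgram(from: description)
        // Keep only samples that both compile and match the request.
        if compiles(program) && matchesDescription(program, description) {
            dataset.append((description, program))
        }
    }
    // In the study, the model is fine-tuned on the surviving samples before the next
    // round, so each iteration generates from a slightly better model.
    print("Round \(round): kept \(dataset.count) samples")
}
```

Because only samples that pass both filters survive, the training data gets cleaner with every round without any human curation.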

By the end of the process, the researchers had nearly one million working SwiftUI samples and a model they called UICoder.

The model was then evaluated with both automated tests and human reviewers, and the results showed it not only outperformed its base model but also achieved a higher compilation success rate than GPT-4.

One of the striking aspects of the study is that Swift code had been almost entirely excluded from the initial training data.

According to the team, this exclusion happened by accident when The Stack dataset was assembled, leaving only scattered examples scraped from web pages.

This oversight rules out the idea that UICoder merely recycled code it had already seen - instead, its improvement came from the iterative cycle of generating, filtering, and retraining on its own outputs.

While the results centered on SwiftUI, the researchers suggested the approach “would likely generalize to other languages and UI toolkits.”

If so, this could open paths for more models to be trained in specialized domains where training data is limited.

The prospect raises questions about reliability, sustainability, and whether synthetic datasets can continue to scale without introducing hidden flaws.

UICoder was also trained under carefully controlled conditions, and its success in wider settings is not guaranteed.

Via 9to5mac

