Microsoft has released the latest version of its Phi-3.5 small language model. This new version is a significant upgrade over the previous generation, outperforming small models from major players like Google, OpenAI, Mistral, and Meta on several important metrics.
Phi-3.5 is available in 3.8 billion, 4.15 billion, and 41.9 billion parameter versions. All three are available as free downloads and can be run locally with a tool like Ollama.
It performed particularly well in reasoning, beaten only by GPT-4o-mini among the leading small models. It also did well on math tests, far outperforming Llama and Gemini.
Small language models like Phi-3.5 demonstrate efficiency improvements in AI and lend credence to OpenAI CEO Sam Altman's goal of creating intelligence that is too cheap to meter.
What's new in Phi-3.5
🔥 The new Phi-3.5 models are now in the Open LLM leaderboard! • Phi-3.5-MoE-instruct leads all Microsoft models with an average score of 35.1, ranking 1st in category 3B and 10th among all chat models • Phi-3.5-mini-instruct scored 27.4 points, taking 3rd place in category 3B… pic.twitter.com/yNcOR2bcxX (August 22, 2024)
Phi-3.5 also comes in a vision version that can understand images as well as text, and a mixture-of-experts (MoE) version that distributes learning tasks across different subnetworks for more efficient processing.
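To make the mixture-of-experts idea concrete, here is a minimal sketch of top-k gated routing in NumPy. This is an illustration of the general technique, not Phi-3.5's actual implementation; the function name `moe_forward` and all shapes are assumptions for the example.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route input x to the top_k experts by gate score and combine
    their outputs, weighted by a softmax over those scores."""
    scores = x @ gate_w                      # one gating score per expert
    top = np.argsort(scores)[-top_k:]        # indices of the best-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # renormalize over the chosen experts
    # Only the selected experts run, which is the source of the efficiency win:
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
gate_w = rng.standard_normal((8, 4))         # gate for 4 experts
experts = [lambda v, W=rng.standard_normal((8, 8)): v @ W for _ in range(4)]

y = moe_forward(x, gate_w, experts)
print(y.shape)  # (8,)
```

The key point is that only `top_k` of the experts are evaluated per input, so a model with a large total parameter count activates only a fraction of them on each forward pass.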
The mixture-of-experts model outperforms Gemini 1.5 Flash, the model used in the free version of the Gemini chatbot, on several benchmarks, and it has a large context window of 128,000 tokens. While that is significantly smaller than Gemini's own context window, it is on par with ChatGPT and Claude.
The main advantage of a very small model like the one I installed is that it could be paired with an app or even installed on an IoT device like a smart doorbell. This would allow facial recognition without sending data to the cloud.
The smallest model was trained on 3.4 trillion tokens of data using 512 Nvidia H100 GPUs over 10 days. The mixture-of-experts model combines 16 experts of 3.8 billion parameters each, was trained on 4.9 trillion tokens, and took 23 days to train.
How well does Phi-3.5 actually work?
I installed and ran the smaller 3.8 billion parameter version of Phi-3.5 on my laptop and found it less impressive than the benchmarks suggest. Its responses were verbose, but the wording often left much to be desired, and it struggled with some simple tests.
I asked it a classic test question: “Write a short one-sentence story in which the first letter of each word is the same as the last letter of the previous word.” Even after clarification, it failed miserably.
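The constraint in that prompt is easy to check mechanically, which is what makes it a handy small-model test. Here is a short checker for it; the function name `chain_story` and the example sentences are my own, not from the test I ran.

```python
def chain_story(sentence):
    """Return True if every word starts with the last letter of the
    previous word (case-insensitive, ignoring surrounding punctuation)."""
    words = [w.strip(".,!?;:'\"") for w in sentence.split()]
    return all(b[0].lower() == a[-1].lower()
               for a, b in zip(words, words[1:]))

print(chain_story("Sam made eggs slowly"))  # True:  m→m, e→e, s→s
print(chain_story("The cat sat there"))     # False: "The" ends in e, "cat" starts with c
```

A model that understands the instruction only needs to satisfy this check; Phi-3.5-mini's answers did not.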
I haven't tried the larger mixture-of-experts model. However, benchmarks suggest it addresses some of the issues with the version I tried, and that its output is similar in quality to OpenAI's GPT-4o-mini, the model that ships with the free version of ChatGPT.
One area where it seems to outperform GPT-4o-mini is STEM and social science benchmarks. Its architecture allows it to maintain efficiency while handling complex AI tasks across different languages.