The OSI's take on this is that an open source model can be modified through fine-tuning, etc., even if you can't rebuild it from scratch.
The problem with requiring "build from scratch" for open source models is that the number of interesting models with training data that can be openly licensed is close to zero.
If you trained your model on an unlicensed scrape of the web you can't release the data under an open source license!
I would personally disagree slightly with this take. "Freely being able to use" means, IMHO, that this can be done for all applications in a legal (and ideally ethical) fashion. Regulation often requires you to prove the quality or provenance of data. Open source, IMHO, often takes a very libertarian view of things, focusing on the rights of the user and not on society in general.
With free/libre software, freedom and liberty are actually about what the end user is empowered to do; the software is mostly metonymic. Free software, free society, because of course there are free people in the middle.
Right, as I said elsewhere, maybe let's just let "open-source" have it.
"Open-source" can be "anything you can go out and grab a copy of and use" but doesn't give you much legal certainty about any of it, and reserve "free software" for the other, better thing.
But free software lost its way around GPLv3. From the end user's perspective, GPLv3 says that you can only use the software if it's a cloud service, if it's on hypothetical open-firmware devices, or if you install it yourself.
AGPLv3 partially solves the issue by blocking people like Google from using it to build proprietary cloud services that take away their users' freedom. (It still doesn't solve the problem where providers use network effects to achieve the same end game.)
I've been pronouncing both of them as /dʒis/, like "hiss", and not /dʒɪz/. However, I am not a native speaker of English. I wonder if native speakers gravitate towards the z more?
Same way I pronounce my first name btw ;) but I think of "gif" as "gift" and this is probably the subconscious association people make without realizing it.
Which is why I find it fun to bring up that in Old English "gift" hadn't yet picked up the "t" and was spelled "gif", but in Old English "g" was most commonly "HY". I like the Old English pronunciation of "gif" as "HYEEF", which is a "compromise" position that often makes some of both soft-g and hard-g "gif" pronunciation fans angry.
Basically, hallucinations are false external things, and delusions are false internal things. You hallucinate a pink elephant; you delude yourself into thinking Trump won 2020.
Devil's advocate here: I can give you a binary of my open source MIT code and never show you the code. The code is still MIT licensed, and open source. You just have no access to it.
That said, I entirely agree that MS is misrepresenting their openness here, which isn’t in the least surprising.
In their defense, most everyone else does the same thing. They still shouldn't do it, but at least they're not the trendsetter here (though they are contributing to the ongoing problem)
"Open weights" is not exactly right either, because we do get the source of the software that uses those open weights.
Maybe open inference?
But we often also get source code for fine-tuning the model.
So maybe it's closer to open source than to anything else?
Isn't it a bit like refusing to call a game open source because the engine tooling used to make it isn't open source and they didn't publish the .psd files with the asset designs?
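For what it's worth, "open weights plus inference code" in practice usually looks something like this (a rough sketch using Hugging Face transformers; the model name is just a placeholder, not a real repo):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # You get the weights and the code that runs them,
    # but not the data or recipe used to train them.
    name = "some-vendor/some-open-weights-model"  # placeholder
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tok("Hello", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))

Everything below the weights (data curation, training code, post-training recipe) stays closed, which is what the whole "open weights vs open source" argument is about.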
I mean, you have "AI" which means just about anything in marketing speak, "Agentic" is kind of becoming similar, hopefully they don't goof that one too badly, would be nice to know what you are trying to sell me. Used to be "Cloud" meant storage not just hosting (I guess it still does).
Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.
I do think "Open Weight" should be more commonly used. There's definitely communities that spring up that build the training infrastructure and inference infrastructure around open models on the other hand.
I'm genuinely torn on this one; I get why it technically isn't, but the reason I have no problem with it is the wishy-washiness of "open source" generally.
As I teach this stuff to people newer to this tech, it's probably just easier and more helpful to refer to the wide array of "stuff you can just download and use yourself" as "open-source", and then after that go deeper and talk about why Stallman was right, how "Free Software" came first, etc.
> “This means a future of abundance. A future where there is no poverty, where people can have whatever they want in terms of goods and services.” – Elon Musk
> “I think we see a path now where the world gets much more abundant and much better every year.” – Sam Altman
https://www.diamandis.com/blog/elon-sam-abundance
It is not good for text-to-speech (TTS) either. I have been trying it for a few days. First of all, documentation for the 1.5B model is not there. The 0.5B realtime model is a shit model. I was converting text line by line, and it was randomly adding music and couldn't handle special characters like "…".
I'm really disappointed with this model, to say the least.
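If anyone else hits the special-character issue, a crude sanitizing pass before synthesis might help as a workaround (a hypothetical sketch, nothing VibeVoice-specific; the input file name is made up):

    import re

    def sanitize_for_tts(text: str) -> str:
        # Map typographic characters to plain ASCII before synthesis
        replacements = {
            "\u2026": "...",               # horizontal ellipsis
            "\u2018": "'", "\u2019": "'",  # curly single quotes
            "\u201c": '"', "\u201d": '"',  # curly double quotes
            "\u2013": "-", "\u2014": "-",  # en and em dashes
        }
        for bad, good in replacements.items():
            text = text.replace(bad, good)
        # Drop anything else outside printable ASCII
        return re.sub(r"[^\x20-\x7e\n]", "", text)

    lines = [sanitize_for_tts(line) for line in open("script.txt")]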
To be fair, his Midas touch is a result of consistency and a lot of hard work.
It's like the gardener at one of the Oxford colleges said - it's really easy to create these perfect lawns, just turn up every day and trim and water it - for a couple hundred years.
There is so much more subversive marketing out there than any of us can really fathom. I try not to be too paranoid, but it's getting a lot harder every day.
I know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. After having a few discussions with him, and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that Microsoft works at, compared to something as comparatively quaint as the music industry.
[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...
I've been using VibeVoice's ASR (speech-to-text) model quite intensively for the past month and have found it to be a lot more reliable and functional out of the box than Whisper, Parakeet, and other models. The fact that it has diarization built into the model is a huge win in my book. Without that you have to run a separate model just for diarization, which adds significantly to the overall processing time, whereas VibeVoice reliably gives you great results on its own. Big fan.
In my mind, Vibe-anything means "some slop carelessly thrown together to ship as fast as possible." Wild that it's being used in a serious product name!
"get offended" is just what the clickbait news cycle made of it. It was based on the post at [1], and this is all it said:
> We need to get beyond the arguments of slop vs sophistication and develop a new equilibrium in terms of our “theory of the mind” that accounts for humans being equipped with these new cognitive amplifier tools as we relate to each other
When a CEO says "We need to get beyond the arguments of X", it is universally a polite, PR-scrubbed way of saying "Please stop talking about X, it is hurting our business", which is how the media interpreted it.
[1] https://snscratchpad.com/posts/looking-ahead-2026/
Look at the "News" section in the readme - The original TTS model is gone from this repo (you can still find it other places), but the SST/ASR, long form TTS, and streaming TTS models are newer.
When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls
People will also post their own interpretations in response to comments, and quickly find out they missed something.
… But if you try to automate it, like including a summary under every HN post, you encourage laziness too much and are pre-chewing too heavily. Some balance is needed here.
[on topic]
(OK I’m done making excuses, time to read the article… thanks for the encouragement!)
I thought this was not explained in the readme directly, but in fact I missed it. I wasn't going to read Microsoft's entire changelog! But it was substantive; thanks to the sibling commenter:
“2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”
Note that this just covers the Speech-to-Text/Speech-Recognition aspect (à la Whisper); there are also models for long-form Text-To-Speech and streaming Text-To-Speech.
I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?
I took a look into local options for ASR and diarization some months ago, I missed that VibeVoice now has this feature.
My conclusion back then (which came only from shallow research on the topic and zero real experience, mind you) was that Whisper + Pyannote was the "stable" approach.
Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?
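For reference, the two-model pipeline looks roughly like this (a rough sketch: the input file name and the naive merge-by-midpoint step are my own assumptions, and the pyannote pipeline requires accepting its license and having a Hugging Face token):

    import whisper
    from pyannote.audio import Pipeline

    audio = "meeting.wav"  # hypothetical input

    # 1) Transcribe with Whisper: timestamped text segments
    asr = whisper.load_model("medium")
    segments = asr.transcribe(audio)["segments"]

    # 2) Diarize with Pyannote: speaker turns
    dia = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    turns = dia(audio)

    # 3) Naive merge: tag each segment with the speaker whose
    #    turn contains the segment's midpoint
    for seg in segments:
        mid = (seg["start"] + seg["end"]) / 2
        speaker = next(
            (spk for turn, _, spk in turns.itertracks(yield_label=True)
             if turn.start <= mid <= turn.end),
            "unknown",
        )
        print(f"[{speaker}] {seg['text'].strip()}")

Running two models plus the merge bookkeeping is exactly the overhead a single model with built-in diarization avoids.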
Local? No idea. Cloud? ElevenLabs, probably. But it's described as "cloning", not "training". Not sure what the distinction is or why it matters if the end result is that you can generate any TTS that sounds like you. There might very well be an important one; I just don't know it.
Seems quite heavy for an STT model; Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?
The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck
https://github.com/microsoft/VibeVoice/issues/102
I care that I know what I can DO with the project when I see it described as "open source".
Yes, the first of which is that you should be able to build it from source. Which requires the source code, and in this case data.
The Open Source Initiative have a bunch of their thinking around this in their FAQ for the "Open Source AI definition": https://opensource.org/ai/faq#isn-t-training-data-required-t...
https://huggingface.co/allenai/OLMo-2-0325-32B
Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool - which includes 8 trillion tokens from Common Crawl.
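If you want to poke at that mix yourself, streaming is the only sane option at that scale. A hypothetical sketch with the Hugging Face datasets library (the split name and the "text" field are assumptions about the dataset layout):

    from datasets import load_dataset

    # Stream rather than download: ~8T tokens won't fit on disk
    ds = load_dataset("allenai/dolma3_pool", split="train", streaming=True)

    # Peek at the first few documents
    for i, doc in enumerate(ds):
        print(doc.get("text", "")[:200])
        if i >= 2:
            break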
Maybe we should have a little cue card for models: vendor/name, size, open weights, open source, permissive license.
It's a simple enough idea.
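As a sketch of what such a cue card could look like (hypothetical, using the OLMo model mentioned elsewhere in the thread as the example):

    from dataclasses import dataclass

    @dataclass
    class ModelCueCard:
        vendor: str
        name: str
        size: str             # e.g. "32B"
        open_weights: bool    # can you download the weights?
        open_source: bool     # can you rebuild it (data + training code)?
        permissive_license: bool

    card = ModelCueCard(
        vendor="allenai", name="OLMo-2-0325-32B", size="32B",
        open_weights=True, open_source=True, permissive_license=True,
    )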
"Open-source" can be "anything you can go out and grab a copy of and use" but doesn't give you much legal certainty about any of it, and reserve "free software" for the other, better thing.
AGPLv3 partially solves the issue by blocking people like Google from using it to build proprietary cloud services that take away their users' freedom. (It still doesn't solve the problem where providers use network effects to achieve the same end game.)
What in the world do you mean?
This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.
Way early on (spring 2023) people tried to stop it, but no luck.
A delusion is a false mental belief.
Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.
I do think "Open Weight" should be more commonly used. There's definitely communities that spring up that build the training infrastructure and inference infrastructure around open models on the other hand.
As I teach this stuff to people newer to this tech, it's probably just easier and more helpful to refer to the wide array of "stuff you can just download and use yourself" as "open-source" and then after that, go deeper and talk about why Stallman was right, how "Free Software" was first. etc.
> “I think we see a path now where the world gets much more abundant and much better every year.” – Sam Altman
https://www.diamandis.com/blog/elon-sam-abundance
Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.
https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...
which is Microsoft for "we removed two dead links". AI innovation knows no limits!
[1] https://doublepulsar.com/microsoft-vibing-capturing-screensh...
Why?
https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...
Elevenlabs in the cloud.
- Cohere Transcribe (self hosted)
- Grok Speech To Text (they provide an API, only $0.10/hr!)
They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?
Sept 2025 https://news.ycombinator.com/item?id=45114245