The hard part is getting POS tags. You'd need a tagger running before eSpeak, and lightweight ones exist (even rule-based), but it's a real architecture addition — not a YAML tweak. Hm. Send me ideas for good open-source POS taggers? This would be a game changer if Speechbox got it. Also a big architecture change I'm not tackling while sick, period.
-
@Tamasg ChatGPT says: after some actual research, Tamas, I went and read through TGSpeechBox a bit because apparently I enjoy voluntarily inspecting other people’s architecture choices now.
Your instinct is right: this is not a YAML-level feature. TGSpeechBox is split into a C++ DSP engine plus a C++ frontend with YAML packs, and the NVDA add-on currently uses eSpeak for text→IPA before the frontend turns IPA into timed frames. The repo docs are pretty explicit that the old Python runtime path is gone, and that the current runtime path is frontend + packs, with eSpeak feeding it upstream. So POS disambiguation really does want to live in a new pre-phonemizer text-analysis stage, not inside the pack rules.
Also, a lot of the clever stuff TGSpeechBox already does is downstream of that point: the text parser can insert syllable boundaries from a stress dictionary, the prominence pass inherits stress marks coming from eSpeak, and multiple timing/coarticulation passes operate on the IPA/token stream after phonemization. That means a POS layer would be architecturally cleanest if it runs before eSpeak and either picks pronunciations for known heteronyms or annotates tokens so the phonemizer/front end can choose the right path.
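To make that "small pre-phonemizer analysis stage" concrete, here is a rough stdlib-Python sketch of the shape of it. The tagger here is a made-up two-rule stand-in (any real tagger like UDPipe or spaCy would slot in behind the same function), so treat it as an illustration of where the stage sits, not as a working tagger:

```python
import re

def tag_tokens(tokens):
    """Stand-in for a real POS tagger (UDPipe, spaCy, ...).
    Toy rule: a token right after 'to' or a pronoun is a verb,
    everything else is a noun. Purely illustrative."""
    tags = []
    prev = ""
    for tok in tokens:
        if prev.lower() in {"to", "i", "we", "they", "you"}:
            tags.append("VERB")
        else:
            tags.append("NOUN")
        prev = tok
    return tags

def analyze(text):
    """Pre-phonemizer pass: tokenize, tag, and emit (token, pos)
    pairs for a downstream stage to consume before eSpeak runs."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return list(zip(tokens, tag_tokens(tokens)))
```

The point is just that the annotated token stream exists *before* phonemization, so a later step can pick pronunciations without touching the pack rules at all.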
If I were picking one open-source tagger as the best fit for TGSpeechBox specifically, I’d start with UDPipe 2. It is multilingual, trainable, and available as a C++ library as well as other bindings, with pretrained models for nearly all Universal Dependencies treebanks. For a project that is already mostly C++ and ships across Windows, Linux, Apple platforms, Android, and SAPI wrappers, that matters a lot more than raw benchmark glamour. It gives you a realistic path to “small native analysis stage before eSpeak” instead of “drag a Python or Java runtime into accessibility software and regret your life choices later.”
My second choice would be spaCy, but mostly as a prototype path, not the long-term runtime answer. spaCy’s pipelines include POS tagging and are designed to be efficient in speed and size, so it’s a good place to prove whether POS-based heteronym disambiguation actually moves the needle for English. But it is still fundamentally a Python-first stack, which feels awkward next to TGSpeechBox’s current native frontend/DSP architecture. Great for validating the idea quickly, less great if you want the feature to feel native everywhere.
For a very lightweight English-only proof of concept, NLTK’s averaged perceptron tagger is the simplest thing worth trying. It is a greedy averaged perceptron tagger, easy to wire up, and small enough to test the architecture without building a whole NLP subsystem first. I would treat it as a research scaffold, though, not the final answer for a multilingual speech engine.
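In case "greedy averaged perceptron" sounds heavier than it is, here is a toy Python sketch of the same idea: no weight averaging, only three hand-picked features, trained on a two-sentence corpus. It shows the mechanism NLTK's tagger uses, not anything you'd ship:

```python
from collections import defaultdict

class TinyPerceptronTagger:
    """Toy greedy perceptron tagger: the same shape as NLTK's
    averaged-perceptron tagger, minus averaging and rich features."""

    def __init__(self, tags):
        self.tags = tags
        # weights[feature][tag] -> score contribution
        self.weights = defaultdict(lambda: defaultdict(float))

    def _features(self, word, prev_tag):
        return [f"word={word.lower()}",
                f"prev={prev_tag}",
                f"suf3={word[-3:].lower()}"]

    def _predict(self, feats):
        scores = {t: sum(self.weights[f][t] for f in feats) for t in self.tags}
        return max(self.tags, key=lambda t: scores[t])  # greedy argmax

    def train(self, sentences, epochs=5):
        for _ in range(epochs):
            for sent in sentences:
                prev = "<s>"
                for word, gold in sent:
                    feats = self._features(word, prev)
                    guess = self._predict(feats)
                    if guess != gold:  # update weights only on mistakes
                        for f in feats:
                            self.weights[f][gold] += 1.0
                            self.weights[f][guess] -= 1.0
                    prev = gold

    def tag(self, words):
        prev, out = "<s>", []
        for w in words:
            t = self._predict(self._features(w, prev))
            out.append((w, t))
            prev = t
        return out
```

That's the whole trick: features from the word plus the previous tag, and weight bumps on mistakes. NLTK adds weight averaging and a much richer feature set on top.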
I would not start with Apache OpenNLP unless you already wanted a Java component for other reasons. It does have a POS tagger and can use a tag dictionary to constrain predictions, which is nice in principle, but the JVM dependency feels like the wrong kind of excitement for this codebase. Nobody wakes up hoping to debug Java glue inside a fast screen-reader speech stack.
So my actual recommendation would be:
Best native fit: UDPipe 2
Best quick experiment: spaCy
Cheapest English-only throwaway prototype: NLTK perceptron
Probably not worth the architectural pain here: OpenNLP
If you ever tackle it after the plague leaves your body, I’d scope it narrowly first: a tiny pre-eSpeak lexical disambiguator for a short heteronym list like record, present, permit, conduct, project, fed by POS tags only when the token is in that ambiguity list. That gets you most of the audible win without turning TGSpeechBox into a full NLP cathedral. The repo layout really suggests that kind of “small upstream analysis shim” is the least disruptive version of the idea.
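Sketched in Python, the whole shim is roughly the snippet below. It assumes eSpeak's double-bracket phoneme-input markup for forcing a pronunciation; the Kirshenbaum-style phoneme strings are illustrative guesses, not verified eSpeak output, and the POS pairs come from whatever tagger ends up upstream:

```python
# Narrow pre-eSpeak shim: POS is consulted only when a token is in the
# heteronym table; everything else passes through untouched.
# The [[...]] strings stand for eSpeak's phoneme-input markup and are
# illustrative, not checked against a real eSpeak dictionary.
HETERONYMS = {
    "record":  {"NOUN": "[[r'Ek@d]]",   "VERB": "[[rIk'O:d]]"},
    "present": {"NOUN": "[[pr'Ez@nt]]", "VERB": "[[prIz'Ent]]"},
}

def disambiguate(tagged_tokens):
    """tagged_tokens: (word, pos) pairs from any upstream tagger.
    Returns the text to hand to eSpeak, with heteronyms resolved."""
    out = []
    for word, pos in tagged_tokens:
        entry = HETERONYMS.get(word.lower())
        out.append(entry[pos] if entry and pos in entry else word)
    return " ".join(out)
```

Everything outside the short ambiguity list never touches the tagger's output, which is what keeps the blast radius small.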
-
@Lino0876 sadly not, I wish! My partner Jess did though, and honestly, since I missed the registration deadline (couldn't find out in time whether Spotify would cover it), maybe it's not that big a loss for me. Sad to miss the exhibit halls, but better to do the responsible thing and keep the cold at home.
-
@danestange oh this is a perfect fit honestly. UDPipe is a natural language processing toolkit from Charles University in Prague. It does tokenization, POS tagging, lemmatization, and dependency parsing using models trained on Universal Dependencies treebanks, which is a massive multilingual grammar project covering 100+ languages. So that's perfect honestly, the multilingual aspect of this is solid. Good advice here! Definitely for another day, but this gives me a solid foundation to build on.
But here's what kills it. Pretrained models: CC-BY-NC-SA 4.0 — that's the killer. NonCommercial. You can't distribute them in a commercial product, and even for free distribution the ShareAlike clause means derivative works must use the same license. So we'd need to train our own models, unfortunately. The licensing thing is such a bummer.
-
@Tamasg Yep, that’s the catch, unfortunately. I looked a little deeper and UDPipe itself is fine from a code-license standpoint, since the library is MPL-2.0 and it ships as a C++ library too, which is why it looked like such a nice fit for TGSpeechBox. But the pretrained linguistic models are explicitly non-commercial and distributed under CC BY-NC-SA, so your read is dead on there.

That probably shifts the “best future path” a bit. If the goal is “drop-in multilingual POS tagging with commercially safe pretrained models,” Stanza looks more promising on paper because the toolkit is Apache 2.0, supports 70+ languages, and can do tokenization, POS, lemmas, and dependency parsing in one pipeline. It also documents training your own POS models, so even if some resources behind particular packages get messy, the project itself is at least set up for retraining.

For a lighter-weight fallback, RDRPOSTagger is still interesting too. It’s very fast, supports about 80 languages with pretrained tagging models, and is much more in the “practical tagger” bucket than the full neural-stack cathedral humans keep building because apparently moderation is illegal. I’d still want to inspect the exact model/data licensing before blessing it for a product, though, because the repo page is clearer about capabilities than downstream model terms.

So I think your current conclusion is the sane one: UDPipe still looks architecturally right, but only if you train your own models from commercially usable corpora. Otherwise Stanza may be the better place to look next, especially if the first version is just an offline “heteronym disambiguation shim” rather than full deep syntax. That seems like the least cursed path for Speechbox.
@danestange oof. I looked into eSpeak and it actually handles "to record" and "for the record"! So we get some POS for free there. Here's what sucks, though.
- NVDA addon: currently MIT, links eSpeak (GPL3) at runtime. If we added RDRPOSTagger (GPL3) it'd force the addon to GPL3, which conflicts with NVDA's GPL2.
- SAPI: already GPL3, so RDRPOSTagger fits fine there.
- Mobile: already GPL3 from eSpeak linking, so fine.

So RDRPOSTagger would work for SAPI and mobile but kill the NVDA addon. We effectively couldn't ship it, because NVDA uses GPLv2 and upgrading us to GPLv3 would, like, kill it entirely. I hate licensing headaches! ha.
-