Just discovered Microsoft has a tool to convert documents (pdf, docx, pptx, xlsx, html, Outlook messages...) to Markdown, as well as transcribe audio and YouTube links! https://github.com/microsoft/markitdown
-
@DavidGoldfield @chikim Oh my goodness! I ask, and then I receive! Just a few minutes ago, I wanted to convert a Word document directly into Markdown. Thank you!
-
@jcast432 @DavidGoldfield @chikim For something that might be more presentable, try https://pandoc.org/app, or the pandoc CLI. Pandoc is the well-known standard tool for document conversion.
-
@marvellousmachine @chikim @ondrosik Any particular reason pandoc doesn't cut it here? Is it the transcription? Or does pandoc not read some of those formats like PDF?
-
@x0 @marvellousmachine @ondrosik Not sure if Pandoc has support for OCR, Outlook messages, speech transcription, LLM and MCP server support, etc. Total speculation, but I suspect they created it specifically to digest all kinds of documents for LLM training.
-
@x0 @marvellousmachine @ondrosik A lot of people also mentioned that docling is better! It might be worth checking out.
-
@chikim @marvellousmachine @ondrosik It actually says so in the README: MS did create it for training AI, which speaks Markdown. And yeah, those features are definitely not part of pandoc.
-
@chikim @marvellousmachine @ondrosik Now what I'd love to see is something that can take, say, a PDF mathematical paper and convert it into Markdown and LaTeX. Such tools exist (marker-pdf and nougat-ocr), but IDK how workable they are with structured data like tables, and I haven't paired them with cloud AI just yet because I haven't got the tokens and usage is a bit difficult.
