Turn Word files into SEO-optimized web pages: Boost your content reach with AI

Olivier Carrère
#DITA#Markdown#Information Typing#AI#Python#Digital Archives#Automation#SEO

Preserving a legacy of conferences

A non-profit organization holds a unique archive: decades of conference transcripts stored in Microsoft Word 97/2000 files.

They were rich in content but poor in structure—long documents without proper titles, headings, or metadata. Making them available online, searchable, and SEO-friendly required a modern, scalable solution.

This post explains how we achieved that transformation with a mix of Markdown, Pandoc, and GPT-powered automation, at a total cost of about $30.
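The post does not show the Pandoc step itself, but a minimal conversion wrapper might look like the sketch below. The file and directory names are assumptions, and note that Pandoc reads `.docx` but not legacy `.doc`, so Word 97/2000 files first need a pass through a converter such as LibreOffice:

```python
import subprocess
from pathlib import Path

def pandoc_command(doc_path: Path, out_path: Path) -> list[str]:
    """Build the Pandoc call that emits GitHub-flavored Markdown."""
    return ["pandoc", str(doc_path), "-t", "gfm", "-o", str(out_path)]

def convert_to_markdown(doc_path: Path, out_dir: Path) -> Path:
    """Convert one Word file to Markdown alongside the others."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (doc_path.stem + ".md")
    subprocess.run(pandoc_command(doc_path, out_path), check=True)
    return out_path

# Legacy .doc files must first become .docx, e.g. with
# `soffice --headless --convert-to docx *.doc` (LibreOffice).
for doc in Path("word-archive").glob("*.docx"):  # directory name is made up
    convert_to_markdown(doc, Path("markdown"))
```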

Structuring legacy content

The first challenge was to add structure to the old Word files:

Splitting content into sections

Instead of publishing massive files, we needed one Markdown file per conference section. ChatGPT generated a custom Python script that split Markdown files into multiple smaller files based on titles.

This step alone transformed chaotic transcripts into readable, accessible documents.
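A minimal sketch of such a splitter is shown below; the heading pattern, numbering scheme, and slug logic are assumptions rather than the actual generated script:

```python
import re
from pathlib import Path

def split_by_headings(md_text: str) -> list[tuple[str, str]]:
    """Split Markdown into (title, body) chunks at level-1/2 headings."""
    sections = []
    title, lines = "front-matter", []
    for line in md_text.splitlines():
        m = re.match(r"^#{1,2}\s+(.*)", line)
        if m:
            if lines:  # close the previous section
                sections.append((title, "\n".join(lines).strip()))
            title, lines = m.group(1).strip(), []
        lines.append(line)
    if lines:
        sections.append((title, "\n".join(lines).strip()))
    return sections

def slugify(title: str) -> str:
    """Turn a section title into a safe, URL-friendly file name."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def write_sections(md_file: Path, out_dir: Path) -> None:
    """Write one numbered Markdown file per section of the input file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, (title, body) in enumerate(split_by_headings(md_file.read_text())):
        (out_dir / f"{i:03d}-{slugify(title)}.md").write_text(body + "\n")
```

Numbering the output files preserves the original reading order of the conference sections.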

Enriching metadata with GPT-4o

To optimize for browsing and search, we asked GPT-4o to analyze each section and generate a descriptive title along with SEO-friendly metadata.
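A hedged sketch of that enrichment step follows. The prompt wording, two-line reply format, and 8,000-character cap are illustrative assumptions, not the tuned prompt described in this post; the call uses the `openai` Python package and expects an `OPENAI_API_KEY` in the environment:

```python
try:
    from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY
except ImportError:            # let the pure-text helpers run without it
    OpenAI = None

PROMPT = (
    "You are an archivist. Read this conference transcript section and reply "
    "with two lines: 'title: <short descriptive title>' and "
    "'keywords: <3-5 comma-separated keywords>'. Do not invent facts."
)

def parse_reply(reply: str) -> tuple[str, list[str]]:
    """Pull the title and keyword list out of the model's two-line reply."""
    title, keywords = "Untitled", []
    for line in reply.splitlines():
        if line.lower().startswith("title:"):
            title = line.split(":", 1)[1].strip()
        elif line.lower().startswith("keywords:"):
            keywords = [k.strip() for k in line.split(":", 1)[1].split(",")]
    return title, keywords

def add_front_matter(body: str, title: str, keywords: list[str]) -> str:
    """Prepend YAML front matter so static-site generators can index it."""
    fm = ["---", f"title: {title}",
          f"keywords: [{', '.join(keywords)}]", "---", ""]
    return "\n".join(fm) + body

def enrich(body: str, model: str = "gpt-4o") -> str:
    """Ask the model for a title and keywords, then stamp them on the section."""
    if OpenAI is None:
        raise RuntimeError("the openai package is required for enrichment")
    resp = OpenAI().chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": body[:8000]}],
    )
    title, keywords = parse_reply(resp.choices[0].message.content)
    return add_front_matter(body, title, keywords)
```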

Iterative workflow

Automation at scale required care. The process looked like this:

  1. Iteratively ran the script on all Markdown files in a directory.
  2. Reviewed changes with git diff to verify metadata and titles.
  3. Stopped the script and tuned the GPT prompt whenever adjustments were needed.
  4. Once results were solid, reran the process on all 1,500 files in batch mode.

This balance of AI automation with human oversight ensured quality and consistency.

The scale of the project

In total, the pipeline processed roughly 1,500 Markdown files, with GPT-4o API usage costing about $30.

A living digital archive

Thanks to this workflow, the non-profit now has a self-sustaining publication pipeline: Word files are converted to Markdown, split into sections, enriched with titles and metadata, and published to the website.

This project shows how AI-assisted automation can help even resource-limited organizations bring their cultural heritage online.

Lessons learned

The main lesson is that AI automation scales only when paired with human oversight: iterative prompt tuning and reviewing changes with git diff caught issues that a fully automated batch run would have missed.

Looking ahead

The next steps focus on quality review and further automation.

While the prompt driving GPT-4o was carefully tuned to minimize hallucinations, complete accuracy cannot be guaranteed, so members of the non-profit are gradually reviewing the outputs against the originals. Future improvements could include using LangChain or a similar framework to have a local model split the large volume of text into manageable chunks and automatically compare input and output to flag discrepancies, further reducing manual verification effort.
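Pending such a framework, a simple standard-library check can already flag sections where the processed output drifts from the input. The similarity threshold here is an arbitrary assumption, and the front-matter stripping assumes the YAML block sketched earlier:

```python
import difflib

def flag_discrepancy(original: str, enriched: str,
                     threshold: float = 0.9) -> bool:
    """Return True when the enriched text diverges too much from the input,
    ignoring a leading YAML front-matter block added by the pipeline."""
    body = enriched
    if body.startswith("---"):
        parts = body.split("---", 2)
        if len(parts) == 3:
            body = parts[2]  # keep only the text after the closing ---
    ratio = difflib.SequenceMatcher(None, original.strip(), body.strip()).ratio()
    return ratio < threshold
```

Flagged files can then be queued for human review instead of re-reading all 1,500 by hand.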

With a bit of Python, a lot of Markdown, and the power of AI, what began as static Word files has become a dynamic web archive—accessible, searchable, and preserved for the future.

The conferences will be published daily on the non-profit’s website, maximizing Google reach while avoiding overwhelming the site with too much content at once. This steady cadence ensures consistent visibility, improved SEO, and gradual access for readers.
