Preserving a legacy of conferences
A non-profit organization holds a unique archive: decades of conference transcripts stored in Microsoft Word 97/2000 files.
The transcripts were rich in content but poor in structure: long documents without proper titles, headings, or metadata. Making them available online, searchable, and SEO-friendly required a modern, scalable solution.
This post explains how we achieved that transformation with a mix of Markdown, Pandoc, and GPT-powered automation, at a total cost of about $30.
Structuring legacy content
The first challenge was to add structure to the old Word files:
- Applied properly styled titles to the previously unstructured transcripts.
- Converted Word to Markdown using Pandoc, ensuring a clean, flexible format for web publication (a conversion sketch follows).
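A minimal sketch of that conversion, assuming the legacy files are first upgraded to .docx with LibreOffice, since Pandoc reads .docx but not the older binary .doc format (directory names here are hypothetical):

```python
import subprocess
from pathlib import Path

SRC = Path("word_originals")   # hypothetical input directory
OUT = Path("markdown")         # hypothetical output directory
OUT.mkdir(exist_ok=True)

for doc in SRC.glob("*.doc"):
    # Pandoc cannot read the old binary .doc format, so convert to
    # .docx first with LibreOffice in headless mode.
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "docx",
         "--outdir", str(SRC), str(doc)],
        check=True,
    )
    # Then let Pandoc produce clean Markdown for web publication.
    docx = doc.with_suffix(".docx")
    subprocess.run(
        ["pandoc", str(docx), "-f", "docx", "-t", "markdown",
         "-o", str(OUT / (doc.stem + ".md"))],
        check=True,
    )
```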
Splitting content into sections
Instead of publishing massive files, we needed one Markdown file per conference section.
ChatGPT generated a custom Python script that split Markdown files into multiple smaller files based on titles.
This step alone transformed chaotic transcripts into readable, accessible documents.
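The actual script came out of a ChatGPT conversation; a minimal sketch of the same idea, assuming each section is marked by a level-2 heading (file and directory names are hypothetical):

```python
import re
from pathlib import Path

def split_markdown(path: Path, out_dir: Path) -> None:
    """Split one Markdown transcript into one file per titled section."""
    out_dir.mkdir(parents=True, exist_ok=True)
    text = path.read_text(encoding="utf-8")
    # Split just before each level-2 heading, keeping the heading line
    # with its section (zero-width lookahead split).
    sections = [s for s in re.split(r"(?m)^(?=## )", text) if s.strip()]
    for i, section in enumerate(sections):
        match = re.match(r"## (.+)", section)
        if match:
            # Build a filesystem-safe slug from the section title.
            slug = re.sub(r"[^a-z0-9]+", "-", match.group(1).lower()).strip("-")
        else:
            slug = "preamble"  # text before the first heading, if any
        (out_dir / f"{i:03d}-{slug}.md").write_text(section, encoding="utf-8")

# Hypothetical usage:
split_markdown(Path("markdown/1998-conference.md"), Path("content/1998-conference"))
```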
To optimize for browsing and search, we asked GPT-4o to analyze each section (a sketch of the call follows this list) and:
- Extract keywords for thematic browsing and SEO.
- Generate frontmatter for Astro, including:
  - `title` (Google title)
  - `description` (Google SEO description)
- Assign dates by parsing the human-readable dates embedded in the original file titles.
- Suggest intermediary titles to improve readability and flow.
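We won't reproduce the full, iteratively tuned prompt here, but a condensed sketch of the enrichment call using the OpenAI Python SDK (the prompt wording and function name are assumptions) looks like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Condensed, illustrative prompt; the real one was tuned over many runs.
PROMPT = """You are indexing conference transcripts.
For the Markdown section below, return Astro YAML frontmatter with:
- title: a concise, Google-friendly title
- description: an SEO description (max 160 characters)
- keywords: a short list for thematic browsing
- date: parsed from this human-readable file title: "{file_title}"
Also suggest intermediary headings to improve readability.

{text}"""

def enrich_section(file_title: str, text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": PROMPT.format(file_title=file_title, text=text)}],
    )
    return response.choices[0].message.content
```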
Iterative workflow
Automation at scale required care. The process looked like this:
- Iteratively ran the script on all Markdown files in a directory.
- Reviewed changes with `git diff` to verify metadata and titles.
- Stopped the script and tuned the GPT prompt whenever adjustments were needed.
- Once results were solid, reran the process on all 1,500 files in batch mode (sketched below).
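A minimal sketch of that run-review-tune loop; the batch size, paths, and the `enrich_section_file` helper are hypothetical:

```python
import subprocess
from pathlib import Path

BATCH_SIZE = 20  # hypothetical number of files between human reviews

def review_pause() -> bool:
    """Show pending changes so a human can accept them or stop the run."""
    subprocess.run(["git", "diff", "--stat"], check=True)
    return input("Keep going? [y/N] ").strip().lower() == "y"

files = sorted(Path("content").rglob("*.md"))
for start in range(0, len(files), BATCH_SIZE):
    for md in files[start:start + BATCH_SIZE]:
        enrich_section_file(md)  # hypothetical wrapper around the GPT call above
    if not review_pause():
        break  # stop here, tune the prompt, then rerun the script
```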
This balance of AI automation with human oversight ensured quality and consistency.
The scale of the project
- 2,500 web pages generated
- 1.8 million words processed
- 6 million GPT-4o tokens consumed, the energy equivalent of a 300 km car journey
- About $30 total cost for GPT processing
- Automatic daily publication until 2032
- Immediate access to the full archive for authorized members via password
- Classification by date and theme
- Keyword-based search for easy navigation
A living digital archive
Thanks to this workflow, the non-profit now has a self-sustaining publication pipeline:
- A steady stream of daily conference publications extending into the next decade
- A browsable, SEO-friendly archive that will continue to attract readers and researchers
- A digital preservation strategy that respects tradition while embracing modern tools
This project shows how AI-assisted automation can help even resource-limited organizations bring their cultural heritage online.
Lessons learned
- Cost control is possible: By running scripts iteratively, reviewing diffs, and only scaling once satisfied, we kept the entire project under $30.
- Human-in-the-loop matters: Automated metadata generation is powerful, but occasional prompt tuning and git-based review were essential for accuracy.
- User expertise is crucial: The person guiding the AI must know enough about the problem to steer it. For example, explicitly asking GPT to use Python's BeautifulSoup library ensured proper handling of embedded HTML fragments within the Markdown content (illustrated after this list).
- Environmental awareness is key: Processing 6 million tokens had a measurable carbon footprint. Future work should explore greener infrastructure and ways to minimize energy use.
- AI is a collaborator, not a replacement: The best results came from combining GPT automation with careful human validation.
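To make the BeautifulSoup point concrete: Word exports often leave HTML islands inside the Markdown, and a real parser copes with entities and nested tags that naive string splitting would mangle. An illustrative fragment (the exact handling lives inside the generated script):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML island left behind by a Word export.
fragment = '<p class="MsoNormal">Opening <b>address</b>&nbsp;by the chair</p>'

# The parser resolves entities and nesting that string slicing would break.
soup = BeautifulSoup(fragment, "html.parser")
print(soup.get_text(" ", strip=True))  # -> Opening address by the chair
```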
Looking ahead
The next steps include:
- Adding cross-linking between related conferences.
- Exploring multilingual summaries for international reach.
- Populating pages with AI-indexed images to create a richer browsing experience.
- Evaluating sustainability practices to further minimize the environmental impact of large-scale AI processing.
While the Python script querying GPT-4o was carefully tuned to minimize hallucinations, complete accuracy cannot be guaranteed. To ensure quality, members of the non-profit are gradually reviewing and comparing the outputs. Future improvements could include leveraging LangChain or a similar framework to have a local AI split the large volume of text into manageable chunks and automatically compare input against output to flag discrepancies, further reducing the manual verification effort.
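A local, framework-free version of that comparison could be as simple as a paragraph-level similarity check; the threshold and paragraph-splitting strategy below are assumptions:

```python
import difflib

def flag_discrepancies(source: str, rewritten: str, threshold: float = 0.9):
    """Yield source paragraphs with no close match in the output.

    A crude stand-in for the automated comparison described above:
    every source paragraph should survive the pipeline nearly verbatim,
    since only structure and metadata were supposed to change.
    """
    out_paragraphs = [p for p in rewritten.split("\n\n") if p.strip()]
    for para in (p for p in source.split("\n\n") if p.strip()):
        best = max(
            (difflib.SequenceMatcher(None, para, cand).ratio()
             for cand in out_paragraphs),
            default=0.0,
        )
        if best < threshold:
            yield para  # this paragraph deserves a human look
```

Anything flagged this way would go to a human reviewer rather than being trusted blindly.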
With a bit of Python, a lot of Markdown, and the power of AI, what began as static Word files has become a dynamic web archive—accessible, searchable, and preserved for the future.
The conferences will be published daily on the non-profit's website, maximizing Google reach without overwhelming the site with too much content at once. This approach ensures steady visibility, improved SEO, and gradual access for readers.