
Transforming a Corpus of 7,000 Pages into Living Knowledge

Olivier Carrère
#AI #Knowledge #Curation #Discovery

From 7,000 Pages to Daily Insights

How can a non-profit turn 7,000 pages of articles—a vast sea of text—into content that is alive, readable, and meaningful every day?

```mermaid
flowchart LR
    Files@{ shape: cyl, label: "1.8 million words, 1,800 articles, 7,000 pages" } --> Public[Target audience]
```

This project extends a broader effort to leverage a massive corpus—1,800 articles, 1.8 million words, 7,000 pages—into a discovery system that makes exploration smarter, more meaningful, and deeply human. Instead of letting a monumental archive gather digital dust, the goal is to let it speak again—one insight, one quote, one topic at a time.


Closing the Gap with Content

Analytics from the non-profit’s website show a clear pattern: the audience is predominantly aging and male.

While these readers are loyal and deeply engaged, the profile highlights a challenge for the organization's future: its reach remains limited, leaving out younger visitors and potential readers whose interests, language, and expectations differ.

To become more inclusive and support generational renewal, the non-profit's communication now aims to better engage women and people aged 25 to 35, fostering a more balanced and sustainable community over time.

```mermaid
xychart-beta
    title "Age Distribution of Website Audience"
    x-axis ["18–24", "25–34", "35–44", "45–54", "55–64", "65+"]
    y-axis "Percentage (%)" 0 --> 30
    bar [0, 3.7, 15, 21.7, 28.4, 31.2]
```

To bridge this gap, the corpus of 1,800 articles is being used not only as a knowledge base but also as a dialogue tool. By aligning the content’s depth with the audience’s real concerns, the project aims to make long-form wisdom resonate with new generations of readers.

```mermaid
xychart-beta
    title "Gender Distribution of Website Audience"
    x-axis ["Female", "Male"]
    y-axis "Percentage (%)" 0 --> 100
    bar [22.2, 77.8]
```

As a first step, we asked an automated system to scan public forums, discussion boards, and thematic websites, gathering the ten most common questions this audience expresses about the non-profit’s field of activity. These questions become entry points—bridges between lived curiosity and archival insight.

[Image: a suspension bridge]

The next phase involves mapping those questions against the site's demographic profile, shown in the radar chart below.

The radar chart shows the website audience across six age groups, alongside gender distribution. Each axis displays the website-audience ratio relative to the general population, together with the corresponding male and female percentages. Values above 100% indicate that the site has proportionally more visitors than the general population in that age group.
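For example, visitors aged 65 and over make up 31.2% of the audience; if that bracket accounts for roughly 11% of the general population (an assumed census figure, used here only for illustration), the ratio is 31.2 / 11 ≈ 284%, matching the axis label below.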

```mermaid
---
title: "Website Audience and Gender Distribution by Age Group"
---
radar-beta
    axis a["18–24 (4%)"], b["25–34 (60%)"], c["35–44 (104%)"], d["45–54 (156%)"], e["55–64 (236%)"], f["65+ (284%)"]
    curve m["Male"]{77.8, 77.8, 77.8, 77.8, 77.8, 77.8}
    curve w["Age"]{4, 60, 104, 156, 236, 284}
    curve f["Female"]{22.2, 22.2, 22.2, 22.2, 22.2, 22.2}
    max 300
    min 0
```

Through this approach, the archive evolves from a static collection into a responsive, audience-aware ecosystem—one that listens as much as it speaks.

Exploring Audience Concerns Through Q/A Sessions

To better understand the interests and hesitations of different audience segments, we ran Q/A sessions with ChatGPT. These sessions helped highlight real concerns and questions from readers, which can guide content curation and engagement strategies.

For example, we asked:

“What are 25- to 35-year-olds most concerned about when it comes to mindfulness?”

The responses revealed three main areas of concern:

  1. Time and consistency — many feel too busy to practice regularly.
  2. Authenticity — skepticism about mindfulness being over-commercialized or shallow.
  3. Effectiveness and safety — worry that mindfulness might not work or could bring up difficult emotions.

The non-profit operates on a very limited budget and cannot afford to hire a consulting group or conduct formal market research. To still make well-informed, data-driven decisions, it relies on ChatGPT as a cost-effective way to explore audience concerns, identify trends, and guide content strategy from publicly available information.
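For teams that want to repeat such sessions systematically, the same question can be scripted against the OpenAI API rather than typed into ChatGPT by hand. Here is a minimal sketch; the model name and prompt wording are illustrative assumptions, not part of the project's published setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed model and prompt wording; adjust to the question at hand.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a market-research assistant for a non-profit."},
        {
            "role": "user",
            "content": "What are 25- to 35-year-olds most concerned about "
                       "when it comes to mindfulness? List the top three concerns.",
        },
    ],
)
print(response.choices[0].message.content)
```

Scripting the call makes the research repeatable: the same prompt can be re-run quarterly and the answers compared to spot shifting concerns.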

```mermaid
flowchart LR
    ScoreCheck[Target audience]
    ScoreCheck --> a["Topic 1: Time and consistency"]
    ScoreCheck --> b["Topic 2: Authenticity"]
    ScoreCheck --> c["Topic 3: Effectiveness & safety"]
```

Insights like these allow the non-profit to tailor its content and digests, ensuring the corpus of articles speaks directly to the needs and questions of younger audiences, while also informing social media and micro-blogging strategies.

Curating Content and Quotes from Identified Topics

Once the system had identified key topics within the corpus, the next step was to leverage those insights for discovery and engagement.

For each topic, the workflow involved:

```mermaid
flowchart LR
    Files@{ shape: cyl, label: "1.8 million words, 1,800 articles, 7,000 pages" }
    Files --> C@{ shape: docs, label: "1 article per day for 5 years on website" }
    Files --> Quotes{"Quote ranking & retrieval"}
    Files --> ScoreCheck{"Print: 200-page digests by topic"}
    Quotes --> blue[Micro-blogging]
    ScoreCheck --> E@{ shape: paper-tape, label: "Topic 1: Time and consistency" }
    ScoreCheck --> F@{ shape: paper-tape, label: "Topic 2: Authenticity" }
    ScoreCheck --> G@{ shape: paper-tape, label: "Topic 3: Effectiveness & safety" }
```

This process transforms the archive from a static repository into a living, thematic ecosystem, where content is both discoverable and shareable. Readers can explore topics in depth, while daily quotes keep the conversation active and ongoing, bridging the gap between long-form articles and real-time engagement.

From Files to Flow

All articles, thousands of them, form the foundation: the corpus. From there, three parallel processes begin: daily republication on the website, quote ranking for micro-blogging, and topic-based print digests.

Before any automated processing can help, the data must be clean and consistent: duplicates removed, formats standardized, and terminology aligned. This ensures a high-quality foundation for metadata generation, indexing, and content enrichment.
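As a rough illustration of that cleaning pass, here is a minimal Python sketch. It assumes one Markdown file per article in a corpus/ directory; the layout and the normalization rules are placeholders, not the project's actual pipeline:

```python
import hashlib
import re
import unicodedata
from pathlib import Path

CORPUS_DIR = Path("corpus")  # assumed layout: one Markdown file per article

def normalize(text: str) -> str:
    """Standardize Unicode form, quote characters, and whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"[ \t]+", " ", text).strip()

seen: dict[str, Path] = {}
for path in sorted(CORPUS_DIR.glob("*.md")):
    cleaned = normalize(path.read_text(encoding="utf-8"))
    digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
    if digest in seen:  # exact duplicate after normalization
        print(f"duplicate: {path} == {seen[digest]}")
    else:
        seen[digest] = path
        path.write_text(cleaned, encoding="utf-8")
```

Hashing the normalized text catches exact duplicates; near-duplicates would need fuzzy matching on top of this pass.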


Scoring, Metadata, and Thematic Discovery

Metadata transforms the corpus into a discoverable and analyzable knowledge base. Automated workflows assist in reading each article, identifying topics, tags, and relevance scores, and maintaining consistency across the entire collection.

A Python-based workflow evaluates every article, assigning semantic scores that reflect each topic’s presence and intensity. The result is a harmonized dataset, where every page carries structured metadata—ready for discovery, filtering, or print curation.
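The scoring workflow itself isn't published, but the idea can be sketched with off-the-shelf embeddings. The sketch below assumes the sentence-transformers library and the three topics identified earlier; the model choice and file layout are illustrative:

```python
import json
from pathlib import Path

from sentence_transformers import SentenceTransformer, util

# The three topics surfaced earlier; phrasings here are assumptions.
TOPICS = {
    "time_and_consistency": "finding time to practice regularly",
    "authenticity": "mindfulness feeling over-commercialized or shallow",
    "effectiveness_and_safety": "whether mindfulness works and is emotionally safe",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
topic_vecs = model.encode(list(TOPICS.values()), normalize_embeddings=True)

for path in sorted(Path("corpus").glob("*.md")):
    article_vec = model.encode(path.read_text(encoding="utf-8"), normalize_embeddings=True)
    scores = util.cos_sim(article_vec, topic_vecs)[0]
    metadata = {name: round(float(s), 3) for name, s in zip(TOPICS, scores)}
    # Persist scores as a JSON sidecar next to each article.
    path.with_suffix(".json").write_text(json.dumps(metadata, indent=2))
```

Writing scores to sidecar files keeps the metadata versionable alongside the articles themselves.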


From Digests to Dimensions

Once scored and organized, the corpus is ready to be recomposed into focused digests—each a lens for re-seeing the same field of knowledge from new angles.

These digests reveal different dimensions of the corpus, guiding readers to explore content thoughtfully and meaningfully.
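Given per-article topic scores like those in the previous sketch, composing a digest reduces to a selection problem: take the strongest articles for a topic until the page budget is spent. A sketch under the same assumed file layout (the words-per-page figure is derived from the corpus totals, roughly 1.8 million words over 7,000 pages):

```python
import json
from pathlib import Path

WORDS_PER_PAGE = 250   # rough figure implied by 1.8M words over 7,000 pages
PAGE_BUDGET = 200      # target digest size from the workflow diagram

def build_digest(topic: str) -> list[Path]:
    """Pick the highest-scoring articles for a topic until the page budget is met."""
    scored = []
    for meta_path in Path("corpus").glob("*.json"):
        score = json.loads(meta_path.read_text())[topic]
        scored.append((score, meta_path.with_suffix(".md")))
    digest, pages = [], 0.0
    for score, article in sorted(scored, reverse=True):
        pages += len(article.read_text(encoding="utf-8").split()) / WORDS_PER_PAGE
        if pages > PAGE_BUDGET:
            break
        digest.append(article)
    return digest

print(len(build_digest("time_and_consistency")))
```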


From Quotes to Conversations

The best quotes don’t stay confined to pages—they move. Through micro-blogging, daily publishing, and social curation, the project opens channels for dialogue: the corpus becomes a conversation.

Each quote, ranked by resonance and context, is shared not as static text but as a living signal—bridging the long-form archive and the fast-moving web. Automated systems support this process by identifying patterns, key sentences, and recurring motifs, ensuring the most meaningful insights circulate outward.
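One plausible way to rank quotes "by resonance" is to score each sentence against a topic embedding and keep the strongest short ones. This sketch reuses the assumed sentence-transformers setup from above; the length limits and the naive sentence splitter are deliberate simplifications:

```python
import re
from pathlib import Path

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, as above
topic_vec = model.encode("finding time to practice regularly", normalize_embeddings=True)

candidates = []
for path in Path("corpus").glob("*.md"):
    # Naive sentence split; a real pipeline would use a proper segmenter.
    for sentence in re.split(r"(?<=[.!?])\s+", path.read_text(encoding="utf-8")):
        sentence = sentence.strip()
        if 60 <= len(sentence) <= 280:  # fits a micro-blog post
            candidates.append(sentence)

vecs = model.encode(candidates, normalize_embeddings=True)
scores = util.cos_sim(topic_vec, vecs)[0]
for score, quote in sorted(zip(scores.tolist(), candidates), reverse=True)[:5]:
    print(f"{score:.2f}  {quote}")
```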


Toward a Living Archive

This is more than content recycling; it’s content renewal. A living editorial process transforms static archives into intelligent, evolving knowledge.

Readers can explore topics in depth through the digests, follow one article a day on the website, and carry the best quotes into their own conversations.

The goal remains human: to distill clarity, preserve depth, and foster presence. Automation serves as an instrument for attention, not distraction—a tool to bring the archive to life again.


What emerges when archives breathe again isn’t noise—it’s continuity. A flow of attention, insight, and care that transforms reading into renewal.
