Decentralised Feeds & Web Searches


Decent Partner: XML to Database feed parsing

Purpose

This document explains the conceptual process of handling XML sitemaps from WordPress (or similar CMS platforms) and storing their contents in a database. The goal is to provide a clear framework for differentiating between sitemap indexes and URL sitemaps, and to outline how a system can navigate from the top-level feed down to the actual content URLs.

Core Concepts

Process Overview

  1. Entry Point: Begin with the top-level sitemap (commonly /wp-sitemap.xml).
  2. Identify Type: Inspect the root element.
    • If <sitemapindex> → treat as a directory of other sitemaps.
    • If <urlset> → treat as a list of content URLs.
  3. Branching:
    • For <sitemap> entries, follow each <loc> link to another XML file.
    • For <url> entries, collect the <loc> values as actual content URLs.
  4. Recursion: Continue following <sitemap> entries until only <url> entries remain.
  5. Storage: Insert the collected URLs and metadata into the database. The top-level sitemap itself can be stored as a reference, but the key data is the list of content URLs.
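The steps above can be sketched in Python. This is a minimal sketch, not the production implementation: the `fetch` callback, the `content_urls` table name, and the use of SQLite are all assumptions made for illustration; only the namespace constant is fixed by the sitemaps.org protocol.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org sitemap protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def collect_urls(xml_text, fetch):
    """Recursively collect content URLs starting from a sitemap document.

    `fetch` is a caller-supplied function mapping a URL to its XML text
    (e.g. a thin wrapper around an HTTP client); it is injected so the
    traversal stays independent of any particular transport.
    """
    root = ET.fromstring(xml_text)
    urls = []
    if root.tag == NS + "sitemapindex":
        # Directory of other sitemaps: follow each <loc> and recurse.
        for loc in root.iter(NS + "loc"):
            urls.extend(collect_urls(fetch(loc.text.strip()), fetch))
    elif root.tag == NS + "urlset":
        # Leaf sitemap: the <loc> values are the actual content URLs.
        urls.extend(loc.text.strip() for loc in root.iter(NS + "loc"))
    return urls

def store_urls(urls, conn):
    """Insert collected URLs into a simple, hypothetical table."""
    conn.execute("CREATE TABLE IF NOT EXISTS content_urls (url TEXT UNIQUE)")
    conn.executemany(
        "INSERT OR IGNORE INTO content_urls (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()
```

Because `fetch` is injected, the traversal can be exercised against in-memory XML strings (for example, a `dict` keyed by URL) without any network access.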

Conceptual Flow

  wp-sitemap.xml
     ├── wp-sitemap-posts-post-1.xml
     │       ├── URL 1
     │       ├── URL 2
     │       └── ...
     ├── wp-sitemap-posts-page-1.xml
     │       ├── URL A
     │       ├── URL B
     │       └── ...
     └── wp-sitemap-taxonomies-category-1.xml
             ├── Category URL X
             └── Category URL Y
    

Key Differentiation

The simplest way to differentiate between sitemap types is by checking the root element and the elements inside:

  • <sitemapindex> contains <sitemap> children → a directory of other sitemaps to follow.
  • <urlset> contains <url> children → a list of actual content URLs to collect.
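In code, this check reduces to inspecting the root element's tag. A minimal sketch (the function name is illustrative; the namespace is the one fixed by the sitemaps.org protocol):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org sitemap protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_kind(xml_text):
    """Return 'index' for a <sitemapindex>, 'urlset' for a <urlset>."""
    root = ET.fromstring(xml_text)
    if root.tag == NS + "sitemapindex":
        return "index"
    if root.tag == NS + "urlset":
        return "urlset"
    raise ValueError(f"unexpected root element: {root.tag}")
```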

Best Practices

  • Treat the sitemap index as navigation, not content; store it only as a reference.
  • Recurse through <sitemap> entries until only <url> entries remain.
  • Persist the collected content URLs and their metadata; intermediate sitemap files do not need to be stored separately.

Analogy

Think of the sitemap index as a table of contents. Each child sitemap is a chapter, and each <url> entry is a page. You don’t need to store every chapter file separately; you just need to know how to navigate from the table of contents down to the pages.

Conclusion

By differentiating between <sitemap> and <url> elements, a system can reliably traverse from the top-level sitemap index down to the actual content URLs. This process ensures efficient storage in the database and keeps the system aligned with search engine standards for sitemap handling.

Appendix: Proof of the Pudding in Domain Maps

The effectiveness of our sitemap ingestion and keyword mapping process can be demonstrated directly in the Domain Map.

How the Domain Map Functions

  • Entry Point: Each partner domain in the system has a link to its Domain Map.
  • Keyword Mapping: The Domain Map displays all extracted keywords associated with that domain.
  • Page Associations: Each keyword is linked to the specific pages discovered through the sitemap crawl.
  • Navigation: Users can click a keyword to see the list of pages where it appears, confirming the keyword-to-page relationship.

Why This Is Proof

The Domain Map is the visible outcome of the entire pipeline:

  1. Sitemap index is parsed and child sitemaps expanded.
  2. Content URLs are collected and stored.
  3. Keywords are extracted from each page.
  4. Keywords are mapped back to their originating pages.
  5. The Domain Map presents this mapping in a user-facing view.
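Steps 3 and 4 above amount to building an inverted index from keyword to the pages it appears on. A minimal sketch follows; the `extract_keywords` function here is a deliberately naive placeholder, since this document does not specify the pipeline's actual extraction logic.

```python
from collections import defaultdict

def extract_keywords(text):
    """Placeholder extraction: lowercase alphabetic words of 4+ letters.
    The production pipeline's real extraction logic is assumed, not shown.
    """
    return {w for w in text.lower().split() if len(w) >= 4 and w.isalpha()}

def build_domain_map(pages):
    """Map each keyword back to the URLs of the pages it was found on.

    `pages` is a dict of {url: page text}; the result is the structure a
    Domain Map view could render: {keyword: sorted list of page URLs}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for kw in extract_keywords(text):
            index[kw].add(url)
    return {kw: sorted(urls) for kw, urls in index.items()}
```

Clicking a keyword in the Domain Map then corresponds to looking up one key in this index and listing its pages.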

Validation

By visiting the Domain Map linked to a partner domain, one can verify:

  • That the sitemap ingestion process correctly discovered all relevant pages.
  • That keyword extraction is functioning as intended.
  • That the keyword-to-page mapping is accurate and complete.

Conclusion

The Domain Map acts as the “proof of the pudding” — a direct, navigable representation of how sitemaps, content URLs, and keyword indexing converge into a coherent partner domain listing. It is both a diagnostic tool and a demonstration of system integrity.

