feat(sitemap): add lastmod to extracted page output #301

Merged
nikitachapovskii-dev merged 4 commits into master from chore/add-lastmod-datetime-to-output
Jun 29, 2026

@nikitachapovskii-dev nikitachapovskii-dev commented Jun 25, 2026

Contributor

Each page in the Sitemap Extractor output now carries a lastmod field with the page's last-modification time.

The lastmod is taken from two sources, in order of preference:

1. The `<lastmod>` tag declared for the URL in the sitemap.
2. The Last-Modified response header, used only when the sitemap has no `<lastmod>`.

If neither is present, lastmod is null.

Output example

{
  "url": "https://example.com/page",
  "status": 200,
  "lastmod": "2024-01-01T00:00:00.000Z"
}
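
The order of preference described above can be sketched as a small helper. This is only an illustration under stated assumptions, not the code from this PR: `resolveLastmod` and `toIso` are hypothetical names, and the handling of a malformed value (returning null) is an assumption, since the description does not say what the actor does in that case.

```typescript
// Hypothetical helpers; names and behavior on malformed input are assumptions.

/** Parse a date string and return it as ISO 8601, or null if missing or unparseable. */
function toIso(value: string | null | undefined): string | null {
    if (!value) return null;
    const time = Date.parse(value);
    return Number.isNaN(time) ? null : new Date(time).toISOString();
}

/**
 * Prefer the sitemap <lastmod> value; consult the Last-Modified header
 * only when the sitemap declares no <lastmod> for the URL.
 */
function resolveLastmod(
    sitemapLastmod?: string | null,
    lastModifiedHeader?: string | null,
): string | null {
    if (sitemapLastmod) return toIso(sitemapLastmod);
    return toIso(lastModifiedHeader);
}

// Sitemap value wins even when the header is present.
console.log(resolveLastmod("2024-01-01", "Tue, 02 Jan 2024 00:00:00 GMT"));
// Header is used as the fallback.
console.log(resolveLastmod(null, "Mon, 01 Jan 2024 00:00:00 GMT"));
// Neither source available.
console.log(resolveLastmod(undefined, undefined));
```

Keeping the fallback in one function makes the "sitemap first, header second, otherwise null" rule easy to test in isolation.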

Closes #281

@nikitachapovskii-dev

Contributor Author

@nicklamonov please check readme

@nikitachapovskii-dev nikitachapovskii-dev self-assigned this Jun 25, 2026
@nicklamonov

Contributor

Closes #281, not #301

  "type": "string",
  "title": "Results",
- "description": "Dataset items with discovered URLs and crawl status.",
+ "description": "Dataset items with discovered URLs, crawl status, and last-modification info.",

Contributor

Rather "date-time of last modification."

  - **Efficiency:** Uses HTTP HEAD requests for URL validation, which are significantly faster and consume less bandwidth than full GET requests.
  - **Proxy Support:** Integrated with Apify Proxy to prevent rate limiting or blocking during the discovery phase.
- - **Detailed Output:** Provides the final URL and the corresponding HTTP status code.
+ - **Detailed Output:** Provides the final URL, the corresponding HTTP status code, and the page's last-modification time.

Contributor

same as above

@nicklamonov

Contributor

Readme is ok (besides a small nit mentioned).

If you could build it for testing under some tag, I should be able to validate the functionality too.

Thank you!

@nikitachapovskii-dev

nikitachapovskii-dev commented Jun 26, 2026

Contributor Author

@nicklamonov https://console.apify.com/admin/users/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/source
Please test under 0.0.
It has a different branch (forked revive-monorepo + these exact changes) so it can be built on the platform.

@ruocco-l ruocco-l left a comment

Collaborator

I was about to say that if there is no information about it, we would end up with every item having null, which is not really pleasing. That could be avoided by using undefined, which will not create the column in the dataset if no item has the property. But I guess printing null is still valid information, and it does not leave you wondering when you have it in one run and not in the others. LGTM 👍
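
The null versus undefined trade-off mentioned in this comment comes down to how JSON serialization treats the two values. A minimal sketch, assuming dataset items are stored as JSON (the item shapes below mirror the output example and are only illustrative):

```typescript
// Illustrative items; only the lastmod value differs.
const withNull = { url: "https://example.com/page", status: 200, lastmod: null };
const withUndefined = { url: "https://example.com/page", status: 200, lastmod: undefined };

// JSON.stringify keeps null-valued properties but omits undefined-valued ones.
const nullJson = JSON.stringify(withNull);
const undefinedJson = JSON.stringify(withUndefined);

console.log(nullJson);      // {"url":"https://example.com/page","status":200,"lastmod":null}
console.log(undefinedJson); // {"url":"https://example.com/page","status":200}
```

With null, every item has the same set of keys across runs; with undefined, the key is absent whenever no value was found.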

@nicklamonov

Contributor

Small question:
Are we able to change the order of fields in the Output in Console?

Now it's like this:

  1. lastmod
  2. status
  3. url

But ideally I'd reverse them, so they are displayed in the following priority (like in the JSON):

  1. url
  2. status
  3. lastmod

Also, is it ok that we don't have lastmod in the dataset schema (in actor.json)? Is it because we may not populate it if all values are null, or was it just missed?

Otherwise, from a functional perspective it looks good.
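
One possible way to get the requested field order is an explicit dataset view. The sketch below writes such a view as a TypeScript object so it can be checked and then serialized to JSON. It is an assumption, not the change made in this PR: the key names follow the Apify dataset schema format as best recalled and should be verified against the official documentation, and the labels and formats are hypothetical.

```typescript
// Hypothetical dataset view; verify key names against the Apify dataset schema docs.
const datasetSchema = {
    actorSpecification: 1,
    views: {
        overview: {
            title: "Overview",
            transformation: {
                // Assumed to control which fields appear and in what order.
                fields: ["url", "status", "lastmod"],
            },
            display: {
                component: "table",
                properties: {
                    url: { label: "URL", format: "link" },
                    status: { label: "Status", format: "number" },
                    lastmod: { label: "Last modified", format: "date" },
                },
            },
        },
    },
} as const;

// Serialize for use as the dataset schema content.
console.log(JSON.stringify(datasetSchema, null, 2));
```

Listing lastmod in the view would also answer the schema question above, since the field would be declared even when every value is null.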

@nikitachapovskii-dev

Contributor Author

https://console.apify.com/view/runs/xcMCso8kBMsgGgmUK

cc @nicklamonov

@nicklamonov nicklamonov left a comment

Contributor

Great!
Thanks!
Looks good.

@nikitachapovskii-dev nikitachapovskii-dev merged commit 8c7fddf into master Jun 29, 2026
2 of 4 checks passed
@nikitachapovskii-dev nikitachapovskii-dev deleted the chore/add-lastmod-datetime-to-output branch June 29, 2026 11:56
Development

Successfully merging this pull request may close these issues.

Sitemap Extractor: Add lastmod datetime to output

4 participants