feat(sitemap): add lastmod to extracted page output #301
@nicklamonov please check readme
  "type": "string",
  "title": "Results",
- "description": "Dataset items with discovered URLs and crawl status.",
+ "description": "Dataset items with discovered URLs, crawl status, and last-modification info.",
Rather "date-time of last modification."
  - **Efficiency:** Uses HTTP HEAD requests for URL validation, which are significantly faster and consume less bandwidth than full GET requests.
  - **Proxy Support:** Integrated with Apify Proxy to prevent rate limiting or blocking during the discovery phase.
- - **Detailed Output:** Provides the final URL and the corresponding HTTP status code.
+ - **Detailed Output:** Provides the final URL, the corresponding HTTP status code, and the page's last-modification time.
Readme is ok (besides a small nit mentioned). If you could build it for testing under some tag, I should be able to validate the functionality too. Thank you!
@nicklamonov https://console.apify.com/admin/users/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/source
ruocco-l left a comment
I was about to say that, when no last-modification information is available, every item would end up with null, which is not very pleasing. That could be avoided by using undefined, which will not create the column in the dataset if no item has the property. But I guess printing null is still valid information, and it does not leave you wondering when you have it in one run and not in others. LGTM 👍
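To illustrate the null versus undefined point, here is a minimal sketch of how the two values behave under plain JSON serialization. The item shape (`url`, `status`) is an assumption made for illustration; only `lastmod` comes from this PR. The sketch shows general `JSON.stringify` behaviour and does not verify how the dataset itself builds its columns.

```typescript
// Hypothetical dataset items; only the `lastmod` field name comes from the PR.
const withNull = { url: 'https://example.com/', status: 200, lastmod: null };
const withUndefined = { url: 'https://example.com/', status: 200, lastmod: undefined };

// JSON.stringify keeps properties whose value is null...
const nullJson = JSON.stringify(withNull);
// ...and omits properties whose value is undefined.
const undefinedJson = JSON.stringify(withUndefined);

console.log(nullJson);      // {"url":"https://example.com/","status":200,"lastmod":null}
console.log(undefinedJson); // {"url":"https://example.com/","status":200}
```

With null, the key is present in every item, so a missing value can be told apart from a missing field.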
Small question: Now it's like this:
But ideally I'd reverse them, so they are displayed in the following priority (like in the JSON):
Also, is it ok that we don't have Otherwise, from a functional perspective it looks good.
nicklamonov left a comment
Great!
Thanks!
Looks good.
Each page in the Sitemap Extractor output now carries a lastmod field with the page's last-modification time.
The lastmod is taken from two sources in order of preference:
1. The <lastmod> tag declared for the URL in the sitemap.
2. The Last-Modified response header, used only when the sitemap has no <lastmod>.
If neither is present, lastmod is null.
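
The order of preference above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's actual code: the `SitemapEntry` type, the `resolveLastmod` function, and the plain lowercase-keyed header map are all hypothetical names introduced here. The sketch returns the raw string from whichever source is used; whether the real implementation normalizes the date format is not stated in this PR.

```typescript
// Hypothetical shape of one <url> entry parsed from a sitemap.
interface SitemapEntry {
    loc: string;
    lastmod?: string; // text of the <lastmod> tag, if the sitemap declares one
}

// Assumption: response headers are given as a plain map with lowercase names.
type ResponseHeaders = Record<string, string | undefined>;

// Picks lastmod in order of preference:
//   1. the sitemap's <lastmod> value,
//   2. the Last-Modified response header,
//   3. null when neither is present.
function resolveLastmod(entry: SitemapEntry, headers: ResponseHeaders): string | null {
    const fromSitemap = entry.lastmod?.trim();
    if (fromSitemap) return fromSitemap;

    const fromHeader = headers['last-modified']?.trim();
    return fromHeader ? fromHeader : null;
}
```

Empty or whitespace-only values are treated as absent here, which is a design choice of this sketch rather than something the PR specifies.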
Output example
Closes #281