If I crawl the full website data of nuget.org, is there any potential issue? #10405

tonyqus · 2025-04-15T18:33:00Z

tonyqus
Apr 15, 2025

As discussed in #10221, I'm going to write a spider program to crawl the full site data of nuget.org. The goal is to build a new NuGet trends website with advanced statistics charts. https://nugettrends.com/ already exists, but it is a half-finished project and several features are not implemented. I checked its code 2 years ago.

The NuGet team has been very slow to work on advanced statistics tickets. Several of them have been open for years (some since 2017).

There are 43 open issues in the Area:Statistics category.

I'd like to know whether the NuGet team has any concerns about me crawling data from nuget.org. I understand that you cannot share your data directly because your database contains PII.

joelverhagen · 2025-04-15T18:52:35Z

joelverhagen
Apr 15, 2025
Collaborator

I would like to avoid a situation where a runaway automated crawling process causes undue load on our service. We have tools in place to mitigate some of these problems and, indeed, our site has been live for over a decade and has undoubtedly been crawled by many search index crawlers.

What data are you interested in collecting?

13 replies

joelverhagen Apr 15, 2025
Collaborator

Please identify your requests with a clear user agent string and be responsible with your request volume. If you'd like a review of your query patterns or of how you plan to gather data, please feel free to reach out so we can share our ideas.

Personally, I feel that it makes more sense to do things related to download counts inside the NuGet Trends project. NuGet Trends has a lot of adoption already and is part of the .NET Foundation.

It is up to you whether you want to create a new project. For other package analysis not related to download counts, a new project could make sense. But it sounds like you have some ideas on what to do next. Good luck!
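For readers following along, the advice above (a clear user agent string and a responsible request volume) could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the `USER_AGENT` value, the contact URL, the `Throttle` class, and the helper names are all hypothetical, and the two-second interval in the usage note is an arbitrary assumption rather than a limit stated by the NuGet team.

```python
import time
import urllib.request

# Hypothetical identifier; use your own project name and a real contact address.
USER_AGENT = "example-nuget-stats-crawler/0.1 (+https://example.com/contact)"


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_request(url):
    """Attach the identifying User-Agent header to every request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_get(url, throttle):
    """Fetch a URL after waiting for the throttle; returns the raw bytes."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()
```

A crawler would create one shared `Throttle(2.0)` and route every call through `polite_get`, so the request rate stays bounded no matter how many URLs are queued.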

tonyqus Apr 15, 2025
Author

I have been in the .NET Foundation for 4 years, and I have to complain that the foundation no longer really operates. Whether a project is in the foundation or not makes no difference. 70% of the DNF projects have died.

tonyqus Apr 15, 2025
Author

And I'm confident that my new NuGet trends website will win out eventually and become a good assistant for NuGet package owners like me.

Again, I rely heavily on nuget.org data. Two weeks ago, I finished a small task to solve the NPOI NuGet ranking issue.

tonyqus Apr 26, 2025
Author

It is not possible to enumerate all packages by crawling NuGet.org. Search browsing depth is limited.

I noticed that I can only get about the top 3000 packages with top(n)=10000. Is that the limitation you mentioned?

joelverhagen Apr 26, 2025
Collaborator

Yes. The catalog will provide the full index and is cached via CDN (superior performance).
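As a rough sketch of what reading the catalog instead of crawling search results might look like: the code below walks a catalog index, then only the pages and leaf items committed after a saved cursor. It is not from the thread. The `CATALOG_INDEX_URL` value is an assumption (the catalog location is normally discovered from the `Catalog/3.0.0` resource in the V3 service index), the `items`, `@id`, and `commitTimeStamp` field names are assumed from the documented catalog shape, timestamps are assumed to be UTC with a trailing `Z`, and every function name and the `USER_AGENT` value are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumed location; normally discovered from the V3 service index.
CATALOG_INDEX_URL = "https://api.nuget.org/v3/catalog0/index.json"
# Hypothetical identifier for the crawler.
USER_AGENT = "example-nuget-stats-crawler/0.1 (+https://example.com/contact)"


def parse_timestamp(value):
    """Parse a UTC timestamp such as 2015-02-01T06:22:45.8488496Z."""
    base, _, fraction = value.rstrip("Z").partition(".")
    # Pad or truncate the fraction to microseconds (6 digits).
    micros = int((fraction + "000000")[:6])
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def newer_items(document, cursor):
    """Items of an index or page committed after the cursor, oldest first."""
    items = [
        item for item in document["items"]
        if parse_timestamp(item["commitTimeStamp"]) > cursor
    ]
    return sorted(items, key=lambda item: parse_timestamp(item["commitTimeStamp"]))


def fetch_json(url):
    """Download and decode one JSON document, identifying the crawler."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def walk_catalog(cursor, fetch=fetch_json):
    """Yield catalog leaf entries committed after the cursor."""
    index = fetch(CATALOG_INDEX_URL)
    for page_ref in newer_items(index, cursor):
        # Pages with no commits after the cursor are never downloaded.
        page = fetch(page_ref["@id"])
        for leaf in newer_items(page, cursor):
            yield leaf
```

Saving the latest `commitTimeStamp` seen as the next cursor means later runs only download pages that changed, which keeps the request volume small.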

tonyqus · 2025-04-26T15:17:21Z

tonyqus
Apr 26, 2025
Author

@joelverhagen Can you help me check what is causing issue #10427?

0 replies
