Danbooru

Failbooru

Posted under General

ultima said:
For tag counts, maybe set a threshold so after a tag reaches a certain number of posts (1,000? 5,000? 10,000?) it stops counting them individually and just says "1000+" or whatever. This way the tags with low numbers still stick out, which is important for reasons RaisingK and Cyberia-Mix mentioned.

And how'd that help with the DB load? Things don't become faster just by being less useful.

It might help... if PostgreSQL makes any kind of distinction like that, which I doubt. AFAIK SQL doesn't even have any facility to understand fallbacks like that, so in essence you'd be asking for the total tag count, and then throwing it away if it was larger than 10000, which is obviously not useful. But people not familiar with SQL might not know that you can't micromanage how a database executes queries.

葉月 said:
And how'd that help with the DB load? Things don't become faster just by being less useful.

He means, when a tag reaches, say, a thousand posts, the site can stop calculating its number (thus not using any resources for the calculation) and say "1000+". No idea if that's actually a helpful idea, but it seems like what he's proposing.

Magus said:
He means, when a tag reaches, say, a thousand posts, the site can stop calculating its number (thus not using any resources for the calculation) and say "1000+". No idea if that's actually a helpful idea, but it seems like what he's proposing.

Hazuki was saying that it wouldn't be helpful, because the time sink is in the database lookup, not in any calculations regarding the number itself.

Fencedude said: Ok, we absolutely have to have the numbers next to the tags on the post page. Thats just a fundamental requirement.

葉月 said: And we absolutely have to have the user stats aggregated somewhere. Seeing the counts for comments, forum posts, uploads,tag edits of the user is absolutely essential to moderation.

Agreed very very very very very very very strongly on both points.

Yes, easy for me so say when I don't have to code it or pay the server bills, I know. But these are utterly fundamental aspects of the navigation and use of danbooru for me.

I think it'd be just as important for a new user to see those numbers on the post pages as well. One incredibly easy way to tell if you used the wrong tag or made a typo when editing tags is to check and see if the number next to it comes up as "1".

jxh2154 said:
Agreed very x7 strongly on both points etc.

I support this stance also - for me, tag counts are critical to navigating and enjoying such a vast site as this one.

I suppose that all I can do is wish the admins the best of luck and patience in working out all the bugs.

The more I work without the statistics, the more I miss them. If they were the bottleneck though, I'm not sure what the best solution would be. Unlike the user stats, the related tags were already being cached, and with the new algorithm wouldn't be accurate even if displayed anymore.

Maybe just reducing the frequency of the caching would have helped? If we cached the base tag frequencies once a day or so, grabbing that number to display on the post view pages shouldn't take any more load than it takes to get the tag type to set the colors. Related tags for search queries would be another issue though.

I'm definitely seeing less failbooru's now though, (though Hazuki had noted he's still having issues with profiles).

Shinjidude said: I'm definitely seeing less failbooru's now though, (though Hazuki had noted he's still having issues with profiles).

It's still pretty prevalent on the implications page, though not 100% of the time anymore.

My big use outta the tag counts is for typos. Perhaps an indicator for when a tag you've added is the only one on any non-deleted posts? Like a star or arrow or highlight, anything, so long as it's not worse a strain than the numbers themselves.

T5J8F8 said:
My big use outta the tag counts is for typos. Perhaps an indicator for when a tag you've added is the only one on any non-deleted posts? Like a star or arrow or highlight, anything, so long as it's not worse a strain than the numbers themselves.

Argh, stop that. Either you have a clue about the DB, and then you offer actual, solid, backed-up advice about the DB matters, or you shut up until albert speaks up and offers some explanations/data to reason about. The clue-free ideas you're throwing around right now are all:

1. Worthless
2. Hacking around the temporary bandaid measures as if they were set in stone, instead of fixing the actual issue, which is server is goddamn slow
3. Spam which makes it even less likely that albert will come back to the thread and notice the technical discussion, such as the request to share the logs allowing us make some informed choices and proposals

This holds for everyone: if you don't a have real understanding of the DB which you use to give solid advice, and you're not reporting a new issue, shut up and wait until albert comes back. We don't need any more me-too's and blind guesses.

Sorry I haven't responded much, I've been busy with work.

I know for a fact that the recent tag changes helped with DB load because CPU load went from 20 on average down to 5. As for what caused the sudden spike, I can't tell you because I've only been monitor the database logs recently. Traffic is increasing, however.

There's no disk swapping going on. IOwait is low. CPU load is low. Bandwidth is no longer being throttled. If you have any other ideas for what might be degrading performance or queries I should run let me know.

I will readd counts. But you should also know there's no way for me to get an accurate count for related tags. The numbers you'll see will work similarly to what you see on the front page currently. Basically they'll only count the 200 most recent posts for that tag.

For the user profiles, real time reports are hard to do. The tables are too big at this point. Post versions has 3.4 million rows, favorites has nearly 8 million rows. Caching is hard because when you view someone's profile, especially if it's restricted to janitors only, you're probably the first person to view it. Probably the best solution is to offset stat collection to the daemon and cache the page somewhere. That's a lot of work however. Would be great if someone could write a patch ^^;;

Which statistics do janitors need the most? If I take favorite count out that'd help a lot.

albert said:
Which statistics do janitors need the most? If I take favorite count out that'd help a lot.

We'd need to keep the post count, deleted post count, and total tag edits for sure. Note, wiki, and pool counts are also pretty important for catching vandals (or contributors), too.

As for the total comment count, we can probably do without, since we usually catch retarded comments when browsing the site regularly, and then viewing a user's profile afterwards.

Updated

albert said:
Which statistics do janitors need the most? If I take favorite count out that'd help a lot.

Posts and deletions, mostly. Note updates would be nice for ideas on who to keep an eye on for invitations as translators.

albert said:
There's no disk swapping going on. IOwait is low. CPU load is low. Bandwidth is no longer being throttled. If you have any other ideas for what might be degrading performance or queries I should run let me know.

That's odd. I don't want to sound stupid by telling you what you know perfectly well, but are all the tables indexed where it matters, vacuumed, you have taken a deep look at EXPLAIN to make sure it's not doing something stupid? No strange or experimental operation modes? Perhaps materialised views? AFAIR pgsql doesn't have bona fide materialised views, but people have been faking them with triggers. This message in particular mentions count() as the source of slowness.

Also, there's pgmemcache, maybe that could help by offloading some queries.

Which statistics do janitors need the most? If I take favorite count out that'd help a lot.

Favourite counts are least important. Everything else is quite crucial as it lets us catch vandals / spammers / tag gardeners quickly.

I agree that posts, deletions, and notes are the most important, favorites the least, but as Hazuki says they are all useful, and illustrate different metrics to gauge user involvement.

Is there any technical reason we can't maintain the counts in the same table alongside user name and id? (Forgive me for not really knowing Danbooru's internal DB schema. There wouldn't happen to be a data dictionary available somewhere, would there?)

If we did that, gathering the info for a profile listing would be constant rather than requiring a bunch of expensive queries.

We could maintain these by running the expensive queries periodically (maybe once a day, although even that might be a bit much given the number of users) and caching them.

Perhaps better, we could keep an accurate and up-to-date count by setting a trigger to increment/decrement them with every function that adds/removes records from the relevant tables. That would add a constant cost per update, but would save a lot of work when it came to collecting the summary counts.

This would be most effective for actions that are done more infrequently (I imagine for most active contributors favorites would be less frequent than tag updates). If favorite counts are one of the more expensive queries to run, this would be doubly beneficial.

The additional work would also be proportional to a user's activity. There might be some significant additional load added for very active users that do a lot, but on the other hand in the long tail of less active members, you would see a lot fewer actions, and thus less additional cost. The stat retrieval would be the same low cost for everyone.

Updated

I would like very much to get rid of COUNT() calls but people here seem adamant on keeping it! Specifically, completely getting rid of the numbered paginator would spare me the cost of doing a COUNT() on multitag searches, among other things. I would also like to ban negated-tag-only searches since that would finally let me get rid of the posts_tags table (now at 8 million rows). Favorites right now are slow (also 8 million rows). I will probably have to backport the partitioned favorites implementation from Danbooru2.

A batch for caching stats for everyone, even on a nightly basis, isn't feasible with the number of users and the lack of any pattern on who you want to look up. And I'm loathe to add extra overhead to common functionality just to benefit a few users. I think taking out fav counts will help since that involved an ugly join, and database loads are lower now then what they were.

For profiles, I think that Posts, Deleted, and Tags are the only ones we "need" on hand at all times. Would it be possible to implement a system for the others where it will only fetch them if we specifically request it? Would that help keep the overhead down?

Or perhaps making it so only Janitors and up can see the actual active tag counts?

I really have no idea how this stuff works, but those are my suggestions.

Also, I don't think we really need the tag counts for the "related" tags when you do a search, in that situation what is more important is that the tags are related, not how many posts each specifically has. We only need to know the actual tag count (with whatever caching you have to do) on the actual Post page itself.

albert said:
completely getting rid of the numbered paginator

Is this something people have shown opposition towards? The only reason I ever use the numbered paginator is to find the earliest posts under a given tag, which an order:date_asc metatag could do just as well.

Also, you may have already seen this mentioned in forum #39091 or elsewhere, but I'll bring it up again here: one way of maintaining functionality while reducing load would be AJAXing in requested statistics dynamically.

1 2 3 4 5 6 7 8 9 12