Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The current order by id recommendation may be subtly introducing bugs. #274

Open
kitsunde opened this issue Nov 11, 2020 · 9 comments
Open

Comments

@kitsunde
Copy link

I would like to suggest removing https://rails.rubystyle.guide/#order-by-id as a recommendation, or at least rewording it under a chronological ordering headline and changing the recommendation so it's not causing unstable sorting issues.

The background is:

  • id (assumed to be an auto-incrementing sequence) will grant each row 1 unique id.
  • Databases don't guarantee order when ORDER BY is not explicitly dictating order.
  • created_at and such timestamp fields are not (commonly) unique fields.

This means that:

id,created_at
1,2020-01-01
2,2020-01-01

With ORDER BY created_at will get returned both as 1,2 and as 2,1 and this is the expected behaviour. These issues are quite common to find when doing pagination and finding the same record across multiple pages, but even repeatedly asking for the same records will cause records to swap places.

I think this style guide is mixing up presentation layer concerns like showing an order that makes sense to an end-user on a specific column (in this case created time, but it could be a title, updated time or any other non-unique column) - which is situational - with general application-level concerns.

If I wanted to really nitpick the recommendation then timestamps like created_at are not really more reliable than the sequence itself in terms of the chronological order of records since they are in rails commonly set by application servers where clocks have drifted, db calls arrive and execute at different speeds and native SQL like NOW() is frozen to the first invocation inside of a transaction across the duration of the active transaction (in pg anyways) which can take an arbitrarily long time. If anything the sequence is probably closer to the actual chronological order of the record being inserted.

If it's important to keep this recommendation then the id should probably be kept in the order call so queries remain deterministic I.e:

# bad
scope :chronological, -> { order(id: :asc) }

# good
scope :chronological, -> { order(created_at: :asc, id: :asc) }
@pirj
Copy link
Member

pirj commented Nov 11, 2020

Af far as I'm aware, created_at maps to TIMESTAMP DB type which has microsecond precision on Postgres and millisecond on MySQL by default. For MySQL Rails 6 is adding precision: 6 to timestamp columns to overcome the default.

More changes on the front of better handling of precision in Rails: [1, 2].

The case when timestamps would have identical microseconds are quite rare. Can't think of a case when one would want to use pagination to navigate through (quite lengthy supposedly) list of such models.

The current wording may be confusing as it contains two recommendations:

Don’t use the id column for ordering. The sequence of ids is not guaranteed

Use a timestamp column to order chronologically

Those two should not necessarily be followed together, and the latter is more an example of what the id ordering can be replaced with. But I agree with you that it would better hint that the ordering should be dictated by the business requirements.

Would you like to send a pull request to clarify this, @kitsunde?

Additionally, in some cases, e.g. for tree-structured comments where the primary sorting key may have gaps, sorting by such and id even in addition to a timestamp, would break pagination. It highly depends on the use case how to approach sorting.

@kitsunde
Copy link
Author

kitsunde commented Nov 11, 2020

I used a simple date for the sake of simplicity. But here is a fairly common concrete pattern:

now = Time.current
Book.insert_all([
  { title: "Rework", author: "David", created_at: now },
  { title: "Eloquent Ruby", author: "Russ", created_at: now }
])

Any batch operation pointing to a single timestamp, any operation that uses NOW() to create multiples in SQL would no matter precision guarantee the same timestamp. I would assume most applications would have this issue unless they are very small.

I'm happy to make a PR.

@pirj
Copy link
Member

pirj commented Nov 11, 2020

Again, I see what you mean, but again, for those bulk insert operations ordering by a tuple of created_at + id would result in exactly the same problems as just sorting by id, since created_at will be factored out.

I'd like to reiterate that the first part of the guideline, not sorting by id, still makes sense.
If you are up for rephrasing the second part of it, please send a pull request.

@rpbaltazar
Copy link

Again, I see what you mean, but again, for those bulk insert operations ordering by a tuple of created_at + id would result in exactly the same problems as just sorting by id, since created_at will be factored out.

But yet that would be the closest to the real chronological order.

I think that the juice is that chronological is highly dependent on business logic and implementation details, so probably the second part of the message should reinforce that idea and not only suggest the order by created_at.

@pirj
Copy link
Member

pirj commented Nov 12, 2020

Point is that the chronological order is just one of the possible use cases. Any other column combination might be a better option can be more desirable depending on business requirements. If we are talking of something that is not a blog for 15 mins. Chronological order is just one example of what should be used instead of ordering by id.

@pirj
Copy link
Member

pirj commented Feb 21, 2021

The current wording

Use a timestamp column to order chronologically

suggests ordering by created_at instead of the default ordering by id in the case when something is expected to be ordered chronologically. That is all that it tells.

created_at and such timestamp fields are not (commonly) unique fields.

I would disagree. Rails adds timestamps by default when the migration to create a table is generated.

Any batch operation pointing to a single timestamp, any operation that uses NOW() to create multiples in SQL would no matter precision guarantee the same timestamp.

It would be strange to expect those records to have any meaningful chronology for records created in a batch in such a way.

I agree with you that the current wording is misleading because it suggests using a timestamp column as the default ordering key and has no mention that it should be only used when the business logic requires chronological order.
A pull request to improve this is welcome.

@pirj pirj closed this as completed Feb 21, 2021
@mockdeep
Copy link

mockdeep commented Sep 13, 2021

I also find this recommendation to be problematic. As mentioned above, it's common for records to have the same created_at timestamp, as we often batch insert records. The only reliably unique way to order records is going to be by id, so I don't see why that shouldn't be recommended. This came up for us recently where a user had only 3 pages of results, but they were effectively coming up randomly, so they were not seeing some results and they saw duplicates of others. This bit us in the butt because it ended up in people not getting paid for their work.

I think this recommendation should be dropped from the guide, as it's very specific to what you're doing on the page. For pagination in particular, it's important that your ordering includes a unique column. We've added :id as the tiebreaker for all of our pagination regardless of other sorting.

Ordering by :created_at is also problematic in tests. It's caused us a great deal of flake when we're trying to assert against records being inserted in a particular order. Even if records are created in separate queries, it's not uncommon for them to end up with the same timestamp down to the microsecond. This is more a problem in longer test suite runs where Postgres has a greater tendency to return records in no particular order without a unique ordering.

@pirj
Copy link
Member

pirj commented Sep 14, 2021

@mockdeep Sounds reasonable.

What makes id better for pagination than created_at specifically is that it is unique, while created_at is not.
If the order of records returned by the same query when ordered by a non-unique column is not guaranteed by the three primary databases, we should not recommend created_at. Especially with the default timestamps precision being as low as milliseconds.
It's still ok to recommend against the id if there are good reasons to do so.

Do you think

The sequence of ids is not guaranteed to be in any particular order, despite often (incidentally) being chronological

is incorrect?

In the Rails world, using SERIAL is almost certain.
Even in multi-master DB setup and gaps in the sequence, the ordering of id should still be stable and nearly chronological.

Can't find any other justifications for this guideline in its current form off the top of my head.

Side note: "order by id" is confusing, as the guideline says exactly the opposite.

@mockdeep
Copy link

mockdeep commented Sep 15, 2021

Do you think

The sequence of ids is not guaranteed to be in any particular order, despite often (incidentally) being chronological

is incorrect?

@pirj I think it's maybe a little misleading. It is guaranteed to be in the order of the ids. The caveat I think it's trying to get at is that it might not be strictly chronological, but for the vast majority of Rails apps it is probably equivalent unless they're using UUIDs.

I can't think of a good general principle for this. There could maybe be a couple of guidelines, but I can't imagine how it could possibly be linted via RuboCop. Basically, it comes down to "be aware of the implications of how you order records in your application". But those implications can vary from application to application, and page to page in an application. Many of our tables allow sorting on arbitrary columns, and each of them should have a stable tie breaker when the sort property is not unique. Sorting by status is a common example in our codebase.

@pirj pirj reopened this Sep 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants