
RFC: add support for LU factorization in the linalg extension #627

Open
ogrisel opened this issue May 5, 2023 · 50 comments · May be fixed by #630
Labels: Needs Discussion, RFC, topic: Linear Algebra


ogrisel commented May 5, 2023

It seems that many libraries that are candidates to implement the Array API namespace already implement the LU factorization (with variations in API and with the notable exception of numpy).

However, LU is not part of the list of linear algebra operations in the current state of the spec.

Are there any plans to consider it for inclusion?


rgommers commented May 8, 2023

Thanks for asking @ogrisel. I had a look through the initial issues which considered the various linalg APIs, and LU decomposition was not considered at all there. The main reason, I think, is that the overview started from what is present in NumPy and then looked at matching APIs in other libraries.

I think it's an uphill battle for now. It would require adding it to numpy.linalg, moving it in other libraries with numpy-matching APIs (e.g., https://docs.cupy.dev/en/stable/reference/scipy_linalg.html#decompositions is in the wrong place), and then aligning on APIs also with PyTorch & co. Finally, there's lu but also lu_solve and lu_factor - would it be just one of those, or two, or all three?

It seems to me that LU decomposition is important enough that it's worth working on. So we could figure out what the optimal API for it would be, and then add it to array-api-compat so it can be used in scikit-learn and SciPy. That can be done on pretty short notice, I think. From there to actually standardizing it would take quite a long time, I suspect (but nothing is really blocked on not having that done).


rgommers commented May 17, 2023

The signatures to consider:

  • SciPy/cupyx.scipy/jax.scipy: lu(a, permute_l=False, overwrite_a=False, check_finite=True)
  • PyTorch: torch.linalg.lu(A, *, pivot=True, out=None)

The overwrite_a, check_finite and out keywords should all be out of scope for the standard.

The permute_l/pivot keywords do seem relevant to include. Each controls the return values, but in a different way. SciPy's permute_l returns 3 arrays if False and 2 arrays if True. That breaks a key design rule for the array API standard (no polymorphic APIs), so we can't do that. PyTorch's behavior is okay: it always returns a named tuple (P, L, U), leaving P as an empty array for the non-default pivot=False.

PyTorch defaults to partial pivoting, and the keyword allows no pivoting. An LU decomposition with full pivoting is also a thing mathematically, but it does not seem implemented. JAX also has jax.lax.linalg.lu, which only does partial pivoting.

So it seems like lu(x, /) -> namedtuple(array, array, array) with partial pivoting as the default is the minimum needed; the question is whether the other pivoting mode(s) is/are needed.
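To make this concrete, here is a minimal sketch of what such a function could look like as a wrapper over SciPy's default partial-pivoting LU (a hypothetical wrapper, not part of any released API; the name LUResult is illustrative):

```python
# Hypothetical sketch of the proposed lu(x, /) -> namedtuple(P, L, U) API,
# built on scipy.linalg.lu with its default permute_l=False, so x == P @ L @ U.
from collections import namedtuple

import numpy as np
import scipy.linalg

LUResult = namedtuple("LUResult", ["P", "L", "U"])

def lu(x):
    # scipy.linalg.lu returns (P, L, U) with partial pivoting by default
    P, L, U = scipy.linalg.lu(x)
    return LUResult(P, L, U)

X = np.random.default_rng(0).standard_normal((4, 3))
res = lu(X)
assert np.allclose(res.P @ res.L @ res.U, X)
```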

@rgommers

dask.array.linalg.lu has no keywords at all, and no info in the docstring about what is implemented. From the tests it's clear that it matches the SciPy default (permute_l=False).

@rgommers

For PyTorch, the non-default flavor is only implemented on GPU:

>>> A = torch.randn(3, 2)
... P, L, U = torch.linalg.lu(A)
>>> A = torch.randn(3, 2)
... P, L, U = torch.linalg.lu(A, pivot=False)
Traceback (most recent call last):
  Cell In[6], line 2
    P, L, U = torch.linalg.lu(A, pivot=False)
RuntimeError: linalg.lu_factor: LU without pivoting is not implemented on the CPU

Its docstring also notes: The LU decomposition without pivoting may not exist if any of the principal minors of A is singular.

tl;dr maybe the best way to go is to only implement partial pivoting?


ogrisel commented May 17, 2023

Maybe we can start with a function with no arguments that always returns P, L, U (that is, only implement SciPy's permute_l=False and torch's pivot=True), and it will be up to the consumer to compute the P @ L product if needed.

On the other hand, I think it would be good to have an option to do the PL product automatically and avoid allocating P. Should the array API expose two functions: xp.linalg.lu that outputs a 3-tuple (P, L, U), and a second function xp.linalg.permuted_lu that precomputes the PL product and always outputs a 2-tuple (P @ L, U)?

Its docstring also notes: The LU decomposition without pivoting may not exist if any of the principal minors of A is singular.

Also note, from PyTorch's doc:

The LU decomposition is almost never unique, as often there are different permutation matrices that can yield different LU decompositions. As such, different platforms, like SciPy, or inputs on different devices, may produce different valid decompositions.

Such a disclaimer should probably be mentioned in the Array API spec.


ogrisel commented May 17, 2023

Note that scipy.linalg.lu calls:

flu, = get_flinalg_funcs(('lu',), (a1,))
p, l, u, info = flu(a1, permute_l=permute_l, overwrite_a=overwrite_a)

and flu is therefore not polymorphic internally, but p is a 1x1 array containing a single 0 value when permute_l is True.

rgommers added a commit to rgommers/array-api that referenced this issue May 18, 2023
Only the default (partial pivoting) algorithm that is implemented
in all libraries and for all devices is added here. data-apisgh-627
has details on the no-pivoting case, but it's not universally supported
and the only reason to add it would be that it's more performant in
some cases where users know it will be numerically stable.
Such an addition can be done in the future, but it seems like a
potentially large amount of work for implementers for limited gain.

Closes data-apisgh-627
rgommers linked a pull request May 18, 2023 that will close this issue
@rgommers

@ogrisel I opened gh-630 for the default (partial pivoting) case that seems supportable by all libraries.

On the other hand, I think it would be good to have an option to do the PL product automatically and avoid allocating P. Should the array API expose two functions: xp.linalg.lu that outputs a 3-tuple (P, L, U), and a second function xp.linalg.permuted_lu that precomputes the PL product and always outputs a 2-tuple (P @ L, U)?

Perhaps. The alternative of having an empty P like PyTorch does may work, but it's not ideal. JAX would have to preallocate a full-size array in case a keyword is used and it's not literal True/False.

Given that this use case seems more niche, that it's not supported by Dask or by PyTorch on CPU, and that you don't need it now in scikit-learn, waiting for a stronger need seems like the way to go here though.


ogrisel commented May 19, 2023

We do use the "permute_l=True" case in scikit-learn.


ogrisel commented May 19, 2023

It would be easy to provide a fallback implementation that uses an extra temporary allocation + mm product for libraries that do not natively support scipy's permute_l=True.

But it's not clear if PyTorch's pivot=False is equivalent to SciPy's permute_l=True or doing something different.
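The fallback described above can be sketched with SciPy alone on CPU, where both modes exist: emulate permute_l=True on top of a plain three-output (P, L, U) factorization with one extra temporary allocation and a matmul (a hedged sketch, relying on both calls hitting the same deterministic LAPACK factorization):

```python
# Emulating SciPy's permute_l=True from the three-output LU:
# one extra temporary array (P @ L) plus one matrix product.
import numpy as np
from scipy.linalg import lu

X = np.random.default_rng(0).standard_normal((6, 4))

P, L, U = lu(X)                        # library-native three-output LU
PL = P @ L                             # extra temporary + matmul

PL_ref, U_ref = lu(X, permute_l=True)  # SciPy's native two-output flavor
assert np.allclose(PL, PL_ref)
assert np.allclose(U, U_ref)
```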

@rgommers

We do use the "permute_l=True" case in scikit-learn.

Oops yes, I didn't read well.

But it's not clear if PyTorch's pivot=False is equivalent to SciPy's permute_l=True or doing something different.

It looks like it's doing the exact same thing:

>>> import torch
>>> import numpy as np
>>> from scipy import linalg

>>> t = torch.rand(20, 10, device='cuda')  # must be 'cuda', because `pivot=False` is not supported on CPU
>>> Pt, Lt, Ut = torch.linalg.lu(t.to('cuda'), pivot=False)
>>> Lt.shape
torch.Size([20, 10])
>>> Ut.shape
torch.Size([10, 10])
>>> Pt
tensor([], device='cuda:0')

>>> Ln, Un = linalg.lu(t.to('cpu').numpy(), permute_l=True)
>>> Ln.shape
(20, 10)
>>> Un.shape
(10, 10)
>>> np.testing.assert_allclose(Lt.to('cpu'), Ln)
...
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 190 / 200 (95%)
Max absolute difference: 56.958973
Max relative difference: 181.88818

>>> (Lt @ Ut - t).max()
tensor(2.1011e-06, device='cuda:0')
>>> (Ln @ Un - t.to('cpu').numpy()).max()
1.1920929e-07

So both give valid decompositions with the same shapes for L and U, and P either empty or not returned; numerical values are very different but that's expected.

One other thing to point out, the non-pivoting case does result in a decomposition with larger errors for PyTorch:

>>> Pt, Lt, Ut = torch.linalg.lu(t)
>>> (Pt @ Lt @ Ut - t).abs().max()
tensor(1.4901e-07, device='cuda:0')
>>> Pt, Lt, Ut = torch.linalg.lu(t, pivot=False)
>>> (Lt @ Ut - t).abs().max()
tensor(2.1011e-06, device='cuda:0')

>>> Pn, Ln, Un = linalg.lu(x)
>>> np.abs((Pn @ Ln @ Un - x)).max()
1.1920929e-07
>>> PLn, Un = linalg.lu(x, permute_l=True)
>>> np.abs((PLn @ Un - x)).max()
1.1920929e-07

SciPy's default (permute_l=False) decomposition seems to be a lot faster than permute_l=True for small arrays, has similar performance for large square or wide arrays, and is several times slower for tall arrays:

>>> %timeit linalg.lu(x)  # shape (20, 10)
8.68 µs ± 38.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
21.7 µs ± 80.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> %timeit linalg.lu(x)  # shape (400, 300)
951 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
832 µs ± 7.62 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit Pn @ Ln
190 µs ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

>>> x = np.random.randn(1000, 1000)
>>> %timeit linalg.lu(x)
8.49 ms ± 9.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
7.94 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> x = np.random.randn(1000, 50)  # tall array
>>> %timeit linalg.lu(x)
734 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
250 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> x = np.random.randn(50, 1000)  # wide array
>>> %timeit linalg.lu(x)
220 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
222 µs ± 4.48 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Given that scikit-learn does use permute_l=True and I suspect the tall array case is quite common for ML applications, there does seem to be a case for providing both flavors.

@rgommers

For completeness, some PyTorch GPU timings:

>>> t = torch.rand(1000, 1000, device='cuda')  # square tensor
>>> %timeit torch.linalg.lu(t)
7.11 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit torch.linalg.lu(t, pivot=False)
1.28 ms ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

>>> t = torch.rand(1000, 50, device='cuda')  # tall tensor
>>> %timeit torch.linalg.lu(t)
2.15 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit torch.linalg.lu(t, pivot=False)
343 µs ± 6.58 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

>>> t = torch.rand(50, 1000, device='cuda')  # wide tensor
>>> %timeit torch.linalg.lu(t)
1.21 ms ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit torch.linalg.lu(t, pivot=False)
488 µs ± 800 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The performance gain is even larger there. So we have a default algorithm that's slower but stable, and an opt-in to a significantly faster algorithm.

@rgommers

@lezcano @nikitaved it'd be nice to get your opinion on this one. I can't find a PyTorch feature request for LU decomposition with pivot=False on CPU, despite it being much faster potentially. Is there a reason why it's only implemented on GPU?


nikitaved commented May 19, 2023

@rgommers, if there is something for the GPU which is missing for the CPU in PyTorch, that is most likely because of GPU performance/functionality having much higher value... Especially considering the usage in DL applications. However, sometimes PyTorch is too strict and follows suit, like with SVD, iirc. PyTorch produces the full SVD by default, even though it is not required in most cases, and that hurts GPU performance overall... I, personally, would not mind having pivot=False for the CPU in PyTorch. We could create a feature request there, but I doubt the core team will be interested. It might be a very nice on-boarding task though...


lezcano commented May 19, 2023

pivot=False is often quite risky, as the class of matrices for which the decomposition exists has a not-so-nice structure (all the principal minors must be away from zero). For example, it does not exist for a matrix A with A[0,0] = 0. With this limitation, there are very few families of matrices for which this property holds. One of the few families for which this condition holds is positive definite matrices (by Sylvester's criterion), but then the LU decomposition is just good ol' Cholesky, for which we have a specific algorithm that's 2x faster.

Partial pivoting is often a good compromise between stability and speed when compared to, say, full pivoting or other more general decompositions. For these reasons, I don't think that there are many practical cases where you would want a regular LU decomposition.


nikitaved commented May 19, 2023

For example, it does not exist for a matrix A with A[0,0] = 0. With this limitation, there are very few families of matrices for which this property holds.

I would not say few, there are plenty of matrices that satisfy such conditions, like full-rank matrices with dominant diagonals. And in most cases diagonals are pre-conditioned anyway. Besides, pivot=True is the default, so I would assume the user is aware of the structure of matrices at hand when pivot=False is chosen.

Partial pivoting is often a good compromise between stability and speed when compared to, say, full pivoting or other more general decompositions. For these reasons, I don't think that there are many practical cases where you would want a regular LU decomposition.

There are cases when a single decomposition of a non-symmetric matrix leads to solving linear systems when the rhs comes in a stream, like in NNs that solve a system during training, and then (or while) the corresponding weight matrix could be decomposed to be able to do efficient inference.


lezcano commented May 19, 2023

I don't think the diagonal case you mentioned is correct, but it's true that there is at least one family of matrices I know of for which the LU decomposition exists: column diagonally-dominant matrices.

At any rate, in most cases you do want partial pivoting for stability reasons, so yeah. This could be implemented in PyTorch, so it shouldn't stop the array API from standardising it. I just think it shouldn't be the default.

Also, in terms of standardisation, I prefer the API we have in PyTorch for lu_solve over the one exposed in SciPy. In particular, we don't expose the trans parameter, but simply act on the strides of the input tensor to choose the best value for this parameter. We also act on the conj-bit that PyTorch implements to solve A^H X = B, but I don't know if NumPy implements that.


nikitaved commented May 19, 2023

I don't think the diagonal case you mentioned is correct, but it's true that there is at least one family of matrices I know of for which the LU decomposition exists: column diagonally-dominant matrices.

Agree. Fixed and changed to diagonal dominance which I meant in the first place :)

But on the topic of flexibility and performance, my opinion is that having control over pivoting is a plus. In PyTorch we even have these _ex methods that do not inform about potential issues, all for the sake of performance and at the cost of correctness. Performance is quite critical in modern data-intensive applications and seems very much desired... I am not 100% sure that everything has to be fool-proof. So, I believe it is great that CUDA support for LU allows for pivot=False, and we can do the same for the CPU if the need arises, yes. Especially if the standard requires it.

And I agree that having trans could be redundant and that the API in PyTorch is indeed more intuitive and natural. I wish it was possible to absorb pivots just as seamlessly.


ogrisel commented May 22, 2023

But it's not clear if PyTorch's pivot=False is equivalent to SciPy's permute_l=True or doing something different.

It looks like it's doing the exact same thing.

I am not sure but I cannot quickly check since I do not have access to a cuda GPU at the moment:

By reading the docstring of torch.linalg.lu, I am under the impression that the returned L is always lower triangular with ones on the diagonal, whether or not pivot=False, while this is not the case for scipy.linalg.lu: permute_l=True returns the product P @ L, which is unlikely to be lower triangular.
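The SciPy side of this can be checked on CPU. A small sketch showing that the first output of permute_l=True is generally not lower triangular (unlike the unit lower-triangular L that torch.linalg.lu documents):

```python
# SciPy's permute_l=True returns P @ L, whose rows are permuted,
# so it is in general not lower triangular.
import numpy as np
from scipy.linalg import lu

X = np.array([[0.0, 1.0],
              [1.0, 0.0]])
PL, U = lu(X, permute_l=True)
assert not np.allclose(PL, np.tril(PL))  # rows of L are permuted
```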

EDIT: The fortran code for the scipy LU implementation is here:


ogrisel commented May 22, 2023

Also in the case of randomized_range_finder used internally by scikit-learn's PCA(solver="randomized"), I confirm that we are interested in applying LU with tall matrices for which the permutation matrix P would not fit in memory, hence implicitly precomputing the P @ L product is necessary (otherwise one would get a crash instead of a slowdown).


lezcano commented May 22, 2023

If you have a tall matrix, what you often want to do is to compute the PLU of the transposed matrix and use it to solve the given system.

Even more, you very rarely want to use linalg.lu. That's a function that's there just for illustrative purposes most of the time. What you want to use is linalg.lu_factor and then linalg.lu_solve. lu_factor returns a permutation vector that represents the full permutation matrix that may then be used in lu_solve. You can also implement your own lu_solve in terms of solve_triangular and scatter. We use this sometimes in PyTorch:
https://github.com/pytorch/pytorch/blob/4f2c007a1b5170c2aa0d47e388ff9e07c7a7d354/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp#L2425-L2463


lezcano commented May 22, 2023

Also, thank you for digging this up: #627 (comment). It looks to me like we don't want to include the permute_l flavor, as it's just not useful. The matrix P @ L does not have any structure, and it may not be used to solve a system in any way.

@oleksandr-pavlyk

@lezcano I agree. I think P should not be returned as a matrix, but rather as a vector encoding the permutation. I would suggest that vector be the image of range(min(M, N)) under the permutation. That way we could have (L @ U)[p[i], j] == X[i, j]. The empty P array would be understood as the identity permutation.

@rgommers

Cc @ilayn who is currently making changes to scipy.linalg.lu


lezcano commented May 22, 2023

@oleksandr-pavlyk I was referring to the output of linalg.lu_factor which returns the permutation encoded as a vector, but not with the usual encoding. linalg.lu_factor is just an interface into BLAS' getrf, so it returns the pivots with the same format as BLAS does.
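To make the "BLAS format" concrete: scipy.linalg.lu_factor returns 0-based sequential row swaps (row i was interchanged with row piv[i]), which can be expanded into an ordinary permutation vector. An illustrative sketch:

```python
import numpy as np
from scipy.linalg import lu_factor

X = np.random.default_rng(0).standard_normal((5, 5))
A, piv = lu_factor(X)  # piv[i]: row i was interchanged with row piv[i]

# Expand the sequential swaps into a plain 0-indexed permutation vector.
perm = np.arange(X.shape[0])
for i, p in enumerate(piv):
    perm[[i, p]] = perm[[p, i]]

# Unpack the packed LU factor and check that X[perm] == L @ U,
# i.e. P @ X == L @ U for the permutation encoded by perm.
L = np.tril(A, k=-1) + np.eye(X.shape[0])
U = np.triu(A)
assert np.allclose(X[perm], L @ U)
```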

For these functions, a minimal set of functions from which all the others can be implemented is given by:

  • linalg.lu_factor which would just be modelled after scipy.linalg.lu_factor (which is BLAS' getrf)
  • Something along the lines of linalg.qr_multiply, which could be called linalg.lu_multiply (name TBD), with signature lu_multiply(pivots: Array, B: Array, trans: bool = False) that, given the pivots in BLAS' format and a matrix B, computes P @ B if not trans else P^T @ B

With those two, you can implement lu_solve in terms of them and solve_triangular. You can also implement linalg.lu in terms of these + tril/triu. IMO, I would be +1 to providing linalg.lu_solve even if it's redundant, and I'd suggest not to include linalg.lu.


ogrisel commented May 22, 2023

In the API proposed in #627 (comment), there is a specific function to compute P @ B for an arbitrary B matrix, so that's how you would compute P @ L.

I suspect that this would cause quite a bit of overhead compared to the current API but maybe it's negligible for large enough data.

Now, I do not know a priori why you would need P @ L rather than just L, but that's a story for another day.

It's quite possible that the LU re-normalization trick was written this way (with permute_l=True) to avoid attempting the allocation of P which would make it crash for tall matrices when using the scipy.linalg.lu API but that using the unpermuted L would have worked similarly. I would need to confirm that hypothesis with some numerical experiments.


tygert commented May 22, 2023

Now, I do not know a priori why you would need P @ L rather than just L, but that's a story for another day.

It's quite possible that the LU re-normalization trick was written this way (with permute_l=True) to avoid attempting the allocation of P which would make it crash for tall matrices when using the scipy.linalg.lu API but that using the unpermuted L would have worked similarly. I would need to confirm that hypothesis with some numerical experiments.

Note that the matrix being re-normalized is often highly rank-deficient. That said, the matrix is also randomized, although the degree of randomization varies. Omitting all pivoting with a rank-deficient matrix for which the randomization may be less than perfect sounds potentially numerically unstable to me. But allocating space for a full permutation matrix (rather than just for the permutation defining it) sounds unnecessarily wasteful to me, yup. Maintaining the pivoting yet without having to form all entries of the full permutation matrix would make sense, perhaps?


lezcano commented May 23, 2023

Omitting all pivoting with a rank-deficient matrix for which the randomization may be less than perfect sounds potentially numerically unstable to me.

I think a confusion here is coming from the fact that using permute_l=False in scipy seems to hint that it'll perform an LU decomposition without pivoting. This is not the case. As discussed above, there is no way to perform an LU decomposition without pivoting in scipy (and I argue in #627 (comment) that it should not be in this API either, at least for now).

With that out of the way, my point is that, when doing an LU decomposition with partial pivoting, L and P @ L will have exactly the same rank (they will both be full rank, regardless of the rank of the initial matrix). Given this, I argue that using L is, a priori, as good a choice as using P@L. This may be done by

X = linalg.lu_factor(A)[0]
L = np.tril(X, k=-1)
np.fill_diagonal(L, 1)  # note: np.fill_diagonal works in place and returns None

If you want to keep the information of the rank of the input matrix, you should keep the U, potentially dividing by the non-zero elements of the diagonal. Or even better, you should look into performing an LDU factorisation.
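The LDU idea mentioned above can be sketched as follows, splitting U into its diagonal D and a unit upper-triangular factor (this assumes the diagonal of U is nonzero, which holds generically for a full-rank random matrix):

```python
# Sketch: turn P L U into P L D U' with U' unit upper triangular.
import numpy as np
from scipy.linalg import lu

X = np.random.default_rng(0).standard_normal((4, 4))
P, L, U = lu(X)
d = np.diag(U)
D = np.diag(d)
U_unit = U / d[:, None]   # scale rows so the diagonal of U becomes 1
assert np.allclose(P @ L @ D @ U_unit, X)
```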

But again, all this is deviating from the main topic of conversation. The point here is that all these things (permute_l=False/True) can be implemented with the API proposed in #627 (comment).


ogrisel commented May 23, 2023

Something along the lines of linalg.qr_multiply, which could be called linalg.lu_multiply (name TBD), with signature lu_multiply(pivots: Array, B: Array, trans: bool = False) that, given the pivots in BLAS' format and a matrix B, computes P @ B if not trans else P^T @ B

One way to cleanly implement this would be to standardize a Linear Operator API as part of the Array API spec but I guess this is out-of-scope for now (since it is kind of related to sparse arrays).

For reference, here are some implementations of linear operators in the ecosystem:


ilayn commented May 23, 2023

As discussed above, there is no way to perform an LU decomposition without pivoting in scipy

That's because we don't want LU success to be data dependent. The LU decomposition is not guaranteed to exist without pivoting. Partial pivoting failures are only possible by carefully designing pathological arrays, so they virtually never occur. Full pivoting puts LU into the "very expensive" bucket, and you'd probably be better off with QR anyway. Hence partial pivoting more or less guarantees output; no pivoting makes it a hopeful decomposition.

I don't know what the scope of this array API is, but decomposition and solving are completely different operations. Julia implements a Factorize class to hold the best factorization by analyzing the structure of the array, and recycles it for operations coming in later, be it a solve, a multiply, or an up/downdate, whatever the use case would be. So the action uses the factorization depending on the case. Without that, it is going to be a user burden to judge the right decomposition and choose the right action afterwards. That I would expect the API to hide away from me. You can enforce it with additional keywords etc., but the default way should be the correct way most of the time.

Like I mentioned, lu_solve and lu_factor were historical mistakes in the effort to expose getrf, getrs to Python people as separate steps for cases where you have multiple RHS/LHS, and shouldn't have been there. I am hoping that we don't make the same mistake again with new APIs, especially exposing LAPACK piv, ipiv swaps that only other LAPACK routines understand, instead of returning meaningful and directly consumable permutation indices.

I think a confusion here is coming from the fact that using permute_l=False in scipy seems to hint that it'll perform an LU decomposition without pivoting

We have an explicit remark about this in docs already but if you still find it confusing I'd like to fix it right away if you have an alternative wording.


ogrisel commented May 23, 2023

I experimented in a branch of scikit-learn to use the new scipy API from the lu_no_fortran branch to return the permutation data as integer indices (as an alternative to precomputing and returning the P @ L product) for the use case of LU-based normalization in randomized SVD following @tygert's paper cited above.

In conclusion:

  • discarding the permutation info is empirically detrimental as @tygert anticipated above,
  • using matrix_p=False (to return an array of indices instead of P) followed by fancy indexing L to apply the permutation is fast enough for our use case in scikit-learn (compared to letting scipy precompute the P @ L product internally using permute_l=True).
  • I also checked that the new Cython + LAPACK code in SciPy for the permute_l=True case was as fast as the old fortran code.

In my opinion, the future standard xp.linalg.lu API should:

  • always return the permutation info as an array of indices for the sake of memory usage,
  • not expose the permute_l=True option of scipy because it can be quickly reproduced via fancy indexing,
  • not expose the pivot=False option of PyTorch because it can fail in a data dependent way and not all libraries will be willing to implement this option,
  • always return 3 arrays (non-polymorphic API).

Here is a summary of the typical snippet following the API I would expect in the standard (if we ignore any API backward compatibility concerns):

X = xp.asarray(np.random.randn(42, 43))
P_indices, L, U = xp.linalg.lu(X)
assert xp.linalg.norm(X - L[P_indices] @ U) < 1e-12

This means that scipy could implement this with its new matrix_p option but not with its default value. But maybe numpy.array_api could.

Not sure how to deal with backward compat for pytorch either.


lezcano commented May 23, 2023

It is true that the lu_multiply proposed in #627 (comment) can be implemented in terms of advanced indexing (or, more generally, a torch.scatter in the batched case). Now, what I don't like about the API proposed in #627 (comment) is that it doesn't allow for the most common use case of this API, which is calling lu_factor and then lu_solve to solve a system of linear equations (or more than one when they share their lhs). This is rather typical when solving matrix ODEs numerically, for example.

I would propose then to just have one function:
lu_factor(X: Array, p_indices: bool = False) -> Tuple[Array, Array]
The semantics of this function would be:

  • When called with p_indices=False, it would return the matrix and the pivots returned by BLAS
  • When called with p_indices=True, the matrix would still be that returned by BLAS, but the pivots would define a 0-indexed permutation, so that they can be used as indices within an advanced indexing operation.

Note that p_indices is what scipy/scipy#18358 calls matrix_p. Choose whichever name you like best.

Uses of this API:

  • The most obvious one: Its output can be fed into lu_solve as-is to implement a linalg.solve
  • One may extract the L returned by scipy.linalg.lu as discussed in RFC: add support for LU factorization in the linalg extension #627 (comment). The U may be obtained by simply calling triu on the output of this function.
  • If you set p_indices=True, you may compute P @ B for an arbitrary matrix B by doing B[perm] where perm is the second array (vector in the non-batched case) that this function returns
  • As a corollary of the previous point, one may compute the matrix P explicitly by doing P = xp.eye(n)[perm]
  • If one wants to compute P.T @ B for a given matrix (as you need to do to compute the adjoint of lu_solve) one can do that noting that the inverse permutation of a permutation vector perm is given by argsort(perm)
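The last three bullets can be checked with a quick NumPy sketch, assuming perm is a plain 0-indexed permutation vector:

```python
# Permutation identities used above: P from perm, P @ B via indexing,
# and P.T @ B via the inverse permutation argsort(perm).
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3
perm = rng.permutation(n)
B = rng.standard_normal((n, k))

P = np.eye(n)[perm]                               # explicit permutation matrix
assert np.allclose(P @ B, B[perm])                # P @ B via advanced indexing
assert np.allclose(P.T @ B, B[np.argsort(perm)])  # P.T @ B via inverse perm
```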


ilayn commented May 23, 2023

Since you folks are working with insane amounts of data and tools I can't even begin to understand, just a few remarks about the speed of LU or any other linalg stuff in SciPy:

  • Quite some performance is lost in check_finite across the board. If you have alternative ways to guarantee finiteness then you'll get much more out of that implementation.
  • Another performance eater is the Fortran-nannying: swapping the C memory layout to Fortran layout and back. I tried my best to keep the number of copy operations to 2 (one to LAPACK, one from LAPACK), but I can't claim expertise, so I probably still left quite a bit of performance on the table there. We are working on getting rid of this Fortran ping-pong.
  • There is nothing magical happening in getrf, and I would cautiously encourage alternative getrf implementations for distributed cases. I will try to make that happen sometime this summer to reclaim that performance with a pure C-layout implementation.
  • The permutation does not necessarily need to be applied to the array, since that shuffles the memory order and costs performance, especially if a copy is created. Instead, a permuted solver can go a long way. LAPACK doesn't offer this, but as per the previous point, rows are contiguous in C layout, which offers quite nice cache-friendly operations in the LU case.

That is to say, instead of

Bp = B[p, :]            # materialises a permuted copy of B

for r in range(n):
    foo = Bp[r, :] @ bla

you can do

for r in range(n):
    foo = B[p[r], :] @ bla   # index through p; no copy of B

without any data motion on B (a trivial example, but I hope it conveys the message).
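As a concrete check, the two access patterns give identical results, but the second never materialises a permuted copy of B (toy NumPy sketch with arbitrary stand-in names for ilayn's B, p and bla):

```python
import numpy as np

rng = np.random.default_rng(42)
B = rng.standard_normal((5, 3))
bla = rng.standard_normal((3, 2))
p = np.array([3, 0, 4, 1, 2])    # some permutation of the rows

# Variant 1: materialise the permuted matrix (extra copy of B)
Bp = B[p, :]
rows_copy = [Bp[r, :] @ bla for r in range(5)]

# Variant 2: index through the permutation; no data motion on B
rows_view = [B[p[r], :] @ bla for r in range(5)]

assert all(np.allclose(a, b) for a, b in zip(rows_copy, rows_view))
```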

@ilayn
Copy link

ilayn commented May 23, 2023

Note that p_indices is what scipy/scipy#18358 calls matrix_p. Choose whichever name you like best.

While I have the audience: if you have a better name, please suggest it. We considered

p_indices
perm_as_1d
p_as_1d
p_as_vec
perm_vector
return_1d
return_vector

and many other ugly candidates. We went with matrix_p out of desperation, and to rhyme with permute_l.

@ogrisel
Copy link
Author

ogrisel commented May 24, 2023

I think I find p_indices more explicit. Or maybe p_as_indices to avoid confusion with a local variable that could be used to receive the returned integer array. No strong opinion though.

@ogrisel
Copy link
Author

ogrisel commented May 24, 2023

@lezcano I am not sure I understand your proposal in: #627 (comment), in particular "pivots returned by BLAS". Could you please illustrate how your API proposal would feel from an end-user perspective?

@lezcano
Copy link
Contributor

lezcano commented May 24, 2023

My proposal accounts for having

# p_indices=False
assert xp.linalg.lu_factor(X) == scipy.linalg.lu_factor(X)

# p_indices=True
A, perm = xp.linalg.lu_factor(X, p_indices=True)
assert A == scipy.linalg.lu_factor(X)[0]
assert perm == scipy.linalg.lu(X, matrix_p=True)[2]

@ogrisel
Copy link
Author

ogrisel commented May 25, 2023

@lezcano thanks, but on its own this is not very user friendly. I suppose users need extra steps to retrieve the L and U components from A using xp.tril and xp.triu with the correct value for the diagonal offset parameter k.

So the full snippet would be:

# p_indices=False
A, P = xp.linalg.lu_factor(X)
L, U = xp.tril(A, k=-1) + xp.eye(A.shape[0]), xp.triu(A)  # L has a unit diagonal
assert xp.linalg.norm(X - P @ L @ U) < 1e-12

or:

# p_indices=True
A, p_indices = xp.linalg.lu_factor(X, p_indices=True)
L, U = xp.tril(A, k=-1) + xp.eye(A.shape[0]), xp.triu(A)  # L has a unit diagonal
assert xp.linalg.norm(X - L[p_indices] @ U) < 1e-12

I find the original suggestion in #627 (comment) way more user-friendly. Maybe we could offer both, that is, still expose the lower-level lu_factor API with the combined A = L + U in cases where the extra memory usage is a concern.

Note that if there is a concern about backward compatibility with existing libraries and default parameters, it's possible to use a different function name (e.g. lu_decompose).

Also: what is the stance of the Array API standardization committee on "soft" polymorphic functions, where the number of array elements in a tuple of results does not depend on the value of the function argument, but the dtype and ndim of those arrays do (as would be the case with the p_indices kwarg)? Wouldn't that be a problem for libraries such as jax where the JIT compiler would possibly fail on such softly polymorphic outputs?

@lezcano
Copy link
Contributor

lezcano commented May 25, 2023

The full snippet would look like:

# p_indices=False: Solve a linear system
A, P = xp.linalg.lu_factor(X)
sol = xp.linalg.lu_solve(A, P, Y)
assert xp.linalg.norm(X @ sol - Y) < 1e-12

# Reconstruct the full PLU decomposition
A, p_indices = xp.linalg.lu_factor(X, p_indices=True)
L, U = xp.fill_diagonal(xp.tril(A), 1), xp.triu(A)
assert xp.linalg.norm(X - L[p_indices] @ U) < 1e-12
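For readers without the LAPACK semantics at hand, the packed A + permutation-vector representation can be sketched in plain NumPy (a toy Doolittle factorization with partial pivoting; real implementations call LAPACK's getrf, and the perm convention here, X[perm] == L @ U, is chosen purely for clarity of the sketch):

```python
import numpy as np

def lu_factor_packed(X):
    """Toy LU with partial pivoting (illustrative sketch, not scipy's API).

    Returns (A, perm) where A packs L (strict lower part; unit diagonal
    implicit) and U (upper triangle), and perm is a permutation vector
    such that X[perm] == L @ U.
    """
    A = np.array(X, dtype=float)
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))        # row with the largest pivot
        if p != k:
            A[[k, p]] = A[[p, k]]                  # swap rows in the working matrix
            perm[[k, p]] = perm[[p, k]]            # record the swap
        A[k + 1:, k] /= A[k, k]                    # multipliers -> strict lower part
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return A, perm

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
A, perm = lu_factor_packed(X)
L = np.tril(A, k=-1) + np.eye(4)                   # unpack: unit-diagonal L...
U = np.triu(A)                                     # ...and U, which owns A's diagonal
assert np.allclose(X[perm], L @ U)
assert np.allclose(X, (L @ U)[np.argsort(perm)])   # undo via the inverse permutation
```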

The idea behind this API is based on three principles, but here is where I may be wrong, and it may be helpful for other members of the standardisation committee to chip in:

  • In almost all cases, this function will be used through the p_indices=False API to solve linear systems or compute a determinant.
  • The API should be easily implementable for all libraries (CPU / CUDA) with the current tools (most libraries rely on Blas / cuBLAS / Magma implementations)
  • The API should be expressive enough yet compact.

As discussed above, the proposed xp.linalg.lu_factor allows for implementing both scipy.linalg.lu_factor and scipy.linalg.lu using just one function.

I think the conciseness point here is important, as in PyTorch we used to have lu_factor, and then lu_unpack. This second one would unpack the L, U and the pivots. In my opinion, integrating the unpacking of the returned permutation into a set of indices (i.e. a permutation as defined in mathematics) provides for an even more concise API.

I find the original suggestion in #627 (comment) way more user-friendly.

The suggestion in that comment does not account for the most common case, which is using the LU decomposition to solve a linear system.

When it comes to user-friendliness, I believe that the array API is not in the business of offering plenty of auxiliary functions, but of being expressive enough that common constructs can be written efficiently in it. Auxiliary functions could potentially go into a library that builds on top of the array API. For example, using this explicit construction in your experiments in ogrisel/scikit-learn#16 would save you from instantiating U explicitly, halving the peak memory and saving one full read and write over the data, and potentially allowing you to perform the computation of L in place, saving even more memory and one further allocation.

When it comes to efficiency, note that using the NumPy API (that is, without fusion of operations), it's not possible to do better efficiency-wise than the described tril + diagonal_fill + triu when it comes to computing L and U explicitly.

Note that if there is a concern about backward compatibility with existing libraries and default parameters,

The proposed API is backward compatible with other libraries, as its defaults are the same as those of SciPy and PyTorch.

Also: what is the stance of the Array API standardization committee on "soft" polymorphic functions, where the number of array elements in a tuple of results does not depend on the value of the function argument, but the dtype and ndim of those arrays do (as would be the case with the p_indices kwarg)

p_indices does not change the shape of the output. It simply performs an iterative algorithm on the indices. Here's the algorithm as implemented in PyTorch:
https://github.com/pytorch/pytorch/blob/4882cd08013733a5dbe299871ad7e974bce074b3/aten/src/ATen/native/BatchLinearAlgebraKernel.cpp#L1119-L1128
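That swap-replay loop is small; here is a hedged NumPy sketch of the same idea (0-based pivots here, whereas LAPACK's ipiv is 1-based; pivots_to_permutation is a hypothetical name):

```python
import numpy as np

def pivots_to_permutation(piv):
    """Unpack LAPACK-style sequential row swaps into a permutation vector.

    piv[i] = p means "row i was swapped with row p at step i" (0-based in
    this sketch; LAPACK's ipiv is 1-based).
    """
    perm = np.arange(len(piv))
    for i, p in enumerate(piv):
        perm[[i, p]] = perm[[p, i]]   # replay the recorded swap
    return perm

# replaying the swaps [2, 2, 3, 3] on the identity order [0, 1, 2, 3]:
assert pivots_to_permutation([2, 2, 3, 3]).tolist() == [2, 0, 3, 1]
```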

@ogrisel
Copy link
Author

ogrisel commented May 25, 2023

Thanks for the explicit summary.

p_indices does not change the shape of the output.

I am not sure what you mean by shape. I would expect the following:

>>> A, p_indices = xp.linalg.lu_factor(X, p_indices=True)
>>> p_indices.dtype.kind
'i'
>>> p_indices.ndim
1

while:

>>> A, P = xp.linalg.lu_factor(X, p_indices=False)
>>> P.dtype.kind
'f'
>>> P.ndim
2

this is what I meant by "soft polymorphic outputs".

@lezcano
Copy link
Contributor

lezcano commented May 25, 2023

When p_indices=False, xp.linalg.lu_factor would return the same thing as scipy.linalg.lu_factor. In particular, it returns a 1D array of pivot indices that can be fed into lu_solve.

If you want to recover the full 2D matrix P, you would do as discussed in the penultimate point in #627 (comment)

[...] one may compute the matrix P explicitly by doing P = xp.eye(n)[perm]

where perm = xp.linalg.lu_factor(X, p_indices=True)[1]

@ogrisel
Copy link
Author

ogrisel commented May 30, 2023

I now realize that I misunderstood how the piv arrays returned by LAPACK actually work. The example in the docstring of scipy.linalg.lu_factor seems to be broken.

I find exposing that low-level, 1-based pivot info quite confusing anyway. I have the feeling it should not be part of the standard API, or at least not with the default options.

@ilayn
Copy link

ilayn commented May 30, 2023

I can fix the docs if something is broken. What kind of issue do you have with it?

And about exposing it: yes, that's what I meant in the last part of #627 (comment) about the "mistake". Note that they are pivot "swaps" on A, not permutations on L.

@lezcano
Copy link
Contributor

lezcano commented May 30, 2023

Which API would you guys then propose that can be implemented with the current tools (BLAS/cuBLAS/cuSolver/MAGMA) that allows for solving linear systems fast, possibly with a common lhs?

@ilayn
Copy link

ilayn commented May 30, 2023

My vote goes to a Factorize class like Julia or Rust implement. The name can be something else, but a polyalgorithm is a must in this day and age for these operations. lu_factor, chol_solve etc. in my opinion belong to the Fortran age.

If you have an array A and pass it to Factorize (or whatever the name), then your array's factorization is stored depending on the properties of A. Those properties can be discovered at load time, or can be enforced as Factorize(A) or Factorize(A, assume_a='sym'). Then you can call solve on it and it will use whatever factorization is stored internally. You can repeat solve with a different RHS, or perform different ops like .update(), .downdate(), det, rcond() in a lazy-loading way. It can also hold multiple factorizations.

If you need an explicit factorization, then you can still call lu, qr and so on. That's also sparse- and dense-agnostic, and pretty much what MATLAB offers as its most famous convenience. You don't focus on the operation itself; you focus on the data you provide, and the algorithms handle the complication.

Otherwise we will repeat the same Fortran API for yet another two decades. That's pretty much what I wanted to do in scipy/scipy#12824, but then I lost my way due to other things. It tries to address the discovery part for solve. The idea is that linalg APIs have moved on, and it is quite inconvenient to fight with the LAPACK API just to get a single operation with internal arrays, convert things from the internal LAPACK representation to usable Scientific Python forms, etc.

We are now trying to reduce our Fortran77 codebase in SciPy, which is taking quite some time, and after that I am hoping to get back to this again. Anyway, that's a rough sketch, but that's what I wish to have in the end.
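As an illustration only, the kind of polyalgorithm object described above might look like the following toy sketch (all names, including Factorize and assume_a, are hypothetical; a real version would dispatch to LU, banded, triangular and sparse solvers rather than the simple Cholesky/QR pair used here):

```python
import numpy as np

class Factorize:
    """Toy polyalgorithm factorization object (hypothetical API sketch).

    Picks a factorization from the properties of A and reuses it for
    repeated solves, roughly in the spirit of Julia's `factorize`.
    """
    def __init__(self, A, assume_a=None):
        A = np.asarray(A, dtype=float)
        sym = assume_a == 'sym' or np.allclose(A, A.T)
        if sym:
            try:
                self._chol = np.linalg.cholesky(A)   # SPD fast path
                self._kind = 'cholesky'
                return
            except np.linalg.LinAlgError:
                pass                                 # symmetric but not PD
        self._q, self._r = np.linalg.qr(A)           # generic fallback
        self._kind = 'qr'

    def solve(self, b):
        if self._kind == 'cholesky':
            y = np.linalg.solve(self._chol, b)       # forward substitution
            return np.linalg.solve(self._chol.T, y)  # back substitution
        return np.linalg.solve(self._r, self._q.T @ b)
```

Usage would then be data-driven rather than operation-driven, e.g. `Factorize(A).solve(b)` reusing the stored factorization for every subsequent right-hand side.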

@ogrisel
Copy link
Author

ogrisel commented May 31, 2023

I can fix the docs if something is broken. What kind of issue do you have with it?

I thought the example in the docstring was broken (I thought I had copied and pasted the snippets and it had failed), but that's actually not the case. Still, I think it can be helpful to be more explicit: scipy/scipy#18594.

For some reason I thought I had read somewhere that piv would be 1-based indexed (as in LAPACK), but it's actually 0-based indexed in scipy.

@kgryte kgryte changed the title LU factorization in the linalg extension RFC: add support for LU factorization in the linalg extension Apr 4, 2024
@kgryte kgryte added RFC Request for comments. Feature requests and proposed changes. Needs Discussion Needs further discussion. labels Apr 4, 2024
@ogrisel
Copy link
Author

ogrisel commented May 7, 2024

I find the proposal in #627 (comment) interesting. Would it be possible to implement it on top of the existing building blocks available in numpy/scipy, torch and co?

If it can be done, (e.g. maybe in a proof of concept repo), maybe it can help move the spec discussion forward if it is backed by working code.
