Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Element-wise +, -, *, / for BoltArraySpark #94

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

kr-hansen
Copy link

Added basic element-wise addition (+), subtraction (-), multiplication (*), and division (/) to BoltSparkArrays.

Method used is _rdd().join().mapValues(). Attempted to play around with _rdd().cogroup().reduceByKey() but cogroup() returns interables and I was not familiar enough with them to spend the time troubleshooting other methods for a time efficiency comparison. May be a better method to implement these operations, but I went with the best I knew how.

I didn't include any of the reverse functions (radd etc) as I wasn't sure how it would behave/interact with local array functions if there was an attempt to list the local array first (local + Spark). There are no issues as long as it is always two arrays being added together and the Spark array is listed first (Spark + local).

I also didn't add support for adding/subtracting/multiplying/dividing by a constant integer across the whole array. I wasn't sure if this is functionality we would want to add, as it can be done using map(). My personal opinion is to require the user to use map() because I feel it forces the user to remember they are working on distributed arrays. However, this functionality could be easily implemented into the existing functions with some if statements in checking for integers and doing the map() under the hood.

On a similar note to above, these functions could also be changed to not use the symbolic values (+, -, *, /) but be calls required in a map format (.add(), .subtract(), multiply(), divide()).

Finally, if there is an intention to add many operations for distributed BoltSparkArrays similar to all the functions for python ndarrays, perhaps an independent class or .py file should be created to contain these operations that could then be called? I wasn't sure of the long-term intentions of what types of functions there is an interest to integrate into BoltArrays.

Tried to include similar safety catches as concatenate().

For division, true division from future is always performed.

Basic troubleshooting was performed without any major issues, but more thorough troubleshooting may be useful.

Include elements of Spark Arrays that allow for element-wise addition, subtraction, multiplication, and division.
Fixed __truediv__ functionality.  Submitted this as a patch to the main branch with the following note:

Added basic element-wise addition (+), subtraction (-), multiplication (*), and division (/) to BoltSparkArrays.

Method used is `_rdd().join().mapValues()`.  Attempted to play around with `_rdd().cogroup().reduceByKey()` but `cogroup()` returns interables and I was not familiar enough with them to spend the time troubleshooting other methods for a time efficiency comparison.  May be a better method to implement these operations, but I went with the best I knew how.

I didn't include any of the reverse functions (__radd__ etc) as I wasn't sure how it would behave/interact with local array functions if there was an attempt to list the local array first (`local + Spark`). There are no issues as long as it is always two arrays being added together and the Spark array is listed first (`Spark + local`).

I also didn't add support for adding/subtracting/multiplying/dividing by a constant integer across the whole array.  I wasn't sure if this is functionality we would want to add, as it can be done using `map()`.  My personal opinion is to require the user to use `map()` because I feel it forces the user to remember they are working on distributed arrays.  However, this functionality could be easily implemented into the existing functions with some if statements in checking for integers and doing the `map()` under the hood.

On a similar note to above, these functions could also be changed to not use the symbolic values (+, -, *, /) but be calls required in a map format (`.add()`, `.subtract()`, `multiply()`, `divide()`).

Finally, if there is an intention to add many operations for distributed BoltSparkArrays similar to all the functions for python ndarrays, perhaps an independent class or .py file should be created to contain these operations that could then be called?  I wasn't sure of the long-term intentions of what types of functions there is an interest to integrate into BoltArrays.

Tried to include similar safety catches as `concatenate()`.

For division, true division from __future__ is always performed.

Basic troubleshooting was performed without any major issues, but more thorough troubleshooting may be useful.
@jwittenbach
Copy link
Contributor

jwittenbach commented May 23, 2016

@kkcthans This is great!

One thought: all of these functions end up having the same boilerplate at the beginning. One nice way to organize this might be to have a generic self.elementwise_binary(f) function that takes an arbitrary function that does an element-wise binary operation on two ndarrays. This __add__, __subtract__, etc could simply call out to this with the right operation.

@kr-hansen
Copy link
Author

Ya, I noticed that when I saw this was all ready implemented within Thunder/base.py.

@freeman-lab said he was going to look at this pull request and talk with you about how to handle its implementation in addition to code all ready in Thunder (Decide what implementations should go where).

I'd be happy to change it to that more generic element_wise function after you guys decide what will be best for the projects and implement it wherever. I'll change it and update appropriately after you guys decide where you want it.

@freeman-lab
Copy link
Member

freeman-lab commented May 24, 2016

Just looked this over with @jwittenbach, it's looking really good! We should definitely put this in bolt, and can then add appropriate wrappers in thunder in another PR. Only two comments (1) it'd be great to leverage a more generic element_wise function here to minimize code duplication and (2) it'd be great if you could add a couple unit tests for these new functions.

Look at what's in https://github.com/bolt-project/bolt/blob/master/test/spark/test_spark_basic.py to see how to structure them. You can run the tests on your own machine just by calling py.test, though note that it requires a local Spark installation. Also note that if you just add some tests to that file it will run them online via travis as soon as you update this PR.

Let us know if you run into trouble!

@kr-hansen
Copy link
Author

I'll look at implementing both of these and do that to update the code

@kr-hansen
Copy link
Author

kr-hansen commented May 25, 2016

Does it matter too much where I put the tests? I was thinking of creating a new test, something like test_spark_binaryoperators.py. Or would it be better if I just append them to the end of the test_spark_basic.py, or even test_spark_functional.py? What is your preference in terms of organization for the long run?

@kr-hansen
Copy link
Author

kr-hansen commented May 25, 2016

Ok @freeman-lab & @jwittenbach , I just added a test_spark_binaryoperators.py file to the test location. It looks like it worked.

I added an elementwise function and tried to make sure it included appropriate checks to work with a BoltSparkArray, BoltLocalArray, or ndarray.

One thing to note (probably in the documentation or something) unless we want to change this to make it more uniform, is that the output array will take the format of the first array. For example with:
c = a + b
c will be a BoltSparkArray if a is a BoltSparkArray, independent of what b is. However, if a is a BoltLocalArray, then c will be a BoltLocalArray, independent of what b is.

Let me know if there are any other things you guys feel I should tweak with it for this pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants