
Releases: thunder-project/thunder

Maintenance release

28 Jul 15:49

This is a maintenance release of Thunder.

The main focus is fixing a variety of deployment- and installation-related issues, and adding initial support for the recently released Spark 1.4. Thunder has not yet been extensively used alongside Spark 1.4, but all core functionality has been verified with this release.

Changes and Bug Fixes


  • Fix launching error when starting Thunder with Spark 1.4 (addresses #201)
  • Fix EC2 deployment with Spark 1.4
  • More informative error messages for import errors on startup
  • Remove pylab when starting notebooks on EC2
  • Improved dependency handling on EC2
  • Updated documentation for factorization methods

Contributions


If you have any questions come chat with us, and stay tuned for Thunder 0.6.0 in the near future.

Fifth development release

02 Apr 18:59

We are pleased to announce the release of Thunder 0.5.0. This release introduces several new features, including a new framework for image registration algorithms, performance improvements for core data conversions, improved EC2 deployment, and many bug fixes. This release requires Spark 1.1.0 or later, and is compatible with the most recent Spark release, 1.3.0.

Major features


  • A new image registration API inside the new thunder.imgprocessing package. See the tutorial.
  • Significant performance improvements to the Images to Series conversion, including a Blocks object as an intermediate stage. The inverse conversion, from Series back to Images, is now supported.
  • Support for tiff image files as an input format has been expanded and made more robust. Multiple image volumes can now be read from a single input file via the nplanes argument in the loading functions, and files can be read from nested directory trees using the recursive=True flag.
  • New methods for working with multi-level indexing on Series objects, including selectByIndex and seriesStatByIndex; see the tutorial.
  • Convenient new getter methods for extracting individual records or small sets of records using bracket notation, as in Series[(x,y,z)] or Images[k].
  • A new serializable decorator to make it easy to save/load small objects (e.g. models) to JSON, including handling of numpy arrays. See saving/loading of RegistrationModel for an example.
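The idea behind the serializable decorator above can be sketched in plain Python. This is an illustrative stand-in, not Thunder's actual implementation: the encoder and hook names are assumptions, and only the core trick of round-tripping numpy arrays through JSON is shown.

```python
import json
import numpy as np

class NumpyJSONEncoder(json.JSONEncoder):
    """Encode numpy arrays as tagged dicts so json.dumps can handle them."""
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return {"__ndarray__": obj.tolist(), "dtype": str(obj.dtype)}
        return super().default(obj)

def decode_ndarray(d):
    """Rebuild numpy arrays from the tagged dicts written by the encoder."""
    if "__ndarray__" in d:
        return np.array(d["__ndarray__"], dtype=d["dtype"])
    return d

# A toy "model" containing numpy arrays, round-tripped through JSON
model = {"transform": np.eye(2), "offset": np.array([1.0, 2.0])}
text = json.dumps(model, cls=NumpyJSONEncoder)
restored = json.loads(text, object_hook=decode_ndarray)
assert np.allclose(restored["transform"], model["transform"])
```

The same pattern generalizes to any small object whose attributes are JSON-serializable values or numpy arrays, which is what makes saving and loading models like RegistrationModel convenient.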

Minor features


  • Parameter files can be loaded from a file with simple JSON schema (useful for working with covariates), using ThunderContext.loadParams
  • A new method ThunderContext.setAWSCredentials handles AWS credential settings in managed cluster environments (where it may not be possible to modify system config files)
  • An Images object can be saved to a collection of binary files using Images.saveAsBinaryImages
  • Data objects now have a consistent __repr__ method, displaying uniform and informative results when these objects are printed.
  • Images and Series objects now each offer a meanByRegions() method, which calculates a mean over one or more regions specified either by a set of indices or a mask image.
  • TimeSeries has a new convolve() method.
  • The thunder and thunder-submit executables have been modified to better expose the options available in the underlying pyspark and spark-submit Spark executable scripts.
  • An improved and streamlined Colorize with new colorization options.
  • Load data hosted by the Open Connectome Project with the loadImagesOCP method.
  • New example data sets available, both for local testing and on S3
  • New tutorials: regression, image registration, multi-level indexing
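The region averaging performed by meanByRegions() can be illustrated with plain numpy. This is a hypothetical sketch of the mask-based variant, not Thunder's code; the function name and return format here are assumptions for illustration.

```python
import numpy as np

def mean_by_regions(image, mask):
    """Average image values within each labeled region of an integer mask.

    Label 0 is treated as background and skipped; returns {label: mean}.
    Mirrors the idea behind meanByRegions(), not its actual implementation.
    """
    means = {}
    for label in np.unique(mask):
        if label == 0:
            continue
        means[int(label)] = float(image[mask == label].mean())
    return means

image = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
mask = np.array([[1, 1],
                 [2, 0]])
assert mean_by_regions(image, mask) == {1: 1.5, 2: 3.0}
```

The index-based variant described above is equivalent to building such a mask from a list of coordinates and averaging over it.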

Transition guide


  • Some keyword parameters have been changed for consistency with the Thunder style guide naming conventions. Examples are the inputformat, startidx, and stopidx parameters on the ThunderContext loading methods, which are now inputFormat, startIdx, and stopIdx, respectively. We expect minimal future changes in existing method and parameter names.
  • The Series methods normalize() and detrend() have been moved to TimeSeries objects, which can be created by the Series.toTimeSeries() method.
  • The default file extension for the binary stack format is now bin instead of stack. If you need to load files with the stack extension, you can use the ext='stack' keyword argument of loadImages.
  • export is now a method on the ThunderContext instead of a standalone function, and now supports exporting to S3.
  • The loadImagesAsSeries and convertImagesToSeries methods on ThunderContext now default to shuffle=True, making use of a revised execution path that should improve performance.
  • The method for loading example data has been renamed from loadExampleEC2 to loadExampleS3.

Deployment and development


  • Anaconda is now the default Python installation on EC2 deployments, as well as on our Travis server for testing.
  • EC2 scripts and unit tests provide quieter and prettier status outputs.
  • Egg files are now included with official releases, so that a pip install of thunder-python can immediately be deployed on a cluster without cloning the repo and building an egg.

Contributions


  • Andrew Osheroff (data getter improvements)
  • Ben Poole (optimized window normalization, image registration)
  • Jascha Swisher (images to series conversion, serializable class, tif handling, get and meanBy methods, bug fixes)
  • Jason Wittenbach (new series indexing functionality, regression and indexing tutorials, bug fixes)
  • Jeremy Freeman (image registration, EC2 deployment, exporting, colorizing, bug fixes)
  • Kunal Lillaney (loading from OCP)
  • Michael Broxton (serializable class, new series statistics, improved EC2 deployment)
  • Noah Young (improved EC2 deployment)
  • Tom Sainsbury (image filtering, PNG saving options)
  • Uri Laseron (submit scripts, Hadoop versioning)

Roadmap


Moving forward we will do a code freeze and cut a release every three months. The next will be June 30th.

For 0.6.0 we will focus on the following components:

  • A source extraction / segmentation API
  • New capabilities for regression and GLM model fitting
  • New image registration algorithms (including volumetric methods)
  • Latent factor and network models
  • Improved performance on single-core workflows
  • Bug fixes and performance improvements throughout

If you are interested in contributing, let us know! Check out the existing issues or join us in the chatroom.

Maintenance release

04 Nov 06:52

We are happy to announce the 0.4.1 release of Thunder. This is a maintenance / bug fix release.

The focus is ensuring consistent array indexing across all supported input types and internal data formats. For 3D image volumes, the z-plane will now be on the third array axis (e.g. ary[:,:,2]), and will be in the same position for Series indices and the dims attribute on Images and Series objects. Visualizing image data with matplotlib’s imshow() function will yield an image in the expected orientation, both for Images objects and for the arrays returned by a Series.pack() call. Other changes are described below.
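In plain numpy terms, the axis convention described above reads as follows (an illustrative sketch with made-up dimensions, not Thunder's code):

```python
import numpy as np

# A 3D volume with x, y, and z axes; z-planes sit on the third axis.
x, y, z = 4, 5, 3
volume = np.arange(x * y * z).reshape(x, y, z)

plane = volume[:, :, 2]       # the third z-plane, as described above
assert plane.shape == (x, y)  # a 2D x-by-y image, ready for imshow()
```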

Changes and Bug Fixes


  • Handling of wildcards in path strings for the local filesystem and S3 is improved.
  • New Data.astype method for converting numerical type of values.
  • A dtype parameter has been added to the ThunderContext.load* methods.
  • Several exceptions thrown by uncommon edge cases in tif handling code have been resolved.
  • The Series.pack() method no longer automatically casts returned data to float16. This can instead be performed ahead of time using the new astype methods.
  • Fixed a bug in which tsc.convertImagesToSeries() did not write output files for tif file input when shuffle=True.
  • A ValueError thrown by the random sampling methods with numpy 1.9 has been resolved (issue #41).
  • The thunder-ec2 script will now generate a ~/.boto configuration file containing AWS access keys on all nodes, allowing workers to access S3 with no additional configuration.
  • Test example data files are now copied out to all nodes in a cluster as part of the thunder-ec2 script.
  • Now compatible with boto 2.8.0 and later versions, for EC2 deployments (issue #40).
  • Fixed a dimension bug when colorizing 2D images with the indexed conversion type.
  • Fixed an issue with optimization approach being misspecified in colorization.
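The astype change above can be illustrated with plain numpy. This sketch uses numpy's own dtypes rather than Thunder's Data.astype wrapper, and shows why an unconditional cast to float16 was problematic:

```python
import numpy as np

series = np.array([1.0, 100000.5, 0.123456789])

# The old pack() behavior cast to float16 unconditionally; float16
# tops out near 65504, so larger values overflow to infinity:
lossy = series.astype(np.float16)
assert np.isinf(lossy[1])

# With the cast now explicit and opt-in, callers can pick a dtype
# that actually fits their data:
safe = series.astype(np.float32)
assert np.isfinite(safe).all()
```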

Thanks


  • Joseph Naegele: reporting path and data type bugs
  • Allan Wong: reporting random sampling bug
  • Sung Soo Kim: reporting colorization optimization issue
  • Thomas Sainsbury: reporting indexed colorization bug

Contributions


Thanks very much for your interest in Thunder. Questions and comments can be sent to the mailing list.

Fourth development release

16 Oct 05:35

We are pleased to announce the release of Thunder 0.4.0.

This release introduces some major API changes, especially around loading and converting data types. It also brings some substantial updates to the documentation and tutorials, and better support for data sets stored on Amazon S3. While some big changes have been made, we feel that this new architecture provides a more solid foundation for the project, better supporting existing use cases, and encouraging contributions. Please read on for more!

Major Changes

  • Data representation. Most data in Thunder now exists as subclasses of the new thunder.rdds.Data object. This wraps a PySpark RDD and provides several general convenience methods. Users will typically interact with two main subclasses of data, thunder.rdds.Images and thunder.rdds.Series, representing spatially- and temporally-oriented data sets, respectively. A common workflow will be to load image data into an Images object and then convert it to a Series object for further analysis, or just to convert Images directly to Series data.
  • Loading data. The main entry point for most users remains the thunder.utils.context.ThunderContext object, available in the interactive shell as tsc, but this class has many new, expanded, or renamed methods, in particular loadImages(), loadSeries(), loadImagesAsSeries(), and convertImagesToSeries(). Please see the Thunder Context tutorial and the API documentation for more examples and detail.
  • New methods for manipulating and processing images and series data, including refactored versions of some earlier analyses (e.g. routines from the package previously known as timeseries).
  • Documentation has been expanded, and new tutorials have been added.
  • Core API components are now exposed at the top-level for simpler importing, e.g. from thunder import Series or from thunder import ICA
  • Improved support for loading image data directly from Amazon S3, using the boto AWS client library. The load* methods in ThunderContext now all support s3n:// schema URIs as data path specifiers.

Notes about requirements and environments

  • Spark 1.1.0 is required. Most functionality will be intact with earlier versions of Spark, with the exception of loading flat binary data.
  • “Hadoop 1” jars as packaged with Spark are recommended, but Thunder should work fine if recompiled against the CDH4, CDH5, or “Hadoop 2” builds.
  • Python 2 is required, version 2.6 or greater.
  • PIL/pillow libraries are used to handle tif images. We have encountered some issues working with these libraries, particularly on OSX 10.9. Some errors related to image loading may be traceable to a broken PIL/pillow installation.
  • This release has been tested most extensively in three environments: local usage, a private research compute cluster, and Amazon EC2 clusters stood up using the thunder-ec2 script packaged with the distribution.

Future Directions

Thunder is still young, and will continue to grow. Now is a great time to get involved! While we will try to minimize changes to the API, it should not yet be considered stable, and may change in upcoming releases. That said, if you are using or contemplating using Thunder in a production environment, please reach out and let us know what you’re working on, or post to the mailing list.

Contributors

Jascha Swisher (@industrial-sloth): loading functionality, data types, AWS compatibility, API
Jeremy Freeman (@freeman-lab): API, data types, analyses, general performance and stability

Maintenance release

11 Sep 02:36

This release includes bug fixes and other minor improvements.

Bug fixes

  • Removed pillow dependency, to prevent a bug that appears to occur frequently in Mac OS 10.9 installations (87280ec)
  • Customized EC2 installation and configuration, to avoid using Anaconda AMI, which was failing to properly configure mounted drives (fixes #21)

Improvements

  • Handle either zero- or one-based indexing in keys (#20)
  • Support requester pays bucket setting for example data (fixes #21)

Maintenance release

04 Sep 06:15

Maintenance release with bug fixes and minor improvements.

Bug fixes

  • Fixed error specifying path to shell.py in pip installations
  • Fixed a broken import that prevented use of Colorize

Improvements

  • Query returns average keys as well as average values
  • Loading example data from EC2 supports "requester pays" mode
  • Fixed documentation typos (#19)

Third development release

23 Aug 22:04

This update adds new functionality for loading data, alongside changes to the API for loading, and a variety of smaller bug fixes.

API changes

  • All data loading is performed through the new Thunder Context, a thin wrapper for a Spark Context. This context is automatically created when starting thunder, and has methods for loading data from different input sources.
  • tsc.loadText behaves identically to the load from previous versions.
  • Example data sets can now be loaded from tsc.makeExample, tsc.loadExample, and tsc.loadExampleEC2.
  • Output of the pack operation now preserves xy definition, but outputs will be transposed relative to previous versions.

New features

  • Include design matrix with example data set on EC2
  • Faster nmf implementation by changing update equation order (#15)
  • Support for loading local MAT files into RDDs through tsc.loadMatLocal
  • Preliminary support for loading binary files from HDFS using tsc.loadBinary (depends on features currently only available in Spark's master branch)

Bug fixes

  • Used pillow instead of PIL (#11)
  • Fixed important typo in documentation page (#18)
  • Fixed sorting bug in local correlations

Second development release

04 Aug 06:45

This is a significant update with changes and enhancements to the API, new analyses, and bug fixes.

Major changes

  • Updated for compatibility with Spark 1.0.0, which brings with it a number of significant performance improvements
  • Reorganization of the API such that all analyses are accessed through their respective classes and methods (e.g. ICA.fit, Stats.calc). Standalone functions use the same classes, and act as wrappers solely for non-interactive job submission (e.g. thunder-submit factorization/ica <opts>)
  • Executables included with the release for easily launching a PySpark shell, or an EC2 cluster, with Thunder dependencies and set-up handled automatically
  • Improved and expanded documentation, built with Sphinx
  • Basic functionality for colorization of results, useful for visualization, see example
  • Registered project in PyPi

New analyses and features

  • A DataSet class for easily loading simulated and real data examples
  • A decoding package and MassUnivariateClassifier class, currently supporting two mass univariate classification analyses (GaussNaiveBayes and TTest)
  • An NMF class for dense non-negative matrix factorization, a useful analysis for spatio-temporal decompositions

Bug fixes and other changes

  • Renamed sigprocessing library to timeseries
  • Replaced eig with eigh for symmetric matrices
  • Used a set and broadcasting to speed up filtering for subsets in Query
  • Several optimizations and bug fixes in basic saving functionality, including a new pack function
  • Fixed handling of integer indices in subtoind

First development release

08 Jan 06:28

First development release, highlighting four newly refactored analysis packages (clustering, factorization, regression, and sigprocessing) and more extensive testing and documentation.

Release notes:

General
Preprocessing is now an optional argument for all analysis scripts
Added accuracy tests for all analyses

Clustering
Max iterations and tolerance are now optional arguments for kmeans

Factorization
Unified singular value decomposition into one function with method option ("direct" or "em")
Made max iterations and tolerance optional arguments to ICA
Added a random seed argument to ICA to facilitate testing

Regression
All functions use derivatives of a single RegressionModel or TuningModel class
Allow input to RegressionModel classes to be arrays or tuples for increased flexibility
Made regression-related arguments to tuning optional

Signal processing
All functions use derivatives of a single SigProcessMethod class
Added crosscorr function

Thanks to many contributions from @JoshRosen!