Update documentations
freeman-lab committed Aug 23, 2014
1 parent 045d8b7 commit 4783f5a
Showing 5 changed files with 32 additions and 25 deletions.
17 changes: 8 additions & 9 deletions README.md
@@ -34,9 +34,8 @@ thunder is designed to run on a cluster, but local testing is a great way to learn
3) Start thunder from the terminal

thunder
>> from thunder.utils import DataSets
>> from thunder.factorization import ICA
>> data = DataSets.make(sc, "ica")
>> data = tsc.makeExample("ica")
>> model = ICA(c=2).fit(data)

To run in iPython, just set this environment variable before starting:
@@ -55,26 +54,26 @@ We also include a script for launching an Amazon EC2 cluster with thunder preinstalled
Analyses
--------

thunder currently includes five packages: classification (decoding), clustering, factorization, regression, and timeseries, as well as a utils package for loading and saving (see Input format and Output format) and other utilities (e.g. matrix operations). Scripts can be used to run standalone analyses, but the underlying classes and functions can be used from within the PySpark shell for easy interactive analysis.
thunder currently includes five packages: classification (decoding), clustering, factorization, regression, and timeseries, as well as utilities for loading and saving data and for basic visualization. Scripts can be used to run standalone analyses, but the underlying classes and functions can be used from within the PySpark shell for easy interactive analysis.

Input and output
----------------

thunder is built around a common input format for raw neural data: a set of signals as key-value pairs, where the key is an identifier, and the value is a response time series. In imaging data, for example, each record would be a voxel or an ROI, the key an xyz coordinate, and the value a fluorescence time series. This is a useful representation because most analyses parallelize across neural signals (i.e. across records).
thunder is built around a common input format for time series data: a set of signals or channels as key-value pairs, where the key is an identifier, and the value is a time series. In neural imaging data, for example, each record would be a voxel or an ROI, the key an xyz coordinate, and the value a fluorescence time series.

These key-value records can, in principle, be stored in a variety of cluster-accessible formats without affecting the core functionality (besides loading). Currently, the loading function assumes a text file input, where the rows are neural signals and the columns are the keys and values, each number separated by a space. Support for flat binary files is coming soon.
These key-value records can be derived from a variety of cluster-accessible formats. thunder currently includes methods for loading data from text or flat binary files stored locally, in HDFS, or on a networked file system, as well as preliminary support for importing and converting data from other formats.
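Concretely, a single record might look like the following sketch (the coordinates and values here are made up for illustration; a three-element key follows the xyz convention described above):

    import numpy as np

    # one record: the key is an (x, y, z) identifier,
    # and the value is the corresponding time series
    key = (12, 5, 3)
    value = np.array([110.0, 112.5, 108.9, 115.2])
    record = (key, value)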

All metadata (e.g. parameters of the stimulus or behavior for regression analyses) can be provided as numpy arrays or loaded from MAT files; see the relevant functions for more details.
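For example, a stimulus regressor could be supplied either way; a minimal sketch (the file and variable names here are hypothetical):

    import numpy as np
    from scipy.io import loadmat

    # provide metadata directly as a numpy array...
    stim = np.array([0, 0, 1, 1, 0, 1])

    # ...or load it from a MAT file (the variable name inside the file is assumed)
    stim = loadmat("stim.mat")["stim"]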

Results can be visualized directly from the Python shell or iPython notebook, or saved as MAT files, text files, or images.
Results can be visualized directly from the Python shell or the iPython notebook, or saved as images, numpy files, or MAT files. Other output formats are coming soon.
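For instance, saving a result array in each of these formats might look like the following sketch (the array and file names are hypothetical):

    import numpy as np
    from scipy.io import savemat
    import matplotlib.pyplot as plt

    result = np.random.randn(64, 64)            # a hypothetical result, e.g. a tuning map
    plt.imsave("result.png", result)            # as an image
    np.save("result.npy", result)               # as a numpy file
    savemat("result.mat", {"result": result})   # as a MAT file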

Road map
----------------
If you have other ideas or want to contribute, submit an issue or pull request!

- New file formats for input data
- Automatic extract-transform-load for different raw formats (e.g. raw images)
- Integrate more scikit-learn functionality
- Analysis-specific visualizations
- Input format support: HDF5, tif
- Port versions of most common workflows to Scala
- Unified metadata representation
- Streaming analyses
- Port versions of most common workflows to Scala
15 changes: 7 additions & 8 deletions python/README.rst
@@ -42,9 +42,8 @@ thunder is designed to run on a cluster, but local testing is a great way to learn
::

thunder
>> from thunder.utils import DataSets
>> from thunder.factorization import ICA
>> data = DataSets.make(sc, "ica")
>> data = tsc.makeExample("ica")
>> model = ICA(c=2).fit(data)

To run in iPython, just set this environment variable before starting:
@@ -74,21 +73,21 @@ thunder currently includes five packages: classification (decoding), clustering,
Input and output
----------------

thunder is built around a common input format for raw neural data: a set of signals as key-value pairs, where the key is an identifier, and the value is a response time series. In imaging data, for example, each record would be a voxel or an ROI, the key an xyz coordinate, and the value a fluorescence time series. This is a useful representation because most analyses parallelize across neural signals (i.e. across records).
thunder is built around a common input format for time series data: a set of signals or channels as key-value pairs, where the key is an identifier, and the value is a time series. In neural imaging data, for example, each record would be a voxel or an ROI, the key an xyz coordinate, and the value a fluorescence time series.

These key-value records can, in principle, be stored in a variety of cluster-accessible formats without affecting the core functionality (besides loading). Currently, the loading function assumes a text file input, where the rows are neural signals and the columns are the keys and values, each number separated by a space. Support for flat binary files is coming soon.
These key-value records can be derived from a variety of cluster-accessible formats. thunder currently includes methods for loading data from text or flat binary files stored locally, in HDFS, or on a networked file system, as well as preliminary support for importing and converting data from other formats.

All metadata (e.g. parameters of the stimulus or behavior for regression analyses) can be provided as numpy arrays or loaded from MAT files; see the relevant functions for more details.

Results can be visualized directly from the Python shell or iPython notebook, or saved as MAT files, text files, or images.
Results can be visualized directly from the Python shell or the iPython notebook, or saved as images or MAT files. Other output formats are coming soon.

Road map
----------------
If you have other ideas or want to contribute, submit an issue or pull request!

- New file formats for input data
- Automatic extract-transform-load for different raw formats (e.g. raw images)
- Integrate more scikit-learn functionality
- Analysis-specific visualizations
- Input format support: HDF5, tif
- Port versions of most common workflows to Scala
- Unified metadata representation
- Streaming analyses
- Port versions of most common workflows to Scala
7 changes: 3 additions & 4 deletions python/doc/basic_usage.rst
@@ -11,20 +11,19 @@ For an interactive analysis, we first start the shell
thunder
Import the functions and classes we'll need, in this case ``DataSets`` and ``Stats``.
Import the functions and classes we'll need, in this case ``Stats``.

.. code-block:: python
>> from thunder.utils import DataSets
>> from thunder.timeseries import Stats
First we load some toy example data

.. code-block:: python
>> data = DataSets.load(sc, "zebrafish-test")
>> data = tsc.loadExample("fish")
Then use the stats class to compute the standard deviation
``tsc`` is a modified Spark Context, created when you start thunder, that serves as an entry point for loading distributed datasets. Once the data is loaded, use the ``Stats`` class to compute the standard deviation

.. code-block:: python
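Because ``data`` is a distributed set of key-value records, the same quantity can also be computed directly with Spark's core RDD primitives. The following is a sketch using only ``mapValues`` and ``first`` from the Spark API (it assumes each value is a numpy array), not thunder's ``Stats`` interface:

.. code-block:: python

	>> stds = data.mapValues(lambda series: series.std())
	>> stds.first()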
4 changes: 3 additions & 1 deletion python/doc/install_ec2.rst
@@ -64,7 +64,9 @@ Use the iPython notebook
~~~~~~~~~~~~~~~~~~~~~~~~
The iPython notebook is an especially useful way to do analyses interactively and look at results.

To set up the iPython notebook on EC2, just log in to your cluster
To set up the iPython notebook on EC2, you need to do one manual port configuration in the AWS console. Go to the `EC2 dashboard <https://console.aws.amazon.com/ec2/v2/home>`_, click "Security groups" in the list on the left, find the name of your cluster in the list, and click the entry "<cluster-name>-master". So if you called your cluster "test", look for "test-master". After selecting it, in the panel below, click the "Inbound" tab, click "Edit", click "Add rule", then type 8888 in the port range, select "Anywhere" under source, and click "Save".
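The same rule can also be added programmatically. This is a sketch using the ``boto`` library (not part of thunder), and it assumes your cluster is named "test", runs in the us-east-1 region, and that your AWS credentials are configured:

.. code-block:: python

	import boto.ec2

	# open port 8888 on the master node's security group
	conn = boto.ec2.connect_to_region("us-east-1")
	group = conn.get_all_security_groups(groupnames=["test-master"])[0]
	group.authorize(ip_protocol="tcp", from_port=8888, to_port=8888, cidr_ip="0.0.0.0/0")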

The rest is easy. Just log in to your cluster

.. code-block:: bash
14 changes: 11 additions & 3 deletions python/doc/install_local.rst
@@ -9,13 +9,13 @@ To follow along with the instructions below, you'll just need a command line (e.

Installing Spark
~~~~~~~~~~~~~~~~
First you need a working installation of Spark. You can `download <http://spark.apache.org/downloads.html>`_ one of the pre-built versions, or you can download the sources and follow `these instructions <http://spark.apache.org/docs/latest/building-with-maven.html>`_ to build from source.
First you need a working installation of Spark. You can `download <http://spark.apache.org/downloads.html>`_ one of the pre-built versions (pick the one labeled Hadoop 2), or you can download the sources and follow `these instructions <http://spark.apache.org/docs/latest/building-with-maven.html>`_ to build from source.

Once you have downloaded Spark, set an environment variable by typing the following into the terminal (here we assume you downloaded a pre-built version and put it in your downloads folder)

.. code-block:: bash
export SPARK_HOME=~/downloads/spark-1.0.1-bin-hadoop1
export SPARK_HOME=~/downloads/spark-1.0.2-bin-hadoop2
To make this setting permanent, add the above line to your bash profile (usually located at ``~/.bash_profile`` on Mac OS X), and open a new terminal so the change takes effect. Otherwise, you'll need to enter this line during each terminal session.
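For reference, launcher scripts can locate Spark through this variable; a minimal sketch of the idea (this is not thunder's actual launcher code):

.. code-block:: python

	import os

	# resolve the pyspark executable from the SPARK_HOME set above
	spark_home = os.environ["SPARK_HOME"]
	pyspark = os.path.join(spark_home, "bin", "pyspark")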

@@ -41,11 +41,19 @@ If the installation was successful you will have three new commands available: `
If you type ``thunder`` into the terminal it will start an interactive session in the Python shell.

If you want to upgrade to the latest version, we recommend first uninstalling using

.. code-block:: bash
pip uninstall thunder-python
and then installing again to get the latest version.

Dependencies
~~~~~~~~~~~~
Along with Spark, thunder depends on these Python libraries (when installing with ``pip``, these are added automatically).

`numpy <http://www.numpy.org/>`__, `scipy <http://www.scipy.org/>`__, `scikit-learn <http://scikit-learn.org/stable/>`__, `PIL <http://www.pythonware.com/products/pil/>`__, `matplotlib <http://matplotlib.sourceforge.net>`__
`numpy <http://www.numpy.org/>`__, `scipy <http://www.scipy.org/>`__, `scikit-learn <http://scikit-learn.org/stable/>`__, `pillow <https://pypi.python.org/pypi/Pillow>`__, `matplotlib <http://matplotlib.sourceforge.net>`__

We recommend using the `Anaconda distribution <https://store.continuum.io/cshop/anaconda/>`_, which includes these dependencies (and many other useful packages). Especially if you aren't already using Python for scientific computing, it's a great way to start.
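A quick way to confirm the dependencies are importable from the Python shell (note that pillow installs under the module name ``PIL``):

.. code-block:: python

	# each import should succeed once the dependencies are installed
	import numpy, scipy, sklearn, PIL, matplotlib
	print(numpy.__version__)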

