Moving from R to python - 1/7 - IDE
- 1 of 7: IDE
- 2 of 7: pandas
- 3 of 7: matplotlib and seaborn
- 4 of 7: plotly
- 5 of 7: scikit-learn
- 6 of 7: advanced scikit-learn
- 7 of 7: automated machine learning
Introduction
Before I started with R, I used to do quite a lot of python coding. However, back in those days I was still using python 2.7 and was not really following a bona fide data-centric workflow. In this series of notebooks I would like to document how my best practices from R can be carried over to the python universe.
IDE
With RStudio, R has one obvious candidate for the best IDE. Just like R itself, it has been developed especially for maintaining a data-centric workflow. It offers:
- plotting
- interactive variable exploration
- git integration
- code completion
- code documentation
- package building
- unit testing
- markdown support
- package and reproducibility management
- execution of python code
python, on the other hand, was not primarily developed for data science applications but has some great add-on packages that can be used for scientific computation. There are a couple of python IDEs that mimic the RStudio or Matlab interface, such as spyder and rodeo. They support plotting and interactive variable exploration and have great code completion, but they lack git and markdown support. The multi-purpose IDE pycharm, however, seems to support the other features that RStudio offers as well. The professional version has a scientific mode that also mimics the RStudio interface to some degree and provides a decent project structure.
Here we will walk through the features mentioned above and see how they are implemented in pycharm.
Plotting and interactive variable exploration
This requirement is fulfilled by all IDEs.
git integration
git integration in pycharm is pretty straightforward and offers more options than RStudio.
Code Completion
Code completion in RStudio is pretty straightforward: if you press Tab, your namespace is sensibly searched for variables and functions that you might be typing. When you are typing a function, it automatically displays the function documentation and the parameters the function accepts. When you are defining the parameters of a function and hit Tab, RStudio shows a list of all parameters that you have not defined yet, and you can select one from the list.
In python, code completion is done by packages like jedi, which several IDEs seem to rely on; however, code completion feels a bit different. spyder's code completion looks and works exactly like the one in RStudio. pycharm has code completion, but it tends to be cluttered with irrelevant object and method names when it does not find anything in the local namespace. To get the parameters of a function we also have to use an additional shortcut, Ctrl + P (on Windows). This only displays the parameters; we cannot select one and thus have to type it ourselves.
Code Documentation
In R we can only document functions properly when writing a package using roxygen2 comments. We can call the documentation of a function using ?, for example: ?myfunction(). This will display nicely as html in an additional viewer window. To generate consistent roxygen2 comments we can use the sinew package, which generates a sensible template from a finished function. Code examples can be copy-pasted into the console or can be run via example().
In python we can use help(myfunction) or myfunction.__doc__ to print the docstring associated with a function to the console. In pycharm we can place the caret inside the name of any function and press Ctrl + Q to obtain a better-formatted pop-up window containing the docstring.
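As a minimal sketch of the console route, the snippet below defines a toy function (myfunction and its parameters are only placeholders) and shows the different ways of getting at its documentation. inspect.signature is an addition of mine that lists the parameters, which partly makes up for the parameter pop-up that RStudio offers.

```python
import inspect


def myfunction(x, y=1):
    """Add y to x and return the result."""
    return x + y


# The raw docstring is stored on the function object itself.
raw_doc = myfunction.__doc__

# inspect.getdoc() returns the same text with the indentation cleaned up.
clean_doc = inspect.getdoc(myfunction)

# inspect.signature() lists the parameters and their default values.
params = str(inspect.signature(myfunction))

print(clean_doc)
print(params)

# help() prints the signature together with the docstring to the console.
help(myfunction)
```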
The docstring in python can be written for any function and also makes sense outside of package writing. In pycharm we can insert a docstring template into any function definition by placing the caret on the function name in the definition and hitting Alt + Enter. This opens a menu in which one option is to insert a docstring template based on the parameters of the function.
There is a Google style guide for writing docstrings.
Example code in the docstring should start with >>> and the line below should show the return value:
>>> len('foo')
3
Those examples can be run as unit tests by packages such as doctest, nose or unittest.
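To make this concrete, here is a small sketch of a docstring loosely following the Google layout (Args, Returns, Examples), with the examples executed by doctest. The function mean is made up for illustration. Inside a script or module one would typically just call doctest.testmod(); I drive the parser and runner by hand here only so that the result is available as a variable.

```python
import doctest


def mean(values):
    """Return the arithmetic mean of a sequence of numbers.

    Args:
        values: A non-empty sequence of numbers.

    Returns:
        The mean as a float.

    Examples:
        >>> mean([1, 2, 3])
        2.0
        >>> mean([10])
        10.0
    """
    return sum(values) / len(values)


# Extract the >>> examples from the docstring ...
parser = doctest.DocTestParser()
test = parser.get_doctest(mean.__doc__, {"mean": mean}, "mean", None, 0)

# ... and run them; each example passes if the printed value matches.
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)
```

If an example's output ever drifts from what the function returns, results.failed becomes non-zero, which is what makes docstring examples usable as lightweight tests.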
Package Building
Package building and installation is very well integrated in RStudio. There is a steep learning curve, but we can compile documentation, run tests and run the package checks all from within RStudio, and we have several options to create latex, pdf or html documentation.
In python, package building seems to be pretty straightforward: you basically just need an __init__.py file in your directory and you have a package.
There are, however, some standards and naming best practices, which are described here.
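The following sketch illustrates the claim about __init__.py by building a throwaway package in a temporary directory and importing it. The package name demo_pkg_r2py, the module greetings and the function hello are all hypothetical; a real package would live in a project folder and follow the naming conventions mentioned above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create the package directory inside a temporary folder.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg_r2py"
pkg_dir.mkdir()

# __init__.py marks the directory as a package. It may be empty;
# here it re-exports one function from a submodule.
(pkg_dir / "__init__.py").write_text("from .greetings import hello\n")
(pkg_dir / "greetings.py").write_text(
    "def hello(name):\n"
    "    return f'Hello, {name}!'\n"
)

# Make the parent directory importable, then import the package by name.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg_r2py")

greeting = demo_pkg.hello("R user")
print(greeting)
```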
Unit Testing
In R we use testthat and the integrated RStudio build tools to write and execute unit tests. In python the standard package seems to be unittest, while there are a number of different packages for unit testing with a simpler syntax, such as pytest (formerly invoked as py.test). As already mentioned, doctest will run the python examples in your docstrings.
See this blogpost for more information.
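For comparison with testthat, here is a minimal unittest sketch. The function scale and the test names are invented for illustration. With pytest the same checks could be written as plain functions containing assert statements, without the class.

```python
import unittest


def scale(values, factor=2):
    """Multiply every element of values by factor."""
    return [v * factor for v in values]


class TestScale(unittest.TestCase):
    def test_default_factor(self):
        self.assertEqual(scale([1, 2]), [2, 4])

    def test_empty_input(self):
        self.assertEqual(scale([]), [])


# Load the test case and run it programmatically. From the command line,
# "python -m unittest" discovers and runs test files in a similar way.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestScale)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```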
Package Management and Code Reproducibility
Code reproducibility and portability is a big issue in R: since there are so many packages that are constantly updated, it is hard to track the dependencies of a specific analysis. RStudio relies on packrat to ensure reproducibility by saving all packages with each analysis or project, which greatly increases the disk space needed for a project. To tackle this I have written a package called updateR (https://github.com/erblast/updateR), resorting to a workflow where I archive R and all packages installed at that point in time four times a year.
In python we seem to have different options. We can use pip freeze to create a requirements.txt file, we can use an anaconda environment that we share along with our analysis, or we can use docker. All three options seem to be well supported by pycharm.
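On the command line the first option is simply pip freeze > requirements.txt. To show what ends up in such a file, the sketch below approximates it in pure python using importlib.metadata. This is my own approximation and not what pip does internally; among other things it ignores editable installs and packages installed from URLs.

```python
from importlib.metadata import distributions

# Collect "name==version" pins for every distribution visible to this
# interpreter, skipping entries whose metadata lacks a name.
pins = sorted(
    {
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in distributions()
        if dist.metadata["Name"]
    },
    key=str.lower,
)

# One pin per line is the layout of a requirements.txt file.
requirements_text = "\n".join(pins) + "\n"
print(requirements_text)
```

Writing requirements_text to a file next to the analysis and installing it elsewhere with pip install -r requirements.txt recreates the same package versions.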
Markdown Support
RStudio and markdown go hand in hand. .Rmd files are similar to .md files and can contain executable code chunks in either R or python. They can be rendered into numerous formats such as word, slides, dashboards, html, ebooks, pdf and blogposts. .Rmd files are simple text files with a YAML header that defines the rendering options. RStudio even supports one-click publishing of html content to Rpubs, a publication platform for R code.
The python equivalent is the jupyter notebook, which uses the .ipynb format, which is basically a json file. The json format is not easy to read in a plain text editor, so we always have to use an IDE of some sort to edit those files. The best method is to start a jupyter notebook server from the command line and connect to it via a web browser. This opens an easy-to-use web interface that allows for editing .ipynb files and feels a bit like editing .Rmd files in RStudio. We have different cells, and we can define the content of each cell as either markdown or code in a variety of languages. The output of the code, be it console output or a plot, appears right below the cell. Pretty much the same as it is with RStudio.
The difference is in the file format: the .Rmd format does not save any code output inside the file, while .ipynb even stores binary information inside the document. This makes changes in .ipynb files difficult to track with git and practically disqualifies them from being part of an entire analysis or deployment process. We can use .Rmd files as actual building blocks of an entire analysis, with html reporting for each step that is legible to non-coders. The use case for jupyter notebook seems to be the documentation of data exploration, tutorials or proofs of concept.
As far as I can tell we can edit .ipynb files inside pycharm, but I had difficulties saving the changes I had made to the notebooks and ended up losing a lot of work. It seems as if pycharm needs at least one code cell to save the file. I am now more comfortable working with the jupyter notebook web interface.
Execution of foreign code
In RStudio we can use the reticulate package to execute python code and to convert python objects into R objects. We can also define python chunks inside .Rmd documents. However, we only get code completion or interactive variable exploration for python objects once we convert them to R objects.
Both pycharm and jupyter notebook support R. We can probably use feather to pass dataframes between R and python.