Chances are that every time you hear the phrase ‘data science’, it’s quickly linked with the word ‘programming’. This stems from the idea that the best data scientists have strong numerical and logical backgrounds; in other words, that programmers and developers make the best data scientists. So does this mean data science is beyond you if you didn’t have a fiery love affair with Node.js, PHP or C++ back in school?
With the advent of data science and everything data analysis, people from all sorts of backgrounds are eager to jump into the domain. As I pointed out in a previous post about data democratisation, this is a good and welcome trend in our society. However, most of today’s workforce comes from a non-programming background, so it should be great news for most people that you don’t need programming to be a data scientist! It’s also fantastic news for people like me who come from a programming background but would rather spend less time tinkering with code!
There are now several data analytics tools that favour user-friendly GUIs, and this article is about them. Happily, many startups are focused on building even better GUI-driven analytics tools, so I’ll aim to update this post as more of them appear.
Here are some tools:
1. RapidMiner
RapidMiner is a subscription-based data science platform. It is made up of products that enable users from both technical and non-technical backgrounds to carry out advanced data analytics. RapidMiner works with a long list of data source types, including Excel, Access, IBM DB2, SPSS, Oracle, Sybase, Microsoft SQL Server, MySQL, Ingres, Postgres, dBASE and CSV files.
The platform comprises the following products:
- RapidMiner Studio features an intuitive GUI that lets users build analysis processes without writing code. The GUI also makes it easy to filter and manipulate data, build models and visualise results, and, as noted above, it works with many data source types.
- RapidMiner Server supports team sharing and collaboration, and handles heavier analyses through improved performance and scalability.
- RapidMiner Radoop allows non-technical users to run RapidMiner Studio processes inside a Hadoop cluster via a user-friendly GUI; in this way users can leverage SparkR, PySpark, Pig and HiveQL scripts.
- RapidMiner Extensions adds integrations such as R and Python scripting and web mining, and extends the platform’s capabilities in areas including text mining (see the sketch after this list).
- RapidMiner Cloud provides on-demand compute power and easy sharing across devices.
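For the more technical reader, here is a minimal sketch of the kind of script RapidMiner’s Python scripting extension runs through its Execute Python operator. The rm_main entry point and the pandas-DataFrame in/out convention are my assumptions about the extension’s contract, and the column names are invented; the point is simply that code can be dropped into an otherwise code-free process.

```python
# Sketch of a script for RapidMiner's Execute Python operator (Python scripting extension).
# Assumption: the operator hands each incoming example set to rm_main() as a pandas
# DataFrame and converts the returned DataFrame back into an example set.
import pandas as pd


def rm_main(data: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical cleanup: drop rows with missing values.
    cleaned = data.dropna()
    # Hypothetical derived column, added only if the expected columns are present.
    if {"revenue", "cost"}.issubset(cleaned.columns):
        cleaned = cleaned.assign(margin=cleaned["revenue"] - cleaned["cost"])
    return cleaned
```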
2. Trifacta
Trifacta is a data-wrangling platform; that is, it is focused on data preparation.
It is available in three versions:
- Trifacta Wrangler
- Wrangler Pro
- Trifacta Wrangler Enterprise
The platform aims to solve Excel’s shortcomings when it comes to handling large datasets. Boasting a user-friendly GUI, it has a number of interesting features, including advanced chart building, analysis insights and very quick report generation.
It automates machine learning and data visualisation in a way that helps non-technical users work efficiently with data.
3. WEKA
WEKA stands for ‘Waikato Environment for Knowledge Analysis’; the weka is also a flightless, inquisitive bird native to New Zealand.
It features a suite of machine learning algorithms for data mining. The user-friendly GUI makes data preparation, data analysis and data visualisation easier for non-technical users, and since it’s also open source, many data scientists prefer this platform.
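WEKA itself is Java software usually driven through its GUI, but for anyone who does want to script it, here is a minimal sketch using the third-party python-weka-wrapper3 bridge. The class names come from WEKA (weka.classifiers.trees.J48 is its decision-tree learner), but the exact wrapper calls and the iris.arff path are assumptions on my part, so treat this as an illustration rather than a recipe.

```python
# Sketch: cross-validating WEKA's J48 decision tree from Python via the
# python-weka-wrapper3 bridge (assumed API; WEKA itself is a Java application).
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation

jvm.start()  # the wrapper drives WEKA inside a JVM

# Load an ARFF dataset (path is a placeholder) and mark the last column as the class.
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("iris.arff")
data.class_is_last()

# 10-fold cross-validation of the J48 decision tree.
classifier = Classifier(classname="weka.classifiers.trees.J48")
evaluation = Evaluation(data)
evaluation.crossvalidate_model(classifier, data, 10, Random(1))
print(evaluation.summary())

jvm.stop()
```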
4. DataRobot
There’s a dearth of data scientists, and DataRobot’s core mission is to address this problem. According to Crunchbase, ‘DataRobot offers a machine learning platform for data scientists of all skill levels to build and deploy accurate predictive models in a fraction of the time it used to take’. The magic of DataRobot is that it shifts the focus away from programming, statistics and maths skills and lets the user concentrate on the business problem and data collation.
5. MLbase
MLbase is an open source platform that aims to address two critical issues in data science: the difficulty of implementing machine learning algorithms and the difficulty of applying them to large-scale problems. It comprises three components:
MLlib
This is Apache Spark’s distributed machine learning library. Initially developed as a core part of the MLbase project, MLlib is now maintained by the Spark community.
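To give a flavour of MLlib, here is a minimal PySpark sketch that assembles two numeric columns into a feature vector and fits a logistic regression model. The CSV path and column names (age, income, churned) are placeholders I’ve invented for illustration.

```python
# Minimal PySpark sketch of fitting a model with MLlib, Spark's distributed ML library.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Placeholder dataset: a CSV with numeric 'age' and 'income' columns and a 0/1 'churned' label.
df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# MLlib estimators expect the input features gathered into a single vector column.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
train = assembler.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="churned").fit(train)
print(model.coefficients)

spark.stop()
```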
MLI
This is ‘an experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions’ (MLbase).
ML Optimizer
This layer aims to automate the task of ML pipeline construction. The optimizer solves a search problem over the feature extractors and ML algorithms included in MLI and MLlib.
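ML Optimizer itself is part of the MLbase research stack, but the idea is easy to picture with a scikit-learn analogy: define a pipeline of a feature extractor plus a model, then search over their settings and keep the best combination. This is not MLbase code, just a familiar stand-in for the kind of search it automates.

```python
# scikit-learn analogy for the search ML Optimizer automates: try combinations of a
# feature extractor (PCA) and a model (logistic regression) and keep the best pipeline.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([("features", PCA()), ("model", LogisticRegression(max_iter=1000))])
search = GridSearchCV(
    pipeline,
    param_grid={
        "features__n_components": [2, 3],
        "model__C": [0.1, 1.0, 10.0],
    },
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```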
Finally…
These tools may be viewed as a threat to data scientists’ jobs, but I believe they also create more opportunities. As more non-technical people get involved in data science, the market for technical data scientists will keep expanding.