Python provides an ample range of applications across the computational sciences. On the other hand, it remains an individual responsibility to identify the most adequate methods, variables and ethical standards during any form of research. Some sort of Hippocratic Oath should hold the researcher accountable (not to mention the one proposed for data scientists), which means every data point should follow a solid theory or hypothesis that lends itself to adequate triangulation from a qualitative perspective across multiple sciences. This is particularly true when intervening in the social, health and humanitarian sectors, since contextual drivers are interconnected and have unpredictable effect…

A core step in a data science methodology is to model data through a classification, clustering or estimation task. To that end, there is a wide array of methods and algorithms to explore. This blog gives a basic introduction in R and Python to different types of regression analysis, which is an important addition to other approaches presented before (**Neural Networks**, **Decision Trees**, **Bayes Theorem**). **Regression analysis** is a task that helps us to approximate the relationship between different types of predictors and target variables according to a set of statistical considerations.
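As a minimal sketch of the idea (not the worked example from the blog, and with made-up data and variable names), a simple linear regression can be fitted in Python with NumPy alone:

```python
import numpy as np

# Hypothetical data: a single predictor x and a target variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit y = slope * x + intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# R-squared: the share of variance in y explained by the model
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```

The statistical considerations mentioned above (linearity, independence, homoscedasticity of residuals) are what justify reading the fitted slope as an approximation of the predictor–target relationship.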

A core step in a data science methodology is to model data to produce accurate predictions. To that end, there is a wide array of methods and algorithms to explore. This blog gives an introduction in R and Python to an evolving methodology for modelling data, in addition to previous approaches (**Decision Trees**, **Bayes Theorem**), which simulates how the brain functions.

Neural networks in data science represent an attempt to reproduce the non-linear learning that occurs in the networks of neurons found in nature. A neuron consists of dendrites that gather inputs from other **neurons** and combine the input information…
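A single artificial neuron can be sketched in a few lines of Python: it combines its inputs as a weighted sum and passes the result through a non-linear activation (here a sigmoid). The input values and weights below are hypothetical:

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: a weighted sum of inputs plus a bias,
    squashed into (0, 1) by a sigmoid activation."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Three inputs gathered from upstream neurons (made-up values)
output = neuron(inputs=[0.5, 0.1, 0.9], weights=[0.4, -0.6, 0.2], bias=0.1)
print(round(output, 3))
```

The non-linearity of the activation is what lets layers of such units learn relationships a purely linear model cannot.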

Since the most decisive aspect of a data science methodology is to model data to produce estimations and powerful predictions, there is a wide array of methods and algorithms to explore to that end. This blog presents another way to model data, beyond what was explained in the previous blog on **decision trees**, namely the Naïve Bayes classification method.

This classification method derives from **Bayes's theorem**, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. The theorem is named after **Thomas Bayes** (1701-1761), who was an…

After having **prepared the dataframe**, **explored some key relationships**, and **set up the dataframe**, the next step is to model the data. This is probably the most decisive aspect of a data science methodology, since there is a wide variety of methods and algorithms for modelling large data sets.

Algorithm: A procedure for solving a mathematical problem (as of finding the greatest common divisor) in a finite number of steps that frequently involves repetition of an operation
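The definition's own example, finding the greatest common divisor, is a good illustration of "repetition of an operation in a finite number of steps". Euclid's algorithm in Python:

```python
def gcd(a, b):
    """Euclid's algorithm: repeatedly replace the pair (a, b) with
    (b, a mod b) until the remainder is zero; terminates in a
    finite number of steps because the remainder strictly shrinks."""
    while b:
        a, b = b, a % b
    return a

print(gcd(48, 18))  # the pairs visited are (48, 18) -> (18, 12) -> (12, 6) -> (6, 0)
```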

One of the simplest methods to model the data is the decision tree, which consists of a set of decision nodes connected to…
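A decision tree can be hand-written as nested decision nodes to make the structure concrete. The tree below is a hypothetical illustration (a made-up loan-screening rule), not a model from the blog:

```python
# Each decision node tests one feature; each leaf returns a class label.

def classify(record):
    """Tiny hand-built decision tree for a made-up loan example."""
    if record["income"] >= 50_000:          # root decision node
        if record["debt_ratio"] < 0.4:      # child decision node
            return "approve"                # leaf
        return "review"                     # leaf
    return "decline"                        # leaf

print(classify({"income": 60_000, "debt_ratio": 0.2}))
```

In practice the splits are not hand-chosen but learned from data, typically by picking the feature and threshold that best separate the classes at each node.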

After having **prepared the dataframe** and **explored some key relationships** between variables in visual form, the next phase entails a number of critical steps that are necessary before modelling the data. These include: 1) partitioning the data, 2) validating the data partition, 3) balancing the data, and 4) establishing baseline model performance. The combination of these tasks prepares us to move ahead with modelling data in an optimal and more accurate way.
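The four steps above can be sketched in plain Python on made-up labelled records (the 75/25 split, seed, and naive oversampling are illustrative choices, not the blog's exact procedure):

```python
import random
from collections import Counter

# Hypothetical labelled records: 1 = event of interest, 0 = no event
records = [(i, 1 if i % 5 == 0 else 0) for i in range(100)]

# 1) Partition the data: shuffle, then take 75% training / 25% test
random.seed(42)
random.shuffle(records)
cut = int(len(records) * 0.75)
train, test = records[:cut], records[cut:]

# 2) Validate the partition: class proportions should look similar
train_pos = sum(label for _, label in train) / len(train)
test_pos = sum(label for _, label in test) / len(test)

# 3) Balance the data: crude oversampling of the minority class
minority = [r for r in train if r[1] == 1]
balanced_train = train + minority * 3

# 4) Baseline performance: always predict the majority class
majority = Counter(label for _, label in train).most_common(1)[0][0]
baseline_accuracy = sum(label == majority for _, label in test) / len(test)
print(majority, round(baseline_accuracy, 2))
```

Any model we fit later must beat this baseline to justify its complexity, which is the point of computing it before modelling.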

The proposed methodology illustrated in this series of blogs does not rely on statistical inference as a means to generalise from a sample population. The reason to avoid…

After having prepared the data as explained in my **previous blog**, individuals might have an a priori hypothesis that they would like to test (hypothesis testing, HT). In other cases, exploratory data analysis (EDA) is the driver for seeking significant patterns within a dataframe. The approach to exploration can be multifaceted and should lead to visual outputs that describe relationships and set parameters for developing an accurate statistical model. The purpose of data exploration based on visual outputs is explained by the figure below.
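As a small sketch of what such exploration can look like before any plotting (with made-up paired observations, e.g. hours of practice versus an assessment score), summary statistics and a correlation coefficient can flag a relationship worth visualising:

```python
# Made-up paired observations for illustration
hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [52, 55, 61, 64, 70, 71, 78, 83]

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(hours, score)
print(round(r, 3))  # a value near 1 suggests a strong linear relationship
```

A strong coefficient like this would justify a scatter plot and, later, a linear model; a weak one would steer the exploration elsewhere.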

With the rise of data science as an accessible tool for a wide range of computational needs, fluency in multiple languages is an advantage. The following blog seeks to help readers recognise the stylistic differences between the two languages whilst elaborating an approach that mirrors a cycle, starting from understanding the question behind a statistical model through to the deployment of a solution.

There are multiple ways to create graphs in R, and providing examples is useful to familiarise others with snippets that can improve the visual output and add layers of analysis. The example in this blog is a ggplot output that shows labels as percentage values in proportion to a grouping factor. The idea was to output a label that describes a percentage proportionate to each sub-group (in this case Place A and Place B). Some additional aesthetic considerations are also presented to share possible styling ideas for your visual outputs.

`library(dplyr)`

`library(ggplot2)`

dplyr is…

Cultivating an interest in applying data science to international humanitarian work and social sciences.