Material Big Data

Informational slide decks released on Big Data technologies: Hadoop, HBase, Hive, ZooKeeper...

Sign up for the Master in Big Data and Machine Learning in Peru!

Multidimensional design, Big Data, ETL, visualization, open source, Machine Learning

Pentaho Analytics. A great leap

Pentaho 8 has been released, and with big surprises. Discover with us the improvements to the best Open BI suite

The best offer of Open Source Courses

After the great reception of our eminently practical Open Source Courses, we are launching the 2018 sessions

19 Feb. 2018

If you are in Peru: Big Data, Machine Learning & Business Intelligence program

This interesting course is one of the first engagements in Peru for Stratebi, a company specializing in analytics. There is great interest in these technologies in Peru, and some interesting projects are already under way


Upon completing the program, students will be able to:

  • Evaluate the fundamentals and concepts that govern Data Science, Big Data, Machine Learning & Business Intelligence technologies.
  • Develop Business Intelligence solutions through Big Data applications using Pentaho.
  • Develop Business Intelligence solutions through Machine Learning applications using Python, Apache Mahout, Spark and MLlib.
  • Develop dashboards and Data Visualization and Data Discovery solutions.
  • Evaluate the quality of IT & Data Science projects focused on Business Intelligence.
  • Manage Data Science, Big Data, Machine Learning & Business Intelligence projects.
  • Apply the most advanced IT & Data Science tools to create structured BI solutions focused on science, engineering and business.

Aimed at:

Information technology professionals, IT managers, business analysts, systems analysts, Java architects, systems developers, database administrators, developers, and professionals connected to the technology, marketing, business and finance areas.

18 Feb. 2018

Dear Data, art in visualization

We recommend this great initiative, Dear Data, by Giorgia Lupi and Stefanie Posavec

It is a collaborative book, built around an exchange of letters, that turns images into art and elegance. Highly recommended!

17 Feb. 2018

New Open Source free Analysis tool in Pentaho Marketplace

Hi, Pentaho Community fans: the STPivot4 OLAP Viewer is now available for download in the Pentaho Marketplace, compiled and ready to use.

STPivot4 is based on the old Pivot4J project, with functionality added, improved and extended. Its technical features are listed below.

GitHub STPivot4
For additional information, you may visit STPivot4 Project page at

Main Features:
  • STPivot4 is a Pentaho plugin for visualizing OLAP cubes.
  • Deploys as a Pentaho plugin
  • Supports Mondrian 4!
  • Improves the Pentaho user experience.
  • Intuitive UI with drag and drop for Measures, Dimensions and Filters
  • Adds key features to the Pentaho OLAP viewer, replacing JPivot.
  • Easy multi-level member selection.
  • Advanced and function-based member selection (Limit, Ranking, Filter, Order).
  • Lets users create "on the fly" formulas and calculations using
  • Non-MDX grand totals (min, max, avg and sum) per member, hierarchy or axis.
  • New user-friendly Selector Area
  • and more…

13 Feb. 2018

Benchmarking 20 Machine Learning Models Accuracy and Speed

As Machine Learning tools become mainstream and an ever-growing choice of them is available to data scientists and analysts, assessing which are best suited to a task becomes challenging. In this study, 20 Machine Learning models were benchmarked for accuracy and speed on multi-core hardware, applied to 2 multinomial datasets that differ broadly in size and complexity.

See Study

It was observed that BAG-CART, RF and BOOST-C50 top the list at more than 99% accuracy while NNET, PART, GBM, SVM and C45 exceeded 95% accuracy on the small Car Evaluation dataset
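
As a rough illustration of what benchmarking models for accuracy and speed involves, here is a minimal sketch in plain Python. It is not the study's code: the dataset is synthetic, and the two toy models (`majority_model` and `nearest_neighbor_model`) and the `benchmark` helper are hypothetical names standing in for the 20 models compared.

```python
import time
from collections import Counter

# Tiny synthetic dataset (hypothetical): one numeric feature, two classes.
train = [(x, "low" if x < 50 else "high") for x in range(0, 100, 2)]
test = [(x, "low" if x < 50 else "high") for x in range(1, 100, 2)]

def majority_model(training):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda x: most_common

def nearest_neighbor_model(training):
    """1-NN: predict the label of the closest training point."""
    return lambda x: min(training, key=lambda pair: abs(pair[0] - x))[1]

def benchmark(name, build, training, testing):
    """Time training plus prediction and measure accuracy on the test set."""
    start = time.perf_counter()
    model = build(training)
    correct = sum(model(x) == label for x, label in testing)
    elapsed = time.perf_counter() - start
    return {"model": name, "accuracy": correct / len(testing), "seconds": elapsed}

results = [
    benchmark("majority", majority_model, train, test),
    benchmark("1-NN", nearest_neighbor_model, train, test),
]
# Rank models by accuracy, best first.
for row in sorted(results, key=lambda r: -r["accuracy"]):
    print(row)
```

A real benchmark would repeat each run several times and use held-out data from the actual datasets; the point here is only that accuracy and elapsed time are recorded side by side for each model.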

Seen on RPubs

10 Feb. 2018

How to create Web Dashboards from Excel

Now you can create powerful dashboards from Excel for end users without a single line of code, in just seconds, with STAgile, an open source based solution with no license fees.

The best tool for non-technical end users.

The modules available in LinceBI are the right solution if you don't want to pay for licenses and you need professional support

Besides, you have 'predefined industry oriented solutions', with a lot of KPIs, Dashboards, reports...

You can use STAgile standalone or embedded in your web application

8 Feb. 2018

Comparison of Kettle (Pentaho Data Integration) and Talend

A few days ago we told you that ETL is crucial, and today we bring you a comparison of the two best open source ETL tools (Pentaho's Kettle and Talend). It is no longer a stretch to say that they are becoming the best tools overall, especially when we weigh cost and the ability to integrate and modify them against Informatica PowerCenter, Oracle, Microsoft or IBM

Both Kettle and Talend are great, highly visual tools that let us integrate all kinds of sources, including Big Data, to carry out all kinds of transformations and integration projects, or to prepare powerful analytical environments, also with open source solutions, as you can see in this Online Demo, where Kettle and Talend were used in the backend

Download the comparison from Excella

5 Feb. 2018

A glossary of the 7 main Machine Learning terms

Machine learning

Machine learning is the process through which a computer learns with experience rather than additional programming.
Let’s say you use a program to determine which customers receive which discount offers. If it’s a machine-learning program, it will make better recommendations as it gets more data about how customers respond. The system gets better at its task by seeing more data.


Algorithm

An algorithm is a set of specific mathematical or operational steps used to solve a problem or accomplish a task.
In the context of machine learning, an algorithm transforms or analyzes data. That could mean:
• performing regression analysis—“based on previous experiments, every $10 we spend on advertising should yield $14 in revenue”
• classifying customers—“this site visitor’s clickstream suggests that he’s a stay-at-home dad”
• finding relationships between SKUs—“people who bought these two books are very likely to buy this third title”
Each of these analytical tasks would require a different algorithm.
When you put a big data set through an algorithm, the output is typically a model.
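
To make the step from algorithm to model concrete, here is a minimal sketch of the regression case. The algorithm is ordinary least squares; the model it outputs is just a slope and an intercept. The ad spend and revenue figures are made up for illustration, and `fit_line` and `predict` are hypothetical helper names.

```python
# Hypothetical experiment data: ad dollars spent and revenue generated.
ad_spend = [10.0, 20.0, 30.0, 40.0]
revenue = [14.0, 28.0, 42.0, 56.0]

def fit_line(xs, ys):
    """The algorithm: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The model: two numbers produced by running the algorithm on the data.
slope, intercept = fit_line(ad_spend, revenue)

def predict(spend):
    """Use the fitted model to predict revenue for a given ad spend."""
    return slope * spend + intercept

print(f"predicted revenue for $10 of advertising: ${predict(10.0):.2f}")
```

Classification or finding relationships between SKUs would use a different algorithm and produce a different kind of model, but the pattern of "data in, model out" is the same.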


Model

The simplest definition of a model is a mathematical representation of relationships in a data set.
A slightly expanded definition: “a simplified, mathematically formalized way to approximate reality (i.e. what generates your data) and optionally to make predictions from this approximation.”
Here’s a visualization of a really simple model, based on only two variables.
The blue dots are the inputs (i.e. the data), and the red line represents the model.

I can use this model to make predictions. If I put any “ad dollars spent” amount into the model, it will yield a predicted “revenue generated” amount.
Two key things to understand about models:
1. Models get complicated. The model illustrated here is simple because the data is simple. If your data is more complex, the predictive model will be more complex; it likely wouldn’t be portrayed on a two-axis graph.
When you speak to your smartphone, for example, it turns your speech into data and runs that data through a model in order to recognize it. That’s right, Siri uses a speech recognition model to determine meaning.
Complex models underscore why machine-learning algorithms are necessary: You can use them to identify relationships you would never be able to catch by “eyeballing” the data.
2. Models aren’t magic. They can be inaccurate or plain old wrong for many reasons. Maybe I chose the wrong algorithm to generate the model above. See the line bending down, as you pass our last actual data point (blue dot)? It indicates that this model predicts that past that point, additional ad spending will generate less overall revenue. This might be true, but it certainly seems counterintuitive. That should draw some attention from the marketing and data science teams.
A different algorithm might yield a model that predicts diminishing incremental returns, which is quite different from lower revenue.
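
The difference between "less overall revenue" and "diminishing incremental returns" can be sketched with two model shapes. The coefficients below are hypothetical and hand-picked, not fitted to the data described above; `quadratic_model` and `log_model` are illustrative names only.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
def quadratic_model(spend):
    """Curve that bends down: total revenue falls once spend passes the peak."""
    return 3.0 * spend - 0.01 * spend ** 2

def log_model(spend):
    """Diminishing incremental returns: revenue keeps rising, but ever more slowly."""
    return 60.0 * math.log1p(spend)

# Vertex of the parabola: beyond this spend the quadratic model predicts less revenue.
peak_spend = 3.0 / (2 * 0.01)

for spend in (100, 150, 200, 250):
    print(spend, round(quadratic_model(spend), 1), round(log_model(spend), 1))
```

With these numbers the quadratic model peaks at a spend of about 150 and then declines, while the logarithmic model never declines; each extra dollar simply adds less than the one before.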


Feature

Wikipedia’s definition of a feature is good: “an individual measurable property of a phenomenon being observed. Choosing informative, discriminating, and independent features is a crucial step for effective algorithms.”
So features are elements or dimensions of your data set.
Let’s say you are analyzing data about customer behavior. Which features have predictive value for the others? Features in this type of data set might include demographics such as age, location, job status, or title, and behaviors such as previous purchases, email newsletter subscriptions, or various dimensions of website engagement.
You can probably make intelligent guesses about the features that matter to help a data scientist narrow her work. On the other hand, she might analyze the data and find “informative, discriminating, and independent features” that surprise you.
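
In code, features usually end up as a numeric vector per record. The sketch below assumes a hypothetical customer record; the field names, the `LOCATIONS` list and `to_feature_vector` are illustrative, not taken from any real data set.

```python
# Hypothetical customer record; field names are illustrative only.
customer = {"age": 34, "location": "Lima", "newsletter": True, "previous_purchases": 5}

# Assumed list of known locations, used to one-hot encode the categorical field.
LOCATIONS = ["Lima", "Madrid", "Bogota"]

def to_feature_vector(record):
    """Turn one record into a list of numbers an algorithm can consume."""
    one_hot = [1.0 if record["location"] == loc else 0.0 for loc in LOCATIONS]
    numeric = [
        float(record["age"]),
        float(record["newsletter"]),          # True -> 1.0, False -> 0.0
        float(record["previous_purchases"]),
    ]
    return numeric + one_hot

print(to_feature_vector(customer))
```

Each position in the vector is one feature; deciding which positions to include is the feature selection step the definition above calls crucial.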

Supervised vs. unsupervised learning

Machine learning can take two fundamental approaches.
Supervised learning is a way of teaching an algorithm how to do its job when you already have a set of data for which you know “the answer.”
Classic example: To create a model that can recognize cat pictures via a supervised learning process, you would show the system millions of pictures already labeled “cat” or “not cat.”
Marketing example: You could use a supervised learning algorithm to classify customers according to six personas, training the system with existing customer data that is already labeled by persona.
Unsupervised learning is how an algorithm or system analyzes data that isn’t labeled with an answer, then identifies patterns or correlations.
An unsupervised-learning algorithm might analyze a big customer data set and produce results indicating that you have 7 major groups or 12 small groups. Then you and your data scientist might need to analyze those results to figure out what defines each group and what it means for your business.
In practice, most model building uses a combination of supervised and unsupervised learning, says Doyle.

“Frequently, I start by sketching my expected model structure before reviewing the unsupervised machine-learning result,” he says. “Comparing the gaps between these models often leads to valuable insights.”
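
The two approaches can be contrasted in a few lines. This is a minimal sketch with made-up monthly spend figures: the supervised part is a nearest-centroid classifier trained on labeled examples, and the unsupervised part is a one-dimensional k-means that receives no labels at all. All names and data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

# Supervised: each example comes with "the answer" (a persona label).
labeled = [(10.0, "casual"), (12.0, "casual"), (95.0, "loyal"), (105.0, "loyal")]

def train_centroids(examples):
    """Learn one average value per label from the labeled examples."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in by_label.items()}

def classify(centroids, value):
    """Predict the label whose centroid is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

centroids = train_centroids(labeled)
print(classify(centroids, 80.0))

# Unsupervised: the same kind of data without labels; k-means finds the groups.
unlabeled = [9.0, 11.0, 13.0, 90.0, 100.0, 110.0]

def kmeans_1d(values, k=2, iterations=10):
    """Simple k-means for k >= 2 on one-dimensional data."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(centers[i] - v))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(unlabeled)
print(centers, clusters)
```

The k-means output only says that two groups exist and where they sit; naming them and deciding what they mean for the business is still left to the analyst.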

Deep learning

Deep learning is a type of machine learning. Deep-learning systems use multiple layers of calculation, and the later layers abstract higher-level features. In the cat-recognition example, the first layer might simply look for a set of lines that could outline a figure. Subsequent layers might look for elements that look like fur, or eyes, or a full face.

Compared to a classical computer program, this is somewhat more like the way the human brain works, and you will often see deep learning associated with neural networks, which refers to a combination of hardware and software that can perform brain-style calculation.

It’s most logical to use deep learning on very large, complex problems. Recommendation engines (think Netflix or Amazon) commonly use deep learning.
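
The idea of stacked layers can be sketched as a tiny forward pass. The weights below are hypothetical and hand-picked; a real deep-learning system learns them from data and has far more layers and units. `dense`, `relu` and `forward` are illustrative names.

```python
def relu(values):
    """Non-linearity applied between layers: negative values become zero."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hypothetical, hand-picked weights; a real network learns these during training.
W1 = [[1.0, -1.0], [0.5, 0.5]]   # first layer: two simple detectors
B1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]                # second layer: combines the detectors
B2 = [-0.5]

def forward(inputs):
    hidden = relu(dense(inputs, W1, B1))   # earlier layer: low-level features
    return dense(hidden, W2, B2)           # later layer: higher-level combination

print(forward([1.0, 0.0]))
print(forward([0.0, 1.0]))
```

Each layer only sees the output of the previous one, which is how later layers come to represent more abstract features than earlier ones.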

Seen on Huffington Post

1 Feb. 2018

30 years of the Data Warehouse

It is now exactly 30 years since Barry Devlin published the first article describing the architecture of a Data Warehouse

Download the historic article

Original publication: “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Volume 27, Number 1, Page 60 (February, 1988)

31 Jan. 2018

A Wikipedia for data visualization

If you are ever unsure which chart type is best for a given occasion, take a look at the Data Viz Project, which explains more than 150 charts and how best to use them and get the most out of them.

One of the best parts of the site is where it shows real examples of the practical application of each chart:

29 Jan. 2018

PowerBI working together with the best open source solutions

Here you can see a nice sample combining PowerBI with open source based Business Intelligence solutions, like LinceBI, in order to provide the most complete BI solution at an affordable cost

- Predefined Dashboards
- Adhoc Reporting
- OLAP Analysis
- Adhoc Dashboarding
- Scorecards

More info:
- PowerBI functionalities
- PowerBI training