hop2croft's software development Blog

Prediction APIs

It has been a couple of years since I last wrote here, but today I am adding a short post about prediction APIs. Hopefully this will get me writing again 🙂

This post started with my search for a framework, API, or service that allows updating a model built for a collaborative filtering use case. Most collaborative filtering solutions are based on alternating least squares (ALS), which factorizes the user-item interaction matrix into two lower-dimensional matrices: one of user factors and one of item factors.
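To make the factorization concrete, here is a minimal pure-Python sketch of ALS on explicit ratings. It is an illustration under simplifying assumptions, not the implementation used by Oryx 2 or Spark: the function names (`als`, `update_factors`, `solve`, `loss`), the toy ratings and the hyperparameters are all made up for the example.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def update_factors(ratings, fixed, n_rows, k, lam, by_user):
    """Recompute one side's factors while the other side is held fixed."""
    result = []
    for idx in range(n_rows):
        # Normal equations: (sum of y y^T + lam * I) x = sum of r * y
        a = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for (u, i), r in ratings.items():
            if (u if by_user else i) != idx:
                continue
            y = fixed[i if by_user else u]
            for p in range(k):
                b[p] += r * y[p]
                for q in range(k):
                    a[p][q] += y[p] * y[q]
        result.append(solve(a, b))
    return result

def als(ratings, n_users, n_items, k=2, lam=0.1, iterations=20, seed=0):
    """Alternate between solving for user factors and item factors."""
    rng = random.Random(seed)
    users = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    items = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    for _ in range(iterations):
        users = update_factors(ratings, items, n_users, k, lam, by_user=True)
        items = update_factors(ratings, users, n_items, k, lam, by_user=False)
    return users, items

def predict(users, items, u, i):
    """Predicted score is the dot product of the two factor vectors."""
    return sum(x * y for x, y in zip(users[u], items[i]))

def loss(ratings, users, items, lam):
    """Regularized squared error that each ALS half-step minimizes."""
    err = sum((r - predict(users, items, u, i)) ** 2
              for (u, i), r in ratings.items())
    reg = sum(x * x for vec in users + items for x in vec)
    return err + lam * reg

# Toy data: (user index, item index) -> rating
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 2.0, (2, 2): 5.0}
users, items = als(ratings, n_users=3, n_items=3)
print(round(predict(users, items, 0, 2), 2))  # score for an unseen pair
```

Each half-step is an ordinary regularized least-squares problem, which is why the regularized loss never increases from one iteration to the next.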

I have found only one option, Oryx 2 (if anyone knows of another, please leave a comment ;). Oryx 2 implements a lambda architecture using Apache Spark and Apache Kafka. The Oryx documentation defines its layers as follows:

A Batch Layer, which computes a new “result” (think model, but, could be anything) as a function of all historical data, and the previous result. This may be a long-running operation which takes hours, and runs a few times a day for example.
A Speed Layer, which produces and publishes incremental model updates from a stream of new data. These updates are intended to happen on the order of seconds.
A Serving Layer, which receives models and updates and implements a synchronous API exposing query operations on the result.
A data transport layer, which moves data between layers and receives input from external sources
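The division of labour between the layers can be sketched with a toy example. This is only an illustration of the pattern, assuming a trivial "model" (item popularity counts) and plain function calls in place of the Kafka-based transport; the names `batch_layer`, `speed_layer` and `ServingLayer` are hypothetical and are not Oryx 2 classes.

```python
from collections import Counter

def batch_layer(history):
    """Recompute the full result from all historical events (slow, infrequent)."""
    return Counter(item for _user, item in history)

def speed_layer(new_events):
    """Turn a small stream of new events into incremental updates (fast)."""
    for _user, item in new_events:
        yield (item, 1)

class ServingLayer:
    """Keeps the latest model in memory and answers queries synchronously."""

    def __init__(self):
        self.model = Counter()

    def load_model(self, model):
        # A new batch model replaces whatever was served before.
        self.model = Counter(model)

    def apply_update(self, update):
        # An incremental update from the speed layer adjusts the model in place.
        item, delta = update
        self.model[item] += delta

    def most_popular(self, how_many=1):
        return [item for item, _count in self.model.most_common(how_many)]

history = [("u1", "a"), ("u2", "a"), ("u3", "b")]
serving = ServingLayer()
serving.load_model(batch_layer(history))
for update in speed_layer([("u4", "b"), ("u5", "b")]):
    serving.apply_update(update)
print(serving.most_popular())
```

The point of the split is that the expensive recomputation and the cheap incremental updates both feed the same in-memory model, so queries always see the freshest result available.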

[Figure: Oryx 2 architecture]

Regarding collaborative filtering, Oryx 2 has an ALS implementation that can update the user-item model online and in memory. The implementation uses linear algebra to apply updates to the model as new data arrives. In addition, Oryx 2 adds a damping factor to the algorithm, which controls how strongly new data affects the model. Once the model has been created or updated, there is a nice REST API to query the model held in memory.
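The idea of a damped online update can be illustrated with a small sketch. This is not Oryx 2's actual update rule; it is a simplified stand-in that assumes the item vector stays fixed and nudges the user vector by the smallest change that moves the predicted score toward the new observation. The names `fold_in` and `damping` and the numbers are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fold_in(user_vec, item_vec, target, damping=1.0):
    """Move user_vec so its predicted score for item_vec approaches target.

    damping=1.0 moves the prediction all the way to target;
    damping=0.0 ignores the new data point entirely.
    """
    norm_sq = dot(item_vec, item_vec)
    if norm_sq == 0.0:
        return list(user_vec)  # a zero item vector carries no information
    step = damping * (target - dot(user_vec, item_vec)) / norm_sq
    return [u + step * y for u, y in zip(user_vec, item_vec)]

user = [0.5, 1.0]
item = [1.0, 2.0]
half = fold_in(user, item, target=4.5, damping=0.5)
full = fold_in(user, item, target=4.5, damping=1.0)
print(dot(user, item), dot(half, item), dot(full, item))
```

With these numbers the prediction starts at 2.5; a damping of 0.5 moves it halfway to the target and a damping of 1.0 moves it all the way (up to floating-point rounding).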

To run the ALS implementation, the three layers defined in Oryx 2 must be started first. One problem I ran into is that some scripts expect the libraries and config files to be in the same directory as the script. After copying them there, everything worked nicely. The GitHub project already includes a configuration file for collaborative filtering and a list of the REST API endpoints for querying the data or asking for recommendations.
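Querying the serving layer from a client could look roughly like the sketch below. The host, port and the CSV response format (`id,score` per line) are assumptions for illustration; the `/recommend/{userID}` path and the `howMany` parameter follow the Oryx 2 endpoint list, but should be checked against the version you run. The helper names are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

def recommend_url(base_url, user_id, how_many=10):
    """Build the URL for a recommendation query (path per the Oryx 2 endpoint list)."""
    path = "/recommend/" + quote(str(user_id), safe="")
    return base_url.rstrip("/") + path + "?" + urlencode({"howMany": how_many})

def parse_csv_scores(body):
    """Parse lines of 'id,score' into (id, float) pairs (assumed response format)."""
    pairs = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        item_id, score = line.rsplit(",", 1)
        pairs.append((item_id, float(score)))
    return pairs

def fetch_recommendations(base_url, user_id, how_many=10, timeout=5.0):
    """Call a running serving layer; needs the three layers to be up."""
    url = recommend_url(base_url, user_id, how_many)
    with urlopen(url, timeout=timeout) as resp:
        return parse_csv_scores(resp.read().decode("utf-8"))

# Assumed local serving layer address, for illustration only.
print(recommend_url("http://localhost:8080", "user 1", how_many=3))
```

Only `recommend_url` and `parse_csv_scores` run without a live server; `fetch_recommendations` is there to show how the two fit together.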

I really liked the project, and it is great that the code is open source. On the other hand, I think there is not much documentation (especially compared with other projects out there), and the community is not very large, although Cloudera is behind the project.

Other prediction projects

PredictionIO. An open-source machine learning server for developers and data scientists to create predictive engines for production environments, with zero downtime training and deployment. It is built on top of Apache Spark, HBase and Spray. It has a use case gallery where one can download a predefined template and run the examples easily. I really liked the project because it has implementations for the most common use cases and the documentation is great. I also posted some questions in the GitHub project and got answers very quickly. On the other hand, parts of the project depend on fixed components such as HBase or Elasticsearch, which makes it less flexible.
Seldon.io. An enterprise and open-source machine learning project for building an end-to-end prediction engine (ingest data, create models and visualize the results). The documentation and example cases are great. Models are created using Spark, as in Oryx 2, and it has a very modular and flexible design. It also provides a prebuilt VM to get started with the project. Both PredictionIO and Seldon.io have very lively GitHub projects.
BigML. It offers machine learning libraries and also provides these services in the cloud. It is probably one of the oldest offerings on the market, yet it is still actively updated. In my opinion it has the best documentation, and its visual interface is really great.

Unfortunately, as far as I know, none of the previous projects supports updating models in a collaborative filtering project. If anyone knows of one, please leave a comment 🙂

Posted by hop2croft on 16 January 2016 in Big Data

Tags: bigml, machine learning, oryx2, predicition, predictionio, seldonio

Introduction to Apache Oozie

13 Sep

As part of the series of posts about Hadoop and the ecosystem of libraries that has grown up around it, today we are going to look at Apache Oozie. Apache Oozie is a library that lets us define an execution sequence of Hadoop jobs. With Oozie, a workflow is defined in a configuration file; it describes the sequence in which the Hadoop tasks we specify will run. We can also define what to do depending on whether the tasks finish successfully or not. This configuration file is an XML file.

Posted by hop2croft on 13 September 2013 in Big Data, Hadoop

Tags: Big Data, Hadoop, Oozie

Apache Flume and Apache Sqoop

01 Sep

In this post we are going to talk about two libraries related to handling large volumes of data, Apache Flume and Apache Sqoop. Although the two libraries take quite different approaches, their end goal is the same: both serve as data ingestion mechanisms during the initial data acquisition phase, as already noted in the previous post, Phases in Big Data and their relation to Hadoop libraries.

First we will look at Flume, then Sqoop, and we will finish with a brief comparison of the two.

Posted by hop2croft on 1 September 2013 in Hadoop

Tags: Flume, Hadoop, HBase, Hive, Sqoop

Introduction to Hive

29 Aug

The first Hadoop-related library we are going to talk about on this blog is Apache Hive. From the official Apache Hive project website:

Hive is a data warehouse system that makes it easy to manage data, run ad-hoc queries, and analyze large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to give structure to the data and to query it with a SQL-like language called HiveQL. At the same time, this language also allows Map/Reduce programmers to plug in their own mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

These same posts can also be read on my other blog, java4developers.com

Posted by hop2croft on 29 August 2013 in Big Data, Hadoop

Tags: Big Data, Hadoop, Hive

Phases in Big Data and Hadoop libraries

28 Aug

In the previous posts we saw a brief introduction to Big Data and to Hadoop, a library for handling large volumes of data. We also covered the basics of Hadoop, in particular MapReduce and the distributed file system HDFS. If you like, you can take a look by clicking on any of the following links:

As a reminder, these same posts can also be read on my other blog, java4developers.com

The main purpose of this post is to relate the phases that exist in Big Data when processing data to the frameworks and libraries that have been developed within the Hadoop ecosystem and that run during those phases. In this post I want to focus more on the Big Data side, and perhaps later dedicate a longer post to some of the most widely used libraries that have emerged under the Hadoop umbrella.

Posted by hop2croft on 28 August 2013 in Big Data

Tags: Big Data, Data Mining, Hadoop, HDFS, MapReduce
