It has been a couple of years since I last wrote here, but today I am adding a short new post about prediction APIs. Hopefully I will start writing regularly again 🙂
The initial idea for this post came from looking for a framework, API or service that allows updating a model built for a collaborative filtering use case. Most collaborative filtering solutions are based on alternating least squares (ALS), which basically approximates the user-item interaction matrix as the product of two much smaller matrices of user factors and item factors.
I have found only one option for this, Oryx 2 (if anyone knows another, please comment ;). Oryx 2 implements a lambda architecture using Apache Spark and Apache Kafka. In the Oryx documentation, its layers are defined as:
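To make the ALS idea a bit more concrete, here is a minimal sketch with NumPy for explicit ratings. It is not the code of any of the projects mentioned in this post. The function names (`als`, `loss`), the toy matrix and the parameters `k`, `lam` and `iters` are my own assumptions for illustration. The user factors and item factors are solved in turn, each time holding the other side fixed.

```python
import numpy as np


def loss(R, mask, X, Y, lam):
    """Regularized squared error over the observed entries only."""
    err = (R - X @ Y.T)[mask]
    return float(err @ err + lam * ((X ** 2).sum() + (Y ** 2).sum()))


def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """Toy ALS: factorize R (users x items) into X (users x k) and Y (items x k).

    `mask` is a boolean array marking which ratings are observed.
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = rng.normal(scale=0.1, size=(n_users, k))
    Y = rng.normal(scale=0.1, size=(n_items, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        # Fix the item factors and solve a small ridge regression per user.
        for u in range(n_users):
            idx = mask[u]
            if idx.any():
                Yu = Y[idx]
                X[u] = np.linalg.solve(Yu.T @ Yu + reg, Yu.T @ R[u, idx])
        # Fix the user factors and solve a small ridge regression per item.
        for i in range(n_items):
            idx = mask[:, i]
            if idx.any():
                Xi = X[idx]
                Y[i] = np.linalg.solve(Xi.T @ Xi + reg, Xi.T @ R[idx, i])
    return X, Y


# Toy ratings matrix; 0 means "not rated" in this example.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
mask = R > 0
X, Y = als(R, mask)
predicted = X @ Y.T  # estimated ratings, including the unobserved cells
```

The unobserved cells of `predicted` are what a recommender would rank to suggest new items to each user.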
- A Batch Layer, which computes a new “result” (think model, but, could be anything) as a function of all historical data, and the previous result. This may be a long-running operation which takes hours, and runs a few times a day for example.
- A Speed Layer, which produces and publishes incremental model updates from a stream of new data. These updates are intended to happen on the order of seconds.
- A Serving Layer, which receives models and updates and implements a synchronous API exposing query operations on the result.
- A data transport layer, which moves data between layers and receives input from external sources
Figure: the Oryx 2 architecture.
Regarding collaborative filtering, Oryx 2 has an ALS implementation that can update the model relating users to items online and in memory. The implementation uses linear algebra to apply updates to the model as new data arrives. In addition, Oryx 2 adds a damping factor to the algorithm that lets you control how strongly new data affects the model. Once the model has been created or updated, there is a nice REST API to query the model stored in memory.
To run the ALS implementation, the three layers defined in Oryx 2 must be started first. One problem I had is that some scripts require the libraries and config files to be in the same directory as the script, but after copying them there, everything worked nicely. The GitHub project already includes a configuration file for collaborative filtering and a list of the REST API endpoints for querying the data or asking for recommendations.
I really liked the project, and it is great that the code is open source. On the other hand, I think there is not a lot of documentation (especially compared with other projects out there) and the community is not very large, although a vendor, Cloudera, is behind the project.
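As a rough illustration of what an online update with a damping factor can look like, here is a generic "fold-in" sketch in NumPy. This is not Oryx 2's actual implementation, which I have not reproduced here. The function `fold_in_user` and the parameters `lam` and `alpha` are hypothetical names of my own. The idea is to keep the item factors fixed, re-solve one user's vector from that user's ratings, and then blend the old and new vectors so that `alpha` controls how much the new data moves the model.

```python
import numpy as np


def fold_in_user(x_old, Y, item_ids, ratings, lam=0.1, alpha=0.5):
    """Update one user's factor vector without retraining the whole model.

    x_old    : current user vector, shape (k,)
    Y        : item factor matrix, shape (n_items, k), kept fixed
    item_ids : indices of the items this user has rated
    ratings  : the corresponding ratings
    lam      : ridge regularization strength (assumed value)
    alpha    : hypothetical damping weight in [0, 1]; 0 ignores new data,
               1 fully replaces the old vector with the re-solved one
    """
    Yu = Y[item_ids]
    k = Y.shape[1]
    target = np.asarray(ratings, dtype=float)
    # Least-squares solve for this user against the fixed item factors.
    x_solved = np.linalg.solve(Yu.T @ Yu + lam * np.eye(k), Yu.T @ target)
    # Damped blend between the previous vector and the re-solved one.
    return (1.0 - alpha) * x_old + alpha * x_solved


# Toy item factors (3 items, 2 latent dimensions) and a user with no history.
Y_items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_user = np.zeros(2)

# A new rating arrives: the user gives item 0 a rating of 1.0.
x_user = fold_in_user(x_user, Y_items, item_ids=[0], ratings=[1.0], lam=1.0)
scores = Y_items @ x_user  # updated scores for every item
```

Only one small linear system is solved per update, which is why this kind of update can run in seconds while the full retraining stays in a batch job.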
Other prediction projects
- PredictionIO. An open-source machine learning server for developers and data scientists to create predictive engines for production environments, with zero downtime training and deployment. It is built on top of Apache Spark, HBase and Spray. It has a use case gallery where one can download a predefined template and run the examples easily. I really liked the project because it has implementations for the most common use cases and the documentation is great. I also posted some questions on the GitHub project and got answers very quickly. On the other hand, some of its projects depend on fixed components such as HBase or ElasticSearch, which makes them less flexible.
- Seldon.io. It is an enterprise and open source machine learning project that lets you create an end-to-end prediction engine (ingest data, create models and visualize the results). The documentation and example cases are great. Models are created using Spark, as in Oryx 2, and it has a very modular and flexible design. It also provides a prebuilt VM to get started with the project. Both PredictionIO and Seldon.io have very lively GitHub projects.
- BigML. It lets you use machine learning libraries and also offers these services in the cloud. It is probably one of the oldest offerings in the market, while still being kept up to date. In my opinion it has the best documentation, and its visual interface is really great.
Unfortunately, as far as I know, none of the previous projects supports updating models in a collaborative filtering project. If someone knows of any, please comment 🙂