This actually took me longer than I'd like to admit to get working, but in the end the solution is quite neat and simple, so it was probably worth it. Hopefully this saves other people some time.
The Amazon Dockerfile looks like this:
AWS Elastic Beanstalk Dockerfile - Github
This installs the contents of the root folder's requirements.txt before running your Dockerfile, so for my application the basic "non-sci" packages could be installed simply enough.
Root Folder: requirements.txt:
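The original file isn't shown here, so purely as an illustration, a root requirements.txt for the "non-sci" packages might look something like the following (the package names are placeholders, not the actual list used):

flask
requests
boto
gunicorn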
Then, to install the sci-related packages (numpy, scipy, pandas, scikit-learn and nltk), I created another requirements.txt in an aws-post-install folder. This is run once the Amazon Linux OS has been updated and all the required OS dependencies have been installed.
Post Docker requirements.txt:
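Based on the packages named above, the post-install requirements.txt contains something like this (pinned versions omitted; add them as needed):

numpy
scipy
pandas
scikit-learn
nltk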
My custom Dockerfile, which builds on top of the Amazon image, looked as follows:
Docker File:
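The original Dockerfile isn't included in this post, so the following is only a rough sketch of the approach described above; the base image name, the yum packages and the pip invocation are all assumptions, not the file that was actually used:

# Sketch only: base image and OS package names are assumptions.
FROM amazonlinux:latest

# Update the OS and install the build dependencies the scientific stack needs.
RUN yum update -y && \
    yum install -y gcc gcc-c++ gcc-gfortran python-devel python-pip \
    atlas-devel blas-devel lapack-devel

# Install the sci-related packages once the OS dependencies are in place.
COPY aws-post-install/requirements.txt /tmp/aws-post-install/requirements.txt
RUN pip install -r /tmp/aws-post-install/requirements.txt

EXPOSE 80
CMD ["python", "application.py"]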
The next step is to get my Docker image used directly, so that the Elastic Beanstalk app doesn't have to do all the downloads and installs every time; this should be simple enough according to the AWS YouTube channel:
https://www.youtube.com/watch?v=pLw6MLqwmew
Tuesday, December 23, 2014
Tuesday, August 26, 2014
Why Jython when you can microservice with Flask
Over the last little while I have been working on Sibbly, my little pet project to try to summarize, group, filter and target software development information on the web. All in all a rather ambitious task, but the worst thing that could happen is that I learn something, so there is really no risk.
It is still currently in a very closed beta, only occasionally showing it to fellow work colleagues and getting some input.
After initially starting development for Sibbly on Ubuntu (I was always planning on deploying on Ubuntu), I migrated back to Windows, and after a couple of weeks of work, when finally deploying to Ubuntu... surprise! It obviously didn't work right off the bat.
The issue I ended up with seems to be a classpath conflict between Spring Boot, its embedded Tomcat instance and Jython. The reason I use Jython is an awesome library called Pygments.
So after much dismay, and after checking all the Java alternatives and attempted Pygments ports (jygments, jgments), I started thinking of alternative solutions.
Having recently read Microservices, I decided to look at a way of interacting with Python more indirectly.
This led me to: Flask
Within a couple of minutes, thanks to: Awesome Flask Example
I had the following up and running:
from flask import Flask, jsonify
from flask import abort
from flask import make_response
from flask import request
from pygments import highlight
from pygments.formatters.html import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.lexers import guess_lexer

app = Flask(__name__)


@app.route('/pygmentCode', methods=['POST'])
def pygment_code():
    if not request.json:
        abort(400)
    lexer = get_lexer_by_name(request.json['lexer'], stripall=True)
    formatter = HtmlFormatter(linenos=False)
    result = highlight(request.json['code'], lexer, formatter)
    return jsonify({'result': result}), 201


@app.route('/guessCode', methods=['POST'])
def guess_code():
    if not request.json:
        abort(400)
    lexer = guess_lexer(request.json['code'])
    print(lexer.name)
    return jsonify({'result': lexer.name}), 201


@app.errorhandler(404)
def not_found(error):
    return make_response(jsonify({'error': 'Not found'}), 404)


if __name__ == '__main__':
    app.run(debug=True)
What this little bit of Python does is wrap and expose the highlight and guess functionality from Pygments via a RESTful service that accepts and produces JSON.
I deploy Sibbly on DigitalOcean
To install Python on my droplet, I followed the process below:
sudo apt-get install python-dev build-essential
sudo apt-get install zlib1g-dev
sudo apt-get install libssl-dev openssl
sudo apt-get install python-pip
sudo pip install virtualenv
sudo pip install virtualenvwrapper
export WORKON_HOME="$HOME/.virtualenvs"
source /usr/local/bin/virtualenvwrapper.sh
sudo mkdir /opt/python3.4.1
wget http://python.org/ftp/python/3.4.1/Python-3.4.1.tgz
tar xvfz Python-3.4.1.tgz
cd Python-3.4.1
./configure --prefix=/opt/python3.4.1
make
sudo make install
mkvirtualenv --python /opt/python3.4.1/bin/python3 py-3.4.1
workon py-3.4.1
pip install flask
pip install pygments

Once that was done, to run the Flask app:
python app.py & disown
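Once the service is running, calling it is straightforward. To give an idea, a minimal sketch of a client call (the localhost URL, port and payload are assumptions for a local run, not Sibbly's actual setup):

import json

import requests

# Assumes the Flask app above is running locally on its default port 5000.
payload = {"lexer": "java", "code": "public class Hello {}"}
response = requests.post("http://localhost:5000/pygmentCode",
                         data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
print(response.json()["result"])  # the Pygments-generated HTML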
Sunday, August 10, 2014
Upgrading Spring 3.x and Hibernate 3.x to Spring Platform 1.0.1 (Spring + hibernate 4.x)
I recently volunteered to upgrade our newest project to the latest version of Spring Platform. What Spring Platform gives you is dependency and plugin management across the whole Spring framework's set of libraries.
Since we had fallen behind a little, the upgrade did raise some funnies. Here are the things I ran into:
Maven:
Our pom files were still referencing:
hibernate.jar
ehcache.jar
These artefacts don't exist in the latest versions, so I replaced them with:
hibernate-core.jar and ehcache-core.jar
We also still use the Hibernate Tools + Maven run plugin to reverse engineer our DB objects.
This I needed to update to a release candidate:
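The actual snippet isn't shown in the post; as a sketch, the plugin's hibernate-tools dependency ended up looking something along these lines (the version below is a placeholder for whichever release candidate was current, not the exact one used):

<dependency>
    <groupId>org.hibernate</groupId>
    <artifactId>hibernate-tools</artifactId>
    <!-- placeholder: use the current release candidate version -->
    <version>4.3.1.CR1</version>
</dependency>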
Hibernate:
The code "Hibernate.createBlob"... no longer exists, replaced with:
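The replacement code isn't shown in the post; a minimal sketch of the usual Hibernate 4 equivalent, going through the session's LobHelper:

import java.sql.Blob;

import org.hibernate.Session;

public class BlobUtil {

    /**
     * Hibernate 3.x allowed Hibernate.createBlob(bytes); in Hibernate 4.x the
     * equivalent is created through the current session's LobHelper.
     */
    public static Blob toBlob(final Session session, final byte[] bytes) {
        return session.getLobHelper().createBlob(bytes);
    }
}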
On the HibernateTemplate the return types are now List, not the element type, so I needed to add casts for the lists being returned.
import org.hibernate.classic.Session;
replaced with:
import org.hibernate.Session;
Reverse engineering works a little differently now: it assigns Long to numeric columns... replaced with:
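The mapping snippet itself isn't included; as an illustration, a type-mapping entry in hibernate.reveng.xml overrides what the reverse engineering assigns to numeric columns (the jdbc-type, precision, scale and target type below are assumptions, not the project's actual values):

<type-mapping>
    <!-- illustration: force NUMERIC columns to a specific Java type -->
    <sql-type jdbc-type="NUMERIC" precision="19" scale="2" hibernate-type="java.math.BigDecimal"/>
</type-mapping>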
Added:
Possible Errors:
- Caused by: org.hibernate.service.UnknownUnwrapTypeException: Cannot unwrap to requested type [javax.sql.DataSource]
And configure the settings in the cfg.xml for it:
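The actual settings aren't shown; as a sketch, the usual fix is to pull in the hibernate-c3p0 module and point Hibernate at the c3p0 connection provider in the configuration (the provider class name differs slightly between 4.x versions, so treat these values as assumptions):

<!-- illustration only: c3p0 provider and pool settings in the Hibernate configuration -->
<property name="hibernate.connection.provider_class">org.hibernate.service.jdbc.connections.internal.C3P0ConnectionProvider</property>
<property name="hibernate.c3p0.min_size">5</property>
<property name="hibernate.c3p0.max_size">20</property>
<property name="hibernate.c3p0.timeout">300</property>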
- Caused by: java.lang.ClassNotFoundException: org.hibernate.engine.FilterDefinition
Probably still using a reference to a hibernate3 factory / bean somewhere; change it to hibernate4:
org.springframework.orm.hibernate3.LocalSessionFactoryBean becomes org.springframework.orm.hibernate4.LocalSessionFactoryBean
org.springframework.orm.hibernate3.HibernateTransactionManager becomes org.springframework.orm.hibernate4.HibernateTransactionManager
- Caused by: java.lang.ClassNotFoundException: Could not load requested class : org.hibernate.hql.classic.ClassicQueryTranslatorFactory
There is a minor change in the new API, so this can be resolved by replacing the property value with:
org.hibernate.hql.internal.classic.ClassicQueryTranslatorFactory.
Spring:
Amazingly some of our application context files still referenced the Spring DTD ... replaced with XSD
In Spring configs added for c3p0:
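The bean definition isn't shown in the post; a typical c3p0 DataSource bean looks something like this (the id and property placeholders are assumptions):

<bean id="dataSource" class="com.mchange.v2.c3p0.ComboPooledDataSource" destroy-method="close">
    <property name="driverClass" value="${jdbc.driverClassName}"/>
    <property name="jdbcUrl" value="${jdbc.url}"/>
    <property name="user" value="${jdbc.username}"/>
    <property name="password" value="${jdbc.password}"/>
</bean>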
Spring removed "local"=, so I needed to just change that to "ref"=.
Spring's HibernateDaoSupport no longer has "releaseSession(session);", which is a good thing, as it forced us to update the code to work within a transaction.
Possible Errors:
- getFlushMode is not valid without active transaction; nested exception is org.hibernate.HibernateException: getFlushMode is not valid without active transaction
Removed from hibernate properties:
<prop key="hibernate.current_session_context_class">thread</prop>
Supply a custom strategy for the scoping of the "current" Session. See Section 2.5, "Contextual sessions" for more information about the built-in strategies.
- org.springframework.dao.InvalidDataAccessApiUsageException: Write operations are not allowed in read-only mode (FlushMode.MANUAL): Turn your Session into FlushMode.COMMIT/AUTO or remove 'readOnly' marker from transaction definition.
Another option is:
<bean id ="productHibernateTemplate" class="org.springframework.orm.hibernate4.HibernateTemplate">
<property name="sessionFactory" ref="productSessionFactory"/>
<property name="checkWriteOperations" value="false"/>
</bean>
- java.lang.NoClassDefFoundError: javax/servlet/SessionCookieConfig
Servlet version update:
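The dependency change isn't shown; SessionCookieConfig only exists from the Servlet 3.0 API onwards, so the upgrade boils down to something like this (the exact artifact version and scope are assumptions):

<dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>javax.servlet-api</artifactId>
    <version>3.1.0</version>
    <scope>provided</scope>
</dependency>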
- Then, deploying on WebLogic: javassist: $$_javassist_ cannot be cast to javassist.util.proxy.Proxy
The issue here was that there were different versions of javassist being brought into the ear. I removed all references from all our poms, so that the correct version gets pulled in from Spring/Hibernate,
and then configured WebLogic to prefer our version:
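The descriptor itself isn't included in the post; the general idea in weblogic-application.xml is to prefer the application's own javassist packages, along these lines (surrounding elements and namespace declarations omitted; treat it as a sketch):

<wls:prefer-application-packages>
    <wls:package-name>javassist.*</wls:package-name>
</wls:prefer-application-packages>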
Saturday, July 19, 2014
TDD, Hamcrest, Shazamcrest
Recently we have started trying to get a more TDD culture going at work; having always believed in thorough testing and decent code coverage, it shouldn't have been too hard. However... teaching an old dog new tricks can sometimes require quite a bit of patience. It turns out that breaking coding habits formed over more than a decade of keyboard bashing is harder than it seems.
So with generating an enormous amount of test code comes the usual task of code and test maintenance and reuse.
One of the tools / libraries we have included is Hamcrest, which not only improves the readability of assertion failures, but allows you to create and extend custom matchers, which you can then reuse across multiple test scenarios.
I am not going to go into too much detail on Hamcrest here; there are a bunch of great resources / blogs / tutorials out there. Just a few:
http://www.baeldung.com/hamcrest-collections-arrays
https://weblogs.java.net/blog/johnsmart/archive/2011/12/12/some-useful-new-hamcrest-matchers-collections
http://edgibbs.com/junit-4-with-hamcrest/
http://www.planetgeek.ch/2012/03/07/create-your-own-matcher/
While creating a custom type-safe matcher for one of our domain objects, I realised that it was insane... really... this.getA == that.getA... mmmm, no.
So I went searching for something that could help, and after a bit I found Shazamcrest (bonus points for the name).
What Shazamcrest does is:
Serialize the objects to compare.
Compare them and, on failure, throw a ComparisonFailure, which the major IDEs can display using their built-in diff viewers.
Great... no manual bean compares.
So I add the maven dependency, try it out on our complex domain object....
StackOverflowError... It was a known limitation at the time: the JSON provider Shazamcrest was using, GSON, does not cater for circular reference serialization.
As both Shazamcrest and GSON are open source, I decided to have a look and see if I could contribute; anything is better than writing a manual bean matcher. After some investigation I found that the guys on the GSON project had created a fix, GraphAdapterBuilder, it is just not distributed with the actual library.
So after forking the Shazamcrest GitHub project, a little bit of code and submitting a pull request:
The guys on the Shazamcrest project very quickly merged my changes in and published a new version to the maven repo (Thanks for that).
So be sure to use the 0.8 version if you are struggling with circular references.
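For reference, a minimal sketch of what an assertion with Shazamcrest looks like (the Customer class here is made up purely for illustration):

import static com.shazam.shazamcrest.MatcherAssert.assertThat;
import static com.shazam.shazamcrest.matcher.Matchers.sameBeanAs;

import org.junit.Test;

public class ShazamcrestExampleTest {

    static class Customer {
        String name;
        Customer referredBy; // circular references are handled from 0.8 onwards

        Customer(final String name) {
            this.name = name;
        }
    }

    @Test
    public void shouldMatchTheWholeObjectGraph() {
        final Customer expected = new Customer("Brian");
        final Customer actual = new Customer("Brian");

        // Serialises both objects and, on a mismatch, throws a ComparisonFailure
        // that the major IDEs render in their diff viewer.
        assertThat(actual, sameBeanAs(expected));
    }
}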
Monday, May 26, 2014
Playing with Java 8 - Lambdas, Paths and Files
I needed to read a whole bunch of files recently, and instead of just grabbing my old FileUtils.java, which I (and probably most developers) have copied from project to project, I decided to have a quick look at how else to do it...
Yes, I know there are Commons IO and Google IO; why would I even bother? They probably do it better, but I wanted to check out the NIO JDK classes and play with lambdas as well... and to be honest, I think this actually ended up being a very neat bit of code.
So I had a specific use case:
I wanted to read all the source files from a whole directory tree, line by line.
What this code does: it uses Files.walk to recursively get all the paths from the starting point and creates a stream, which I then filter to only the files that end with the required extension. For each of those files, I use Files.lines to create a stream of Strings, one per line. I trim each line, filter out the empty ones and add them to the returned collection.
All very concise thanks to the new constructs.
package net.briandupreez.blog.java8.io;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.FileVisitOption;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/**
 * RecursiveFileLineReader
 * Created by Brian on 2014-05-26.
 */
public class RecursiveFileLineReader {

    private transient static final Log LOG = LogFactory.getLog(RecursiveFileLineReader.class);

    /**
     * Get all the non empty lines from all the files with the specific extension, recursively.
     *
     * @param path      the path to start recursion
     * @param extension the file extension
     * @return list of lines
     */
    public static List<String> readAllLineFromAllFilesRecursively(final String path, final String extension) {
        final List<String> lines = new ArrayList<>();
        try (final Stream<Path> pathStream = Files.walk(Paths.get(path), FileVisitOption.FOLLOW_LINKS)) {
            pathStream
                    .filter((p) -> !p.toFile().isDirectory() && p.toFile().getAbsolutePath().endsWith(extension))
                    .forEach(p -> fileLinesToList(p, lines));
        } catch (final IOException e) {
            LOG.error(e.getMessage(), e);
        }
        return lines;
    }

    private static void fileLinesToList(final Path file, final List<String> lines) {
        try (Stream<String> stream = Files.lines(file, Charset.defaultCharset())) {
            stream
                    .map(String::trim)
                    .filter(s -> !s.isEmpty())
                    .forEach(lines::add);
        } catch (final IOException e) {
            LOG.error(e.getMessage(), e);
        }
    }
}
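A quick usage sketch of the class above (the directory and extension are placeholders, not paths from the original post):

import java.util.List;

public class RecursiveFileLineReaderExample {

    public static void main(final String[] args) {
        // Read every non-empty line from all .java files under the given root.
        final List<String> lines =
                RecursiveFileLineReader.readAllLineFromAllFilesRecursively("/Development/src", ".java");
        lines.forEach(System.out::println);
    }
}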
Saturday, April 26, 2014
Playing with Java 8 - Lambdas and Concurrency
So Java 8 was released a while back, with a ton of features and changes. All of us Java zealots have been waiting for this for ages, all the way back to when they originally announced all the great features that would be in Java 7, which ended up being pulled.
I have just recently had the time to actually start giving it a real look. I updated my home projects to 8, and I have to say I am generally quite happy with what we got. The java.time API that "mimics" Joda-Time is a big improvement, the java.util.stream package is going to be useful, and lambdas are going to change our coding style, which might take a bit of getting used to. With those changes, the quote "With great power comes great responsibility" rings true; I sense there may be some interesting times in our future, as it is quite easy to write some hard-to-decipher code. As an example, debugging the code I wrote below would be "fun"...
The file example is on my Github blog repo
What this example does is simple: run a couple of threads, do some work concurrently, then wait for them all to complete. I figured while I am playing with Java 8, let me go for it fully...
Here's what I came up with:
package net.briandupreez.blog.java8.futures;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import java.util.Collection;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

/**
 * Generified future running and completion
 *
 * @param <T> the result type
 * @param <S> the task input
 */
public class WaitingFuturesRunner<T, S> {

    private transient static final Log logger = LogFactory.getLog(WaitingFuturesRunner.class);
    private final Collection<Task<T, S>> tasks;
    private final long timeOut;
    private final TimeUnit timeUnit;
    private final ExecutorService executor;

    /**
     * Constructor, used to initialise with the required tasks
     *
     * @param tasks    the list of tasks to execute
     * @param timeOut  max length of time to wait
     * @param timeUnit time out timeUnit
     */
    public WaitingFuturesRunner(final Collection<Task<T, S>> tasks, final long timeOut, final TimeUnit timeUnit) {
        this.tasks = tasks;
        this.timeOut = timeOut;
        this.timeUnit = timeUnit;
        this.executor = Executors.newFixedThreadPool(tasks.size());
    }

    /**
     * Go!
     *
     * @param taskInput          The input to the task
     * @param consolidatedResult a container of all the completed results
     */
    public void go(final S taskInput, final ConsolidatedResult<T> consolidatedResult) {
        final CountDownLatch latch = new CountDownLatch(tasks.size());
        final List<CompletableFuture<T>> theFutures = tasks.stream()
                .map(aSearch -> CompletableFuture.supplyAsync(() -> processTask(aSearch, taskInput, latch), executor))
                .collect(Collectors.<CompletableFuture<T>>toList());

        final CompletableFuture<List<T>> allDone = collectTasks(theFutures);
        try {
            latch.await(timeOut, timeUnit);
            logger.debug("complete... adding results");
            allDone.get().forEach(consolidatedResult::addResult);
        } catch (final InterruptedException | ExecutionException e) {
            logger.error("Thread Error", e);
            throw new RuntimeException("Thread Error, could not complete processing", e);
        }
    }

    private <E> CompletableFuture<List<E>> collectTasks(final List<CompletableFuture<E>> futures) {
        final CompletableFuture<Void> allDoneFuture = CompletableFuture.allOf(futures.toArray(new CompletableFuture[futures.size()]));
        return allDoneFuture.thenApply(v -> futures.stream()
                .map(CompletableFuture<E>::join)
                .collect(Collectors.<E>toList())
        );
    }

    private T processTask(final Task<T, S> task, final S searchTerm, final CountDownLatch latch) {
        logger.debug("Starting: " + task);
        T searchResults = null;
        try {
            searchResults = task.process(searchTerm, latch);
        } catch (final Exception e) {
            e.printStackTrace();
        }
        return searchResults;
    }
}
Test:
package net.briandupreez.blog.java8.futures;

import net.briandupreez.blog.java8.futures.example.StringInputTask;
import net.briandupreez.blog.java8.futures.example.StringResults;
import org.apache.log4j.BasicConfigurator;
import org.junit.Assert;
import org.junit.BeforeClass;
import org.junit.Test;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/**
 * Test
 * Created by brian on 4/26/14.
 */
public class CompletableFuturesRunnerTest {

    @BeforeClass
    public static void init() {
        BasicConfigurator.configure();
    }

    /**
     * 5 tasks at 3000ms concurrently should not be more than 3100
     *
     * @throws Exception error
     */
    @Test(timeout = 3100)
    public void testGo() throws Exception {
        final List<Task<String, String>> taskList = setupTasks();
        final WaitingFuturesRunner<String, String> completableFuturesRunner = new WaitingFuturesRunner<>(taskList, 4, TimeUnit.SECONDS);
        final StringResults consolidatedResults = new StringResults();
        completableFuturesRunner.go("Something To Process", consolidatedResults);
        Assert.assertEquals(5, consolidatedResults.getResults().size());
        for (final String s : consolidatedResults.getResults()) {
            Assert.assertTrue(s.contains("complete"));
            Assert.assertTrue(s.contains("Something To Process"));
        }
    }

    private List<Task<String, String>> setupTasks() {
        final List<Task<String, String>> taskList = new ArrayList<>();
        final StringInputTask stringInputTask = new StringInputTask("Task 1");
        final StringInputTask stringInputTask2 = new StringInputTask("Task 2");
        final StringInputTask stringInputTask3 = new StringInputTask("Task 3");
        final StringInputTask stringInputTask4 = new StringInputTask("Task 4");
        final StringInputTask stringInputTask5 = new StringInputTask("Task 5");
        taskList.add(stringInputTask);
        taskList.add(stringInputTask2);
        taskList.add(stringInputTask3);
        taskList.add(stringInputTask4);
        taskList.add(stringInputTask5);
        return taskList;
    }
}
Output:
0 [pool-1-thread-1] Starting: StringInputTask{taskName='Task 1'}
0 [pool-1-thread-5] Starting: StringInputTask{taskName='Task 5'}
0 [pool-1-thread-2] Starting: StringInputTask{taskName='Task 2'}
2 [pool-1-thread-4] Starting: StringInputTask{taskName='Task 4'}
2 [pool-1-thread-3] Starting: StringInputTask{taskName='Task 3'}
3003 [pool-1-thread-5] Done: Task 5
3004 [pool-1-thread-3] Done: Task 3
3003 [pool-1-thread-1] Done: Task 1
3003 [pool-1-thread-4] Done: Task 4
3003 [pool-1-thread-2] Done: Task 2
3007 [Thread-0] WaitingFuturesRunner - complete... adding results
Some of the useful articles / links I found and read while doing this:
Oracle: Lambda Tutorial
IBM: Java 8 Concurrency
Tomasz Nurkiewicz : Definitive Guide to CompletableFuture
Sunday, February 16, 2014
Local Wikipedia with Solr and Spring Data
Continuing with my little AI / machine learning research project... I wanted a decent-sized repo of English text that was not in a complete mess like a large percentage of the data on the internet. I figured I would try Wikipedia, but what to do with about 40Gb of XML, and how do I work with / query all that data? Based on a recent work implementation where we load something like 200 000 000 records into a Solr cache, I figured Solr would be the way to go, so this is an example of my basic implementation.
Required for this example:
Wikipedia download (warning it is a 9.9Gb file, extracts to about 42Gb)
Solr
Spring Data (Great Blog / Examples on Spring Data: Petri Kainulainen's blog)
All the code and unit test for this post is on my blog GitHub Repo
When setting up Solr from scratch you can have a look at Solr's wiki or documentation, which is pretty good. There is also an example of importing Wikipedia here; I started with that and made some minor modifications.
For this specific example the Solr config needed (/conf):
For this example (and in the below config files),
Solr home: /Development/Solr
Index / Data: /Development/Data/solr_data/wikipedia
Import File: /Development/Data/enwiki-latest-pages-articles.xml
The full import into Solr took about 48 hours on my old 2011 i5 iMac and the index on my current setup is about 52Gb.
The code for this ended up being quite clean: Spring Data Solr gives you 2 main interfaces, SolrIndexService and SolrCrudRepository. You simply extend / implement these 2, wrap that in a single interface, autowire it from a Spring Java config and you are good to go. The config and code below cover the data import config, the schema and the Solr config, followed by the repository, the index service, the Solr service and the Spring context. After that, the next thing for me to look at for sourcing data is Spring Social.

Data Config for the import:
<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/Development/Data/enwiki-latest-pages-articles.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id" xpath="/mediawiki/page/id" />
            <field column="title" xpath="/mediawiki/page/title" />
            <field column="revision" xpath="/mediawiki/page/revision/id" />
            <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text" xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
        </entity>
    </document>
</dataConfig>
Schema:
<?xml version="1.0" ?>
<schema name="wikipediaCore" version="1.1">
    <types>
        <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
        <fieldType name="pint" class="solr.IntField"/>
        <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"/>
        <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    </types>
    <fields>
        <field name="id" type="string" indexed="true" stored="true" required="true"/>
        <field name="title" type="string" indexed="true" stored="true"/>
        <field name="revision" type="pint" indexed="false" stored="false"/>
        <field name="user" type="string" indexed="false" stored="true"/>
        <field name="userId" type="pint" indexed="false" stored="true"/>
        <field name="text" type="text_en" indexed="true" stored="true"/>
        <field name="timestamp" type="date" indexed="false" stored="true"/>
        <field name="_version_" type="long" indexed="true" stored="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>title</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
</schema>
Solr Config:
<?xml version="1.0" encoding="UTF-8" ?>
<config>
    <luceneMatchVersion>4.6</luceneMatchVersion>
    <lib dir="/Development/Solr/lib" regex="solr-dataimporthandler-.*\.jar" />
    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
    <dataDir>${solr.wikipedia.data.dir:/Development/Data/solr_data/wikipedia}</dataDir>
    <schemaFactory class="ClassicIndexSchemaFactory"/>
    <updateHandler class="solr.DirectUpdateHandler2">
        <updateLog>
            <str name="dir">${solr.wikipedia.data.dir:}</str>
        </updateLog>
    </updateHandler>
    <requestHandler name="/get" class="solr.RealTimeGetHandler">
        <lst name="defaults">
            <str name="omitHeader">true</str>
        </lst>
    </requestHandler>
    <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />
    <requestDispatcher handleSelect="true" >
        <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048" />
    </requestDispatcher>
    <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
    <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
    <requestHandler name="/update" class="solr.UpdateRequestHandler" />
    <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
        <lst name="invariants">
            <str name="q">solrpingquery</str>
        </lst>
        <lst name="defaults">
            <str name="echoParams">all</str>
        </lst>
    </requestHandler>
    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
            <str name="config">data-config.xml</str>
        </lst>
    </requestHandler>
    <admin>
        <defaultQuery>*:*</defaultQuery>
    </admin>
    <unlockOnStartup>true</unlockOnStartup>
</config>
Repository:
package net.briandupreez.solr.wikipedia;

import net.briandupreez.solr.documents.WikipediaDocument;
import org.springframework.data.solr.repository.Query;
import org.springframework.data.solr.repository.SolrCrudRepository;
import org.springframework.stereotype.Repository;

import java.util.Collection;

/**
 * Wikipedia repo.
 * Created by Brian on 2014/01/26.
 */
@Repository
public interface WikipediaDocumentRepository extends SolrCrudRepository<WikipediaDocument, String> {

    @Query("title:*?0*")
    Collection<WikipediaDocument> findByTitleContains(final String title);

    @Query("text:?0*")
    Collection<WikipediaDocument> findByTextContains(final String text);

    @Query("title:*?0* OR text:?0*")
    Collection<WikipediaDocument> findByAllContains(final String text);
}
IndexService:
package net.briandupreez.solr.wikipedia;

import net.briandupreez.solr.SolrIndexService;
import net.briandupreez.solr.documents.WikipediaDocument;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import javax.annotation.Resource;

/**
 * Wikipedia index
 * Created by Brian on 2014/01/26.
 */
@Service
public class WikipediaIndexService implements SolrIndexService<WikipediaDocument, String> {

    private transient final Log logger = LogFactory.getLog(this.getClass());

    @Resource
    private WikipediaDocumentRepository repository;

    @Transactional
    @Override
    public WikipediaDocument add(final WikipediaDocument entry) {
        final WikipediaDocument saved = repository.save(entry);
        logger.debug("Saved: " + saved);
        return saved;
    }

    @Transactional
    @Override
    public void delete(final String id) {
        repository.delete(id);
        logger.debug("Deleted ID: " + id);
    }
}
SolrService:
package net.briandupreez.solr.wikipedia;

import net.briandupreez.solr.documents.WikipediaDocument;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import javax.annotation.Resource;
import java.util.Collection;
import java.util.Date;

/**
 * Solr Service.
 * Created by Brian on 2014/01/26.
 */
@Service
public class WikipediaSolrServiceImpl implements WikipediaSolrService {

    private transient final Log logger = LogFactory.getLog(this.getClass());

    @Resource
    private WikipediaIndexService indexService;

    @Resource
    private WikipediaDocumentRepository repository;

    @Transactional
    @Override
    public WikipediaDocument add(final String id, final String title, final String user, final String userId, final String text, final Date timestamp) {
        final WikipediaDocument wikipediaDocument = new WikipediaDocument();
        wikipediaDocument.setId(id);
        wikipediaDocument.setTitle(title);
        wikipediaDocument.setText(text);
        wikipediaDocument.setUserId(userId);
        wikipediaDocument.setUser(user);
        wikipediaDocument.setTimestamp(timestamp);
        wikipediaDocument.setAll(wikipediaDocument.toString());
        return indexService.add(wikipediaDocument);
    }

    @Transactional
    @Override
    public void deleteById(final String id) {
        indexService.delete(id);
    }

    @Transactional(readOnly = true)
    @Override
    public WikipediaDocument findById(final String id) {
        final WikipediaDocument wikipediaDocument = repository.findOne(id);
        logger.debug("FOUND: " + wikipediaDocument);
        return wikipediaDocument;
    }

    @Transactional(readOnly = true)
    @Override
    public Collection<WikipediaDocument> findByTitleContains(final String title) {
        return repository.findByTitleContains(title);
    }

    @Transactional(readOnly = true)
    @Override
    public Collection<WikipediaDocument> findByTextContains(final String text) {
        return repository.findByTextContains(text);
    }

    @Transactional
    @Override
    public Collection<WikipediaDocument> findByAllContains(final String text) {
        return repository.findByAllContains(text);
    }
}
SpringContext:
package net.briandupreez.solr;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.PropertySource;
import org.springframework.core.env.Environment;
import org.springframework.data.solr.core.SolrTemplate;
import org.springframework.data.solr.repository.config.EnableSolrRepositories;
import org.springframework.data.solr.server.support.HttpSolrServerFactoryBean;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.jta.JtaTransactionManager;

import javax.annotation.Resource;

/**
 * Solr Context
 * Created by Brian on 2014/01/26.
 */
@Configuration
@EnableSolrRepositories(basePackages = "net.briandupreez.solr.wikipedia")
@ComponentScan(basePackages = "net.briandupreez.solr")
@PropertySource("classpath:solr.properties")
public class SolrContext {

    @Resource
    private Environment environment;

    /**
     * Solr Factory bean
     *
     * @return factory bean
     */
    @Bean
    public HttpSolrServerFactoryBean solrServerFactoryBean() {
        final HttpSolrServerFactoryBean factory = new HttpSolrServerFactoryBean();
        factory.setUrl(environment.getRequiredProperty("solr.server.url.wiki"));
        return factory;
    }

    /**
     * The Solr Template... used in WikipediaDocumentRepository.
     *
     * @return created template
     * @throws Exception error.
     */
    @Bean
    public SolrTemplate solrTemplate() throws Exception {
        return new SolrTemplate(solrServerFactoryBean().getObject());
    }

    @Bean
    public PlatformTransactionManager transactionManager() throws Exception {
        return new JtaTransactionManager();
    }
}
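To round the example off, a small usage sketch of the service (it assumes solr.properties points at a running, populated Solr core, and that the classes above are on the classpath; nothing here is from the original post):

import org.springframework.context.annotation.AnnotationConfigApplicationContext;

import net.briandupreez.solr.SolrContext;
import net.briandupreez.solr.wikipedia.WikipediaSolrService;

public class WikipediaSearchExample {

    public static void main(final String[] args) {
        final AnnotationConfigApplicationContext context =
                new AnnotationConfigApplicationContext(SolrContext.class);
        final WikipediaSolrService service = context.getBean(WikipediaSolrService.class);

        // Simple title search against the imported Wikipedia index.
        service.findByTitleContains("Machine learning").forEach(System.out::println);

        context.close();
    }
}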
Sunday, January 12, 2014
BYG (Bing, Yahoo, Google) Search Wrapper
One small section of my Aria project will be to interface with the current search engines out there. To do this I will require a module that gives me a consistent interface to work with the 3 main providers: Bing, Yahoo! and Google (and any future ones I may want to add). This is a basic example of that module.
First thing required is to set up accounts / projects and the like with the relevant providers.
I won't describe this process as they were all pretty well documented.
Bing Developer Center
Yahoo Developer Network
Google Developers Console
A couple tips for the above sites.
- Bing: Setup both the web and synonym searches.
- Yahoo: In the BOSS console, under manage account, put in a daily limit $ amount (or turn off the limit), as they only allow 1 free query a day... so otherwise only the first request works.
- Google: It doesn't seem that you can set it up to search the whole web, but after creating your custom search engine, you can select "Search the entire web but emphasize included sites", so don't worry about that.
All these providers allow for many options while searching (e.g. images, location, news, video, etc.); however, in this initial example I have limited it to just a pure and simple web search.
All the code will be available in my blog Github repository.
Going through the main points.
There is a BasicWebSearch interface, that takes the search term and returns SearchResults.
SearchResults contains results in a map based on a result type enum.
The implementations of BasicWebSearch, namely BingSearch, GoogleSearch and YahooSearch, call the relevant search engine with the search term and then convert the results into a SearchResult. In the case of Yahoo and Bing, I map the JSON result to the SearchResult; Google, however, does that in their search client included in the dependencies.
Now for the main code bits:
SearchSettings
As this is just an example, I have included the search settings in the following class; be sure to replace them with the relevant values.
package net.briandupreez.search;

/**
 * All search settings
 * Created by Brian on 2014/01/05.
 */
public class SearchSettings {

    public static final String YAHOO_BASE = "http://yboss.yahooapis.com/ysearch";
    public static final String YAHOO_CONSUMER_KEY = "REPLACE ME - consumer key";
    public static final String YAHOO_CONSUMER_SECRET = "REPLACE ME - secret";

    public static final String GOOGLE_API_KEY = "REPLACE ME - google api key";
    public static final String GOOGLE_CX = "REPLACE ME - numbers : alphanumeric";

    //"SearchWeb" , "Search".... using both can give you 10 000 free queries...
    public static final String BING_SEARCH_BASE = "https://api.datamarket.azure.com/Bing/Search/v1/Web";
    public static final String BING_WEB_BASE = "https://api.datamarket.azure.com/Bing/SearchWeb/v1/Web";
    public static final String BING_SYNONYM_BASE = "https://api.datamarket.azure.com/Bing/Synonyms/v1/GetSynonyms";
    public static final String BING_API_KEY = "REPLACE ME - Bing API key";

    public static final String ENCODE_FORMAT = "UTF-8";
    public static final int HTTP_STATUS_OK = 200;
}
UrlConnectionHandler
As both Bing and Yahoo use an HttpURLConnection, I figured I would centralise the handling of that; the only difference between the 2 is that Bing uses basic authentication while for Yahoo I went with the OAuth implementation.
package net.briandupreez.search;

import oauth.signpost.OAuthConsumer;
import oauth.signpost.exception.OAuthCommunicationException;
import oauth.signpost.exception.OAuthExpectationFailedException;
import oauth.signpost.exception.OAuthMessageSignerException;
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Handle a URL Connection.
 * Created by Brian on 2014/01/08.
 */
public class UrlConnectionHandler {

    private transient final Log logger = LogFactory.getLog(this.getClass());

    /**
     * Base Auth, used with Bing search
     *
     * @param url    the url
     * @param apiKey the API Key
     * @return http url connection
     */
    public HttpURLConnection createBasicConnection(final String url, final String apiKey) {
        final HttpURLConnection connection = createConnection(url);
        final byte[] accountKeyBytes = Base64.encodeBase64((apiKey + ":" + apiKey).getBytes());
        final String accountKeyEnc = new String(accountKeyBytes);
        final String s1 = "Basic " + accountKeyEnc;
        connection.setRequestProperty("Authorization", s1);
        return connection;
    }

    /**
     * A Signed OAuth Connection, used with yahoo
     *
     * @param url      the url
     * @param consumer the oauth consumer
     * @return http url connection
     */
    public HttpURLConnection createOAuthConnection(final String url, final OAuthConsumer consumer) {
        final HttpURLConnection connection = createConnection(url);
        if (consumer != null) {
            try {
                logger.info("Signing the oAuth consumer");
                consumer.sign(connection);
                connection.connect();
                return connection;
            } catch (OAuthMessageSignerException | OAuthExpectationFailedException | OAuthCommunicationException e) {
                logger.error("OAuth Error signing the consumer", e);
                throw new RuntimeException("OAuth Error", e);
            } catch (final IOException e) {
                logger.error("Connection Error", e);
                throw new RuntimeException("Connection Error", e);
            }
        }
        return null;
    }

    private HttpURLConnection createConnection(final String url) {
        try {
            final URL u = new URL(url);
            final HttpURLConnection uc = (HttpURLConnection) u.openConnection();
            return uc;
        } catch (final Exception e) {
            logger.error("Create Connection Exception.", e);
            throw new RuntimeException("Connection Error", e);
        }
    }

    /**
     * Process connection
     *
     * @param connection the connection
     * @return the result
     */
    public RequestResult processConnection(final HttpURLConnection connection) {
        RequestResult result = null;
        try {
            final int responseCode = connection.getResponseCode();
            if (200 == responseCode || 401 == responseCode || 404 == responseCode) {
                BufferedReader rd = null;
                try {
                    rd = new BufferedReader(new InputStreamReader(responseCode == 200 ? connection.getInputStream() : connection.getErrorStream()));
                    final StringBuilder sb = new StringBuilder();
                    String line;
                    while ((line = rd.readLine()) != null) {
                        sb.append(line);
                    }
                    result = new RequestResult(responseCode, sb.toString());
                } catch (final IOException e) {
                    logger.error("Stream Error", e);
                    throw new RuntimeException("Stream Error", e);
                } finally {
                    if (rd != null) {
                        rd.close();
                    }
                }
            }
        } catch (final IOException e) {
            logger.error("Connection Exception", e);
            throw new RuntimeException("Connection Exception", e);
        }
        return result;
    }

    public static class RequestResult {

        private final int responseCode;
        private final String response;

        public RequestResult(final int responseCode, final String response) {
            this.responseCode = responseCode;
            this.response = response;
        }

        public int getResponseCode() {
            return responseCode;
        }

        public String getResponse() {
            return response;
        }
    }
}
BingSearch
package net.briandupreez.search.bing;

import net.briandupreez.search.BasicWebSearch;
import net.briandupreez.search.SearchResults;
import net.briandupreez.search.UrlConnectionHandler;
import org.apache.commons.httpclient.util.URIUtil;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import java.net.HttpURLConnection;

import static net.briandupreez.search.SearchSettings.*;

/**
 * Bing search api integration
 * Created by Brian on 2014/01/02.
 */
public class BingSearch implements BasicWebSearch {

    private static transient final Log log = LogFactory.getLog(BingSearch.class);

    @Override
    public SearchResults search(final String searchTerm) throws Exception {
        final String bingUrl = String.format("%s?Query=%%27%s%%27&$format=JSON", BING_WEB_BASE, URIUtil.encode(searchTerm, null, ENCODE_FORMAT));
        SearchResults searchResults = new SearchResults(searchTerm);
        try {
            final UrlConnectionHandler urlConnectionHandler = new UrlConnectionHandler();
            final HttpURLConnection basicConnection = urlConnectionHandler.createBasicConnection(bingUrl, BING_API_KEY);
            final UrlConnectionHandler.RequestResult result = urlConnectionHandler.processConnection(basicConnection);
            if (result.getResponseCode() == HTTP_STATUS_OK) {
                final BingResultParser bingResultParser = new BingResultParser();
                searchResults = bingResultParser.parseWeb(searchTerm, result.getResponse());
            } else {
                searchResults.setFailed(true);
                log.error("Error in response due to status code = " + result.getResponseCode() + "Response:\n" + result.getResponse());
            }
        } catch (final Exception e) {
            searchResults.setFailed(true);
            log.error("Search Error", e);
        }
        return searchResults;
    }
}
BingResultParser
package net.briandupreez.search.bing;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import net.briandupreez.search.SearchResult;
import net.briandupreez.search.SearchResultParser;
import net.briandupreez.search.SearchResults;
import net.briandupreez.search.SearchSynonymResults;
import org.apache.log4j.Logger;

import java.io.IOException;

/**
 * Parse the results
 * Created by Brian on 2014/01/04.
 */
public class BingResultParser implements SearchResultParser {

    private static final Logger log = Logger.getLogger(BingResultParser.class);

    @Override
    public SearchResults parseWeb(final String searchTerm, final String searchResults) {
        final ObjectMapper mapper = new ObjectMapper();
        final SearchResults response = new SearchResults(searchTerm);
        final JsonNode input;
        try {
            input = mapper.readTree(searchResults);
            final JsonNode webResults = input.get("d").get("results");
            for (final JsonNode element : webResults) {
                final SearchResult result = new SearchResult();
                result.setUrl(element.get("Url").asText());
                result.setDisplay(element.get("DisplayUrl").asText());
                result.setDescription(element.get("Description").asText());
                result.setTitle(element.get("Title").asText());
                response.addResult(SearchResults.ResultType.WEB, result);
            }
        } catch (final IOException e) {
            log.error("Parser Error", e);
            throw new RuntimeException("Result Parser Failure", e);
        }
        return response;
    }

    public SearchSynonymResults parseSynonym(final String searchTerm, final String synonymResults) {
        final ObjectMapper mapper = new ObjectMapper();
        final SearchSynonymResults response = new SearchSynonymResults(searchTerm);
        final JsonNode input;
        try {
            input = mapper.readTree(synonymResults);
            final JsonNode webResults = input.get("d").get("results");
            for (final JsonNode element : webResults) {
                response.addSynonym(element.get("Synonym").asText());
            }
        } catch (final IOException e) {
            log.error("Parser Error", e);
            throw new RuntimeException("Result Parser Failure", e);
        }
        return response;
    }
}
YahooSearch
package net.briandupreez.search.yahoo;

import net.briandupreez.search.BasicWebSearch;
import net.briandupreez.search.SearchResults;
import net.briandupreez.search.UrlConnectionHandler;
import oauth.signpost.OAuthConsumer;
import oauth.signpost.basic.DefaultOAuthConsumer;
import org.apache.commons.httpclient.util.URIUtil;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import java.net.HttpURLConnection;

import static net.briandupreez.search.SearchSettings.*;

/**
 * Yahoo! Search BOSS
 */
public class YahooSearch implements BasicWebSearch {

    private transient final Log log = LogFactory.getLog(this.getClass());

    /**
     * Search
     *
     * @return results
     */
    @Override
    public SearchResults search(final String searchTerm) throws Exception {
        SearchResults searchResults = new SearchResults(searchTerm);
        // replace the + with %20... it seems OAuth doesn't like it
        final String url = String.format("%s/web?q=%s", YAHOO_BASE, URIUtil.encode(searchTerm, null, ENCODE_FORMAT)).replace("+", "%20");
        final OAuthConsumer consumer = new DefaultOAuthConsumer(YAHOO_CONSUMER_KEY, YAHOO_CONSUMER_SECRET);
        final String responseBody;
        try {
            final UrlConnectionHandler connectionHandler = new UrlConnectionHandler();
            final HttpURLConnection oAuthConnection = connectionHandler.createOAuthConnection(url, consumer);
            log.info("sending get request to: " + url + " Decoded: " + URIUtil.decode(url));
            final UrlConnectionHandler.RequestResult result = connectionHandler.processConnection(oAuthConnection);
            if (result.getResponseCode() == HTTP_STATUS_OK) {
                responseBody = result.getResponse();
                log.info("Response: " + responseBody);
                if (!responseBody.contains("yahoo:error")) {
                    final YahooResultParser yahooResultParser = new YahooResultParser();
                    searchResults = yahooResultParser.parseWeb(searchTerm, responseBody);
                }
            } else {
                searchResults.setFailed(true);
                log.error("Error in response due to status code = " + result.getResponseCode() + " Response:\n" + result.getResponse());
            }
        } catch (final Exception e) {
            searchResults.setFailed(true);
            log.error("Search Error", e);
        }
        return searchResults;
    }
}
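The UrlConnectionHandler class isn't included in this post. For context, the OAuth step boils down to letting Signpost sign a plain java.net.HttpURLConnection before the request is sent; the sketch below is an assumption of what createOAuthConnection might look like, not the actual implementation.

import java.net.HttpURLConnection;
import java.net.URL;

import oauth.signpost.OAuthConsumer;

/**
 * A rough sketch only: Signpost's consumer can sign a java.net.HttpURLConnection
 * directly, which is all the two-legged OAuth call to Yahoo! BOSS needs.
 */
public final class OAuthConnectionSketch {

    public static HttpURLConnection createOAuthConnection(final String url, final OAuthConsumer consumer) throws Exception {
        final HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("GET");
        consumer.sign(connection); // adds the OAuth Authorization header
        return connection;
    }
}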
YahooResultParser
package net.briandupreez.search.yahoo;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import net.briandupreez.search.SearchResult;
import net.briandupreez.search.SearchResultParser;
import net.briandupreez.search.SearchResults;
import org.apache.log4j.Logger;

import java.io.IOException;

/**
 * Parse the results
 * Created by Brian on 2014/01/04.
 */
public class YahooResultParser implements SearchResultParser {

    private static final Logger log = Logger.getLogger(YahooResultParser.class);

    @Override
    public SearchResults parseWeb(final String searchTerm, final String searchResults) {
        final ObjectMapper mapper = new ObjectMapper();
        final SearchResults response = new SearchResults(searchTerm);
        final JsonNode input;
        try {
            input = mapper.readTree(searchResults);
            final JsonNode webResults = input.get("bossresponse").get("web").get("results");
            for (final JsonNode element : webResults) {
                final SearchResult result = new SearchResult();
                result.setDescription(element.get("abstract").asText());
                result.setTitle(element.get("title").asText());
                result.setDisplay(element.get("dispurl").asText());
                result.setUrl(element.get("url").asText());
                response.addResult(SearchResults.ResultType.WEB, result);
            }
        } catch (final IOException e) {
            log.error("Parser Error", e);
            throw new RuntimeException("Result Parser Failure", e);
        }
        return response;
    }
}
GoogleSearch
package net.briandupreez.search.google;

import com.google.api.client.http.HttpRequest;
import com.google.api.client.http.HttpRequestInitializer;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.jackson.JacksonFactory;
import com.google.api.services.customsearch.Customsearch;
import com.google.api.services.customsearch.model.Result;
import com.google.api.services.customsearch.model.Search;
import net.briandupreez.search.BasicWebSearch;
import net.briandupreez.search.SearchResults;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import java.io.IOException;
import java.util.List;

import static net.briandupreez.search.SearchSettings.*;

/**
 * Created by Brian on 2014/01/04.
 */
public class GoogleSearch implements BasicWebSearch {

    private transient final Log logger = LogFactory.getLog(this.getClass());

    @Override
    public SearchResults search(final String query) throws Exception {
        final Customsearch customsearch = new Customsearch(new NetHttpTransport(), new JacksonFactory(), new DisableTimeoutRequest());
        final SearchResults searchResults = new SearchResults(query);
        try {
            final Customsearch.Cse.List list = customsearch.cse().list(query);
            list.setKey(GOOGLE_API_KEY);
            list.setCx(GOOGLE_CX);
            final Search results = list.execute();
            final List<Result> items = results.getItems();
            for (final Result result : items) {
                final GoogleSearchResult searchResult = new GoogleSearchResult();
                searchResult.setTitle(result.getTitle());
                searchResult.setDisplay(result.getDisplayLink());
                searchResult.setUrl(result.getFormattedUrl());
                searchResult.setDescription(result.getSnippet());
                searchResult.setPagemap(result.getPagemap());
                searchResult.setMime(result.getMime());
                searchResult.setLink(result.getLink());
                searchResult.setKind(result.getKind());
                searchResult.setHtmlTitle(result.getHtmlTitle());
                searchResult.setHtmlSnippet(result.getHtmlSnippet());
                searchResult.setFormattedUrl(result.getFormattedUrl());
                searchResult.setFileFormat(result.getFileFormat());
                searchResults.addResult(SearchResults.ResultType.WEB, searchResult);
            }
        } catch (final IOException e) {
            searchResults.setFailed(true);
            logger.error("Google Search Error", e);
        }
        return searchResults;
    }

    public class DisableTimeoutRequest implements HttpRequestInitializer {
        public void initialize(final HttpRequest request) {
            request.setConnectTimeout(0);
            request.setReadTimeout(0);
        }
    }
}
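All the engines implement the same BasicWebSearch interface, so the calling code can loop over them without caring which is which. Below is a minimal sketch of that; the BingSearch class name and the SearchResults.isFailed() accessor are assumptions on my part to keep the sketch self-contained, mirroring the setFailed(true) calls above.

import java.util.Arrays;
import java.util.List;

import net.briandupreez.search.BasicWebSearch;
import net.briandupreez.search.SearchResults;
import net.briandupreez.search.bing.BingSearch;
import net.briandupreez.search.google.GoogleSearch;
import net.briandupreez.search.yahoo.YahooSearch;

public final class SearchAllEngines {

    public static void main(final String[] args) throws Exception {
        // BingSearch and isFailed() are assumed class/accessor names, not spelled out above.
        final List<BasicWebSearch> engines = Arrays.<BasicWebSearch>asList(
                new BingSearch(), new YahooSearch(), new GoogleSearch());

        for (final BasicWebSearch engine : engines) {
            final SearchResults results = engine.search("java microservices");
            System.out.println(engine.getClass().getSimpleName() + " failed: " + results.isFailed());
        }
    }
}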
GoogleSearchResult
Google returns a whole bunch of extra information, so I extended the base SearchResult to add it all, just in case I ever need it.
package net.briandupreez.search.google;

import net.briandupreez.search.SearchResult;

import java.util.List;
import java.util.Map;

/**
 * Google specific return
 * Created by Brian on 2014/01/05.
 */
public class GoogleSearchResult extends SearchResult {

    private String fileFormat;
    private String formattedUrl;
    private String htmlSnippet;
    private String htmlTitle;
    //private Image image;
    private String kind;
    private String link;
    private String mime;
    private Map<String, List<Map<String, Object>>> pagemap;

    public Map<String, List<Map<String, Object>>> getPagemap() {
        return pagemap;
    }

    public void setPagemap(final Map<String, List<Map<String, Object>>> pagemap) {
        this.pagemap = pagemap;
    }

    public String getMime() {
        return mime;
    }

    public void setMime(final String mime) {
        this.mime = mime;
    }

    public String getLink() {
        return link;
    }

    public void setLink(final String link) {
        this.link = link;
    }

    public String getKind() {
        return kind;
    }

    public void setKind(final String kind) {
        this.kind = kind;
    }

    public String getHtmlTitle() {
        return htmlTitle;
    }

    public void setHtmlTitle(final String htmlTitle) {
        this.htmlTitle = htmlTitle;
    }

    public String getHtmlSnippet() {
        return htmlSnippet;
    }

    public void setHtmlSnippet(final String htmlSnippet) {
        this.htmlSnippet = htmlSnippet;
    }

    public String getFormattedUrl() {
        return formattedUrl;
    }

    public void setFormattedUrl(final String formattedUrl) {
        this.formattedUrl = formattedUrl;
    }

    public String getFileFormat() {
        return fileFormat;
    }

    public void setFileFormat(final String fileFormat) {
        this.fileFormat = fileFormat;
    }

    @Override
    public String toString() {
        return "GoogleSearchResult{" +
                "fileFormat='" + fileFormat + '\'' +
                ", formattedUrl='" + formattedUrl + '\'' +
                ", htmlSnippet='" + htmlSnippet + '\'' +
                ", htmlTitle='" + htmlTitle + '\'' +
                ", kind='" + kind + '\'' +
                ", link='" + link + '\'' +
                ", mime='" + mime + '\'' +
                ", pagemap=" + pagemap +
                '}';
    }
}
Maven Dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>Blog</artifactId>
        <groupId>Blog</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>BYGSearch</artifactId>

    <dependencies>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>15.0</version>
        </dependency>
        <dependency>
            <groupId>commons-codec</groupId>
            <artifactId>commons-codec</artifactId>
            <version>1.8</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>commons-httpclient</groupId>
            <artifactId>commons-httpclient</artifactId>
            <version>3.1</version>
        </dependency>
        <dependency>
            <groupId>oauth.signpost</groupId>
            <artifactId>signpost-core</artifactId>
            <version>1.2</version>
        </dependency>
        <dependency>
            <groupId>com.google.apis</groupId>
            <artifactId>google-api-services-customsearch</artifactId>
            <version>v1-rev32-1.17.0-rc</version>
        </dependency>
        <dependency>
            <groupId>com.google.http-client</groupId>
            <artifactId>google-http-client-jackson</artifactId>
            <version>1.17.0-rc</version>
        </dependency>
        <dependency>
            <groupId>com.google.oauth-client</groupId>
            <artifactId>google-oauth-client-java6</artifactId>
            <version>1.17.0-rc</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.16</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.1.1</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.10</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>