I have recently been spending some of my spare time learning about AI and machine learning. After a couple of books, a bunch of tutorials and most of Andrew Ng's Coursera course, I decided: enough with the theory, time for some real code.
During all my late-night reading I also stumbled across the Loebner Prize and its most recent winner Mitsuku, as well as A.L.I.C.E. and Cleverbot. To be honest, perhaps given my naivety about the field of AI, I expected much more from this "AI" / technology; most of these current chatbots are easily confused and honestly not very impressive.
Thankfully I also found Eric Horvitz's video of his AI personal assistant, which resonated with what I wanted to achieve with my ventures into AI.
So, with human interaction as a focal point, I started designing "Aria" (Artificially Intelligent Research Assistant, in reverse). Since most of my development experience is in the Java enterprise environment, it will be built at a distributed, enterprise scale using the great technologies that ecosystem offers, to mention some: Hadoop, Spark, Mahout, Solr, MySQL, Neo4J, Spring...
My moonshot/daydream goal is to better the interactions between people and computers, but in reality if I only learn to use, implement and enjoy all that is involved with AI and ML, I will see myself as successful.
So, for my first bit of functional machine learning...
Predicting the next most probable part of speech. One of the issues with natural language processing is that words used in different contexts end up having different meanings and synonyms. To try to assist with this I figured I would train a neural network with the relevant parts of speech, and then use that to assist in understanding user-submitted text.
The full code for this example is available on GitHub.
I used a number of Java open source libraries for this:
Encog
Neuroph
Stanford NLP
Google Guava
I used a dataset of 29 000 English sentences that I sourced from a bunch of websites and open corpora. I won't be sharing those as I have no clue what the state of the copyright is, so unfortunately to recreate this you'd need to source your own data.
For the neural network implementation, I tried both Neuroph and Encog. Neuroph got my attention first with its great UI, which allowed me to experiment with my neural network visually in the beginning, but as soon as I created my training data, which ended up being about 300MB of 0's and 1's, it fell over and didn't allow me to use it. I then began looking at Encog again, as I had used it initially when just starting to read about ML and AI.
When using Neuroph in code it worked with the dataset, but only with backpropagation; the resilient propagation implementation never seemed to return.
So I ended up much preferring Encog: its resilient propagation implementation (iRPROP+) worked well and reduced the network error to about 0.018 in under 100 iterations, without me having to fine-tune the settings or network architecture.
How this works: I take text data and use the Stanford NLP library to generate a list of the parts of speech in the document. I translate their annotations into an internal enum, and then use that to build up a training data set. I currently persist that to file, just to save some time while testing. I then train, persist and test the neural network.
The Parts of Speech Enum:
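The gist for the enum didn't make it into this post's text, but a minimal sketch of what it could look like (the tag names and Penn Treebank mapping below are illustrative, not necessarily the ones in the repo):

// Hypothetical, coarse-grained set of parts of speech the network is trained on.
public enum PartOfSpeech {
    NOUN, PROPER_NOUN, PRONOUN, VERB, ADJECTIVE, ADVERB,
    DETERMINER, PREPOSITION, CONJUNCTION, NUMBER, PUNCTUATION, OTHER;

    // One-hot encode this tag for use as neural network input/output.
    public double[] toVector() {
        double[] vector = new double[values().length];
        vector[ordinal()] = 1.0;
        return vector;
    }

    // Hypothetical mapping from a Penn Treebank tag string to this enum.
    public static PartOfSpeech fromPennTag(String tag) {
        if (tag.startsWith("NNP")) return PROPER_NOUN;
        if (tag.startsWith("NN")) return NOUN;
        if (tag.startsWith("VB")) return VERB;
        if (tag.startsWith("JJ")) return ADJECTIVE;
        if (tag.startsWith("RB")) return ADVERB;
        if (tag.startsWith("PRP") || tag.startsWith("WP")) return PRONOUN;
        if (tag.equals("DT") || tag.equals("WDT")) return DETERMINER;
        if (tag.equals("IN") || tag.equals("TO")) return PREPOSITION;
        if (tag.equals("CC")) return CONJUNCTION;
        if (tag.equals("CD")) return NUMBER;
        if (tag.matches("\\p{Punct}+")) return PUNCTUATION;
        return OTHER;
    }
}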
The creation of the training data:
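Again the gist isn't embedded here; roughly, the tagging step looks like this, assuming the Stanford CoreNLP pipeline with the pos annotator and the hypothetical PartOfSpeech enum sketched above:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class TrainingDataBuilder {

    // Tag the raw text and return the sequence of parts of speech found in it.
    public static List<PartOfSpeech> tag(String text) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        List<PartOfSpeech> tags = new ArrayList<PartOfSpeech>();
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String pennTag = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                tags.add(PartOfSpeech.fromPennTag(pennTag)); // hypothetical mapping from the enum sketch
            }
        }
        // Each training row then pairs tag[i]'s one-hot vector (input)
        // with tag[i + 1]'s one-hot vector (ideal output).
        return tags;
    }
}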
Train the network:
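The training part is essentially the standard Encog feed-forward network plus iRPROP+ recipe; a sketch (the layer sizes and stopping condition are illustrative):

import org.encog.engine.network.activation.ActivationSigmoid;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.neural.networks.BasicNetwork;
import org.encog.neural.networks.layers.BasicLayer;
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

public class NetworkTrainer {

    // input[][] and ideal[][] are the one-hot rows built from the tagged sentences.
    public static BasicNetwork train(double[][] input, double[][] ideal) {
        BasicNetwork network = new BasicNetwork();
        network.addLayer(new BasicLayer(null, true, input[0].length));                     // input layer
        network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 30));               // hidden layer (illustrative size)
        network.addLayer(new BasicLayer(new ActivationSigmoid(), false, ideal[0].length)); // output layer
        network.getStructure().finalizeStructure();
        network.reset();

        MLDataSet trainingSet = new BasicMLDataSet(input, ideal);
        ResilientPropagation train = new ResilientPropagation(network, trainingSet);
        int epoch = 0;
        do {
            train.iteration();
            epoch++;
        } while (train.getError() > 0.02 && epoch < 100);
        train.finishTraining();
        return network;
    }
}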
Test:
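And testing is just feeding a tag's one-hot vector through the trained network and picking the strongest output, something like:

import org.encog.ml.data.MLData;
import org.encog.ml.data.basic.BasicMLData;
import org.encog.neural.networks.BasicNetwork;

public class NetworkTester {

    // Returns the predicted next part of speech for the given current tag.
    public static PartOfSpeech predictNext(BasicNetwork network, PartOfSpeech current) {
        MLData output = network.compute(new BasicMLData(current.toVector()));
        int best = 0;
        for (int i = 1; i < output.size(); i++) {
            if (output.getData(i) > output.getData(best)) {
                best = i;
            }
        }
        return PartOfSpeech.values()[best];
    }
}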
Sunday, December 15, 2013
Sunday, October 13, 2013
Setting up multiple versions of Python on Ubuntu
I recently switched from using a Mac back to a PC. I had originally planned to use both Windows and Linux via dual-boot, but having purchased a Radeon card and with Ubuntu not even starting from the bootable USB, I decided to try running my Python development environment on Windows. After playing with Python on Windows, I found it quite tedious to maintain both a 2.7.5 and a 3.3.2 environment. I also didn't like having to rely on http://www.lfd.uci.edu/~gohlke/pythonlibs/ for all the 'pain-free' installs, since trying to compile some of the libs with the required C++ compiler is an even bigger pain.
So I went with a colleague's suggestion of VMware Player 6, and installed Ubuntu.
After breaking a couple of installs and recreating VMs left and right, I finally have a process to install and work with multiple versions of Python.
First up, get a whole bunch of dependencies:
sudo apt-get install python-dev build-essential
sudo apt-get install python-pip
sudo apt-get install libsqlite3-dev sqlite3
sudo apt-get install libreadline-dev libncurses5-dev
sudo apt-get install libssl1.0.0 tk8.5-dev zlib1g-dev liblzma-dev
sudo apt-get build-dep python2.7
sudo apt-get build-dep python3.3
sudo pip install virtualenv
sudo pip install virtualenvwrapper
Add the virtualenvwrapper settings to ~/.bashrc:
export WORKON_HOME="$HOME/.virtualenvs"
source /usr/local/bin/virtualenvwrapper.sh
Then for Python 2.7:
sudo mkdir /opt/python2.7.5
wget http://python.org/ftp/python/2.7.5/Python-2.7.5.tgz
tar xvfz Python-2.7.5.tgz
cd Python-2.7.5/
./configure --prefix=/opt/python2.7.5
make
sudo make install
mkvirtualenv --python /opt/python2.7.5/bin/python2 v-2.7.5
Then for Python 3.3:
sudo mkdir /opt/python3.3.2
wget http://python.org/ftp/python/3.3.2/Python-3.3.2.tgz
tar xvfz Python-3.3.2.tgz
cd Python-3.3.2
./configure --prefix=/opt/python3.3.2
make
sudo make install
mkvirtualenv --python /opt/python3.3.2/bin/python3 v-3.3.2
To view the virtual environments:
lsvirtualenv
To change between them:
workon [env name] e.g. v-3.3.2
Then to install some of the major scientific and machine learning related packages:
pip install numpy
pip install ipython[all]
pip install cython
sudo apt-get build-dep python-scipy
pip install scipy
pip install matplotlib
pip install scikit-learn
pip install pandas
To stop working on a particular version:
deactivate
Sunday, September 15, 2013
Wordle... so nicely done
Discovered Wordle this morning and pointed it at my blog... guess my recent posts really haven't been about Java much :)
Saturday, September 7, 2013
Sourcing Twitter data, based on search terms
I started messing about with sourcing data from Twitter, looking to use this with NLTK and maybe SOLR sometime in the future. I created a simple IPython Notebook on how to grab data from a Twitter search stream; all details are included in the notebook.
I unfortunately couldn't find a simple way to embed the notebook in Blogger, and not wanting to waste time on that I just hosted it as a Gist. It can be viewed here: NBViewer
Wednesday, September 4, 2013
Review: Learning IPython for Interactive Computing and Data Visualization
I have just completed working through Learning IPython for Interactive Computing and Data Visualization.
Having seen references to IPython from my first ever Google search for 'python', I somehow managed to disregard it with the sentiment of: who works in a console? Or a browser notebook? What is that? ...
I need an IDE with folders / modules / files / projects... what a shame I wasted so much time...
I blame too many years in Visual Studio, Eclipse, JetBrains IDEs and Xcode for making me ignore it this long.
Thankfully I have gotten past that, and this book helps you get there fast... < 150 pages fast.
IPython, and especially the IPython Notebooks are great tools. I can see it being awesome for a whole number of tasks:
- learning python and working through books and tutorials
- running data mining brainstorming sessions
- showing people the latest and greatest stuff you've come up with
- quick cython implementations & performance experiments
- processing across multiple cores / servers
- I even saw Harvard now uses it for homework assignments.
That list can just go on and on, but coming back to the book: it targeted Python 2.7. Obviously I didn't listen and worked through it in Python 3.3, but thankfully only a couple of very minor changes were needed:
The book uses urllib2 in a couple of places; in Python 3 that can be replaced with:
import urllib
r = urllib.request.urlopen(')
For the networkx example there was also a slight change:
sg = nx.connected_component_subgraphs(g)
This returned a list of graphs, not a graph, so I just looped the following:
for grp in sg:
nx.draw_networkx(grp, node_size...
Then for the maps exercise I did not have all the dependencies:
I needed to install GEOS... I used MacPorts for that:
sudo port install geos
Then in my .bash_profile I added:
export GEOS_DIR=/opt/local
To refresh the profile:
source ~/.bash_profile
Then for Basemap, I downloaded the zip, here.
Followed by (in the basemap-1.0.7 dir):
python setup.py install
That's about it, a concise intro to a great product.
Now to really put it to the test the next book I am working through:
Building Machine Learning Systems with Python
Sunday, August 25, 2013
Things I learned while reading Programming Collective Intelligence.
I have been working through Programming Collective Intelligence over the last couple of months. I have to say it's probably been one of the best learning experiences I have had in my years of programming. Compared to some of my previous technology stack / paradigm change experiences:
Muggle -> VB4
VB6 - > Java
Java -> .Net
Java -> iOS mobile / game development
This is the biggest, not so much from the technology stack, but more purely due to the size and complexity of all things ML and AI. Not coming from a mathematical / statistical background, it's really quite a deep hole to jump into, and quite a challenge.
Not only did this book walk me through a bunch of machine learning and data analysis theory, it got me to learn Python, and in translating to Java I also got introduced to a whole bunch of Java-related tools and frameworks.
I created blog posts for chapters 2-8, and decided to just work through the Python for chapters 9, 10, 11 and 12, for 2 reasons:
1. Improve my Python
2. Get it done so I can move onto my new personal project, using all this ML and Python knowledge to create a cross-platform application with a rich UI using either Kivy or Qt.
To list some of the ML / Data Analysis topics covered in PCI:
- Classifiers
- Neural Networks
- Clustering
- Web crawlers
- Data indexers
- PageRank algorithm
- Genetic Algorithms
- Simulated Annealing
- K-Nearest Neighbours
- Bayesian filtering
- Decision trees
- Support vector machines
- Kernel Methods
- Linear Regression
- Evolving intelligence
The Java tools, libs and frameworks investigated:
- Encog
- Neo4J
- Google Guava
- Crawler4J
- Java Tuples
- Graphstream
- SQLite
- Rome
- JSoup
Python tools, libs and resources discovered:
Thursday, August 22, 2013
Getting Kivy to run on MacOSX with PyCharm and Virtual Env
I just had a little bit of a struggle getting Kivy to run from my PyCharm IDE; this is how I solved it.
My initial Python environment setup was done as follows:
I installed my Python framework via MacPorts.
For Python 3:
sudo port install py33-numpy py33-scipy py33-matplotlib py33-ipython +notebook py33-pandas py33-sympy py33-nose
For Python 2.7:
sudo port install py27-numpy py27-scipy py27-matplotlib py27-ipython +notebook py27-pandas py27-sympy py27-nose
To set your MacPorts Python to the default:
For Python 3:
sudo port select --set ipython ipython33
sudo port select --set python python33
For Python 2.7:
sudo port select --set ipython ipython27
sudo port select --set python python27
Adding this to your .profile is probably a good idea when you use MacPorts:
export PATH=/opt/local/bin:/opt/local/sbin:$PATH
This installs pretty much all the major packages (some of which I could install via PyCharm's package install interface), including cython needed by Kivy.
Then in PyCharm I created a virtual environment, and installed pip onto that. I could install Kivy directly in PyCharm, but it still requires PyGame to actually run.
PyGame I found requires X11 / XQuartz, which is no longer bundled with OSX and can be downloaded from:
http://xquartz.macosforge.org/landing/
Once that is installed, run the MacPorts mercurial install first, else you'll get "The command named 'hg' could not be found":
sudo port install mercurial
Then from the bin of my virtual env I could install pygame:
./pip-2.7 install hg+http://bitbucket.org/pygame/pygame
After that I could execute my app from the run configurations within PyCharm.
Saturday, August 17, 2013
Creating a price model using k-Nearest Neighbours + Genetic Algorithm
Chapter 8 of Programming Collective Intelligence (PCI) explains the usage and implementation of the k-Nearest Neighbours algorithm. (k-NN).
Simply put:
k-NN is a classification algorithm that uses (k) for the number of neighbours to determine what class an item will belong to. To determine the neighbours to be used, the algorithm uses a distance / similarity score function, in this example Euclidean distance.
PCI takes it a little further to help with accuracy in some scenarios. This includes using a weighted average of the neighbours (see the sketch below), as well as using either simulated annealing or genetic algorithms to determine the best weights, building on Optimization techniques - Simulated Annealing & Genetic Algorithms.
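The weighted-average part boils down to something like the following plain-Java sketch (inverse-distance weighting shown here; PCI also describes a Gaussian weight, and the repo code will differ):

// distances[i]: distance from the query item to training item i; values[i]: its known price.
// Returns a weighted average of the k nearest items, where closer neighbours count more.
public static double weightedKnnEstimate(double[] distances, double[] values, int k) {
    boolean[] used = new boolean[distances.length];
    double totalWeight = 0.0;
    double weightedSum = 0.0;
    for (int n = 0; n < k && n < distances.length; n++) {
        // pick the nearest item not selected yet
        int nearest = -1;
        for (int i = 0; i < distances.length; i++) {
            if (!used[i] && (nearest == -1 || distances[i] < distances[nearest])) {
                nearest = i;
            }
        }
        used[nearest] = true;
        double weight = 1.0 / (distances[nearest] + 0.1); // small constant avoids division by zero
        totalWeight += weight;
        weightedSum += weight * values[nearest];
    }
    return weightedSum / totalWeight;
}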
As with all the previous chapters the code is in my github repository.
So the similarity score function looked like (slightly different to the one used earlier, which was inverted to return 1 if equals):
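The embedded gist isn't shown here; in shape it is just the straight Euclidean distance (smaller means more similar), along these lines:

// Straight Euclidean distance between two equally sized vectors:
// 0 means identical, larger means less similar.
public static double euclideanDistance(double[] a, double[] b) {
    double sumOfSquares = 0.0;
    for (int i = 0; i < a.length; i++) {
        sumOfSquares += Math.pow(a[i] - b[i], 2);
    }
    return Math.sqrt(sumOfSquares);
}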
I updated the simulated annealing and genetic algorithm code, as I had originally implemented them using ints... (lesson learnt: when doing anything to do with ML or AI, stick to doubles).
Then, finally putting it all together, my Java implementation of the PCI example:
While reading up some more on k-NN I also stumbled upon the following blog posts.
The first one describes some of the difficulties around using k-NN:
k-Nearest Neighbors - dangerously simple
And then one giving a great overview of k-NN
A detailed introduction to k-NN algorithm
Sunday, August 11, 2013
Decision Trees
I just completed working through Chapter 7 of Programming Collective Intelligence (PCI). This chapter demonstrates how, when and where you should use the decision tree construct. The method described is the CART technique.
The basic summary is: a decision tree has each branch node represent a choice between a number of alternatives, and each leaf node represent a decision (or classification). This makes the decision tree another supervised machine learning algorithm useful for classifying information.
The main problem to overcome in defining a decision tree is how to identify the best split of the data points. To find this you need to go through all the sets of data, identify which split will give you the best gain, and start from there.
For some more technical information about this split / gain:
http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
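In code, scoring a split is just Shannon entropy plus the weighted entropy difference after the split; a small sketch:

// Shannon entropy of the class counts in one set of rows (total = sum of classCounts).
public static double entropy(int[] classCounts, int total) {
    double ent = 0.0;
    for (int count : classCounts) {
        if (count == 0) {
            continue;
        }
        double p = (double) count / total;
        ent -= p * (Math.log(p) / Math.log(2));
    }
    return ent;
}

// Information gain of splitting a parent set into two subsets of sizes n1 and n2.
public static double informationGain(double parentEntropy,
                                     double entropy1, int n1,
                                     double entropy2, int n2) {
    double total = n1 + n2;
    return parentEntropy - (n1 / total) * entropy1 - (n2 / total) * entropy2;
}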
The biggest advantages I see in using a decision tree are:
It's easy to interpret and visualise.
Data doesn't need to be normalised or scaled to something between -1 and 1.
Decision trees, however, can't be used effectively on large datasets with a large number of results.
As with my previous Classifiers post, I ended up using the SQLite in-memory db as it's such a pleasure to use. I did venture into using LambdaJ, but it actually ended up being such an ugly line of code that I left it and simply did it manually. I have not looked at the Java 8 implementation of lambdas yet; I just hope it doesn't end in code like this (with a whole bunch of static imports):
falseList.add(filter(not(having(on(List.class).get(col).toString(), equalTo((String) value))), asList(rows)));
So my Java implementation of the PCI decision tree ended up looking like this (all code on GitHub):
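The gist itself isn't embedded here, but the core of it is a recursive node structure, roughly along these lines (a simplified outline, not the exact repo code):

import java.util.Map;

// Rough outline of a CART-style node: a leaf carries the classification counts,
// an internal node tests `column` against `value` and has two child branches.
public class DecisionNode {
    private final int column;                   // column index tested at this node
    private final Object value;                 // value the column is compared against
    private final Map<String, Integer> results; // non-null only for leaf nodes
    private final DecisionNode trueBranch;      // rows that pass the test
    private final DecisionNode falseBranch;     // rows that fail the test

    public DecisionNode(int column, Object value, Map<String, Integer> results,
                        DecisionNode trueBranch, DecisionNode falseBranch) {
        this.column = column;
        this.value = value;
        this.results = results;
        this.trueBranch = trueBranch;
        this.falseBranch = falseBranch;
    }

    public boolean isLeaf() {
        return results != null;
    }

    // Walk the tree for a single row of data and return the leaf's class counts.
    public Map<String, Integer> classify(Object[] row) {
        if (isLeaf()) {
            return results;
        }
        Object observed = row[column];
        boolean matches;
        if (value instanceof Number && observed instanceof Number) {
            matches = ((Number) observed).doubleValue() >= ((Number) value).doubleValue();
        } else {
            matches = value.equals(observed);
        }
        return matches ? trueBranch.classify(row) : falseBranch.classify(row);
    }
}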
(Once again... about 50% more code :) ). I am really beginning to enjoy Python; I can see myself using it as a first choice for all future AI / ML type work.
Tuesday, July 30, 2013
Document Filtering - Classifiers
Chapter 6 of Programming Collective Intelligence (PCI) demonstrates how to classify documents based on their content.
I used one extra Java open source library for this chapter, and its implementation was completely painless.
What a pleasure: a simple Maven include, and that's it, a little file- or memory-based SQL db in your code.
My full Java implementation of some of the topics is available on my GitHub repo, but I will highlight the Fisher method (see Fisher's discriminant analysis or LDA if you want to get a lot more technical).
What has made PCI a good book is its ability to summarise quite complex theoretical and mathematical concepts down to basics and code, for us lowly developers to use practically.
To Quote:
"the Fisher method calculates the probability of a category for each feature of the document, then combines the probabilities and test to see if the set of probabilities is more or less likely than a random set. This method also returns a probability for each category that can be compared to others"
During the writing of this post, I discovered the following blog:
Shape of data
Seems well worth the read; I will be spending the next couple of days on that before continuing with PCI chapter 7... Decision Trees.
Monday, July 22, 2013
Optimization techniques - Simulated Annealing & Genetic Algorithms
Chapter 5 of Programming Collective Intelligence (PCI) deals with optimisation problems.
To Quote:
"Optimisation finds the best solution to a problem by trying many different solution and scoring them to determine their quality. Optimisation is typically used in cases where there are too many possible solutions to try them all"
Before embarking on this chapter I decided that it would be best to quickly learn Python; there just seems to be a lot of Python around as soon as you start learning and reading about machine learning and data analysis, and it can't really be ignored.
(Still not sure why this is the case, but I set out to get up and running with Python in one weekend.)
Some of the resources I used:
http://www.python.org
http://www.stavros.io/tutorials/python/
http://www.diveintopython.net/index.html
http://docs.python-guide.org/en/latest/
As a developer, learning the basics of Python really isn't very difficult; to be honest it probably took me longer to find a development environment I was happy with, as consoles and text editors just don't do it for me.
The main ones I investigated were:
Ninja IDE (Free)
Eclipse + PyDev (Free)
PyCharm ($99)
I spent quite a bit of time playing with Ninja IDE and Eclipse, but there were just little things that kept bugging me, from strange shortcuts to highlighting correct code / syntax as incorrect.
10 minutes after installing PyCharm, I was sold. To be fair, I am probably not the best person to judge.
I code in IntelliJ daily and actually ended up converting all the Java developers in my department to drop Eclipse and start using IntelliJ... I also did the majority of my Objective-C work in AppCode. In other words, I am a JetBrains fanboy, happy to hand over my money for an awesome tool.
Getting back to PCI, there were a couple of issues with the code in this chapter, which caused me (a person who had just learnt Python) a little bit of pain, because I figured the code had to be right and I was just doing something wrong. Eventually I went searching and found:
PCI Errata
With that I corrected the issues in the hillclimb and genetic algorithm functions:
The Java implementation for the 3 functions ended up twice as long, looking like this:
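The gists aren't embedded here; as a flavour of what the Java looks like, here is a condensed simulated annealing sketch (the cost function interface and domain handling are simplified, not the exact repo code):

import java.util.Arrays;
import java.util.Random;

public class Annealing {

    // Something that can score a candidate solution; lower cost is better.
    public interface CostFunction {
        double cost(double[] solution);
    }

    public static double[] anneal(double[] initial, double[] min, double[] max,
                                  CostFunction costFunction, double step) {
        Random random = new Random();
        double[] current = Arrays.copyOf(initial, initial.length);
        double temperature = 10000.0;
        double cool = 0.95;

        while (temperature > 0.1) {
            // nudge one randomly chosen dimension by up to +/- step, keeping it in its domain
            int i = random.nextInt(current.length);
            double direction = (random.nextDouble() * 2 - 1) * step;
            double[] candidate = Arrays.copyOf(current, current.length);
            candidate[i] = Math.max(min[i], Math.min(max[i], candidate[i] + direction));

            double currentCost = costFunction.cost(current);
            double candidateCost = costFunction.cost(candidate);
            // always accept improvements, sometimes accept worse solutions while still hot
            double acceptProbability = Math.exp(-(candidateCost - currentCost) / temperature);
            if (candidateCost < currentCost || random.nextDouble() < acceptProbability) {
                current = candidate;
            }
            temperature *= cool;
        }
        return current;
    }
}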
And unlike my previous posts on PCI, I didn't use a whole bunch of open source libraries, only added one.
Java Tuples.
The whole Chapter 5 Optimisation solution is in my blog GitHub repo; the concepts used in both the simulated annealing and genetic algorithm implementations could easily be adapted and reused if you're looking for a simple example of those concepts.
Now for Chapter 6 ... Document Filtering...
Wednesday, July 3, 2013
Mini Search Engine - Just the basics, using Neo4j, Crawler4j, Graphstream and Encog
Continuing to chapter 4 of Programming Collective Intelligence (PCI), which is implementing a search engine.
I may have bitten off a little more than I should have in one exercise. Instead of using the normal relational database construct as used in the book, I figured that since I had always wanted to have a look at Neo4J, now was the time. Just to say, this isn't necessarily the ideal use case for a graph db, but how hard could it be to kill three birds with one stone?
Working through the tutorials trying to reset my SQL Server, Oracle mindset took a little longer than expected, but thankfully there are some great resources around Neo4j.
Just a couple:
neo4j - learn
Graph theory for busy developers
Graphdatabases
Since I just wanted to run this as a little exercise, I decided to go for an in-memory implementation and not run it as a service on my machine. In hindsight this was probably a mistake; the tools and web interface would have helped me visualise my data graph quicker in the beginning.
As you can only have one writable instance of the in-memory implementation, I made a little double-checked-locking singleton factory to create and clear the DB.
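Something along these lines (a sketch assuming the embedded GraphDatabaseFactory API from the Neo4j 1.9/2.x era; the store path is illustrative):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public final class GraphDb {

    private static final String STORE_DIR = "target/search-engine-db"; // illustrative path
    private static volatile GraphDatabaseService instance;

    private GraphDb() {
    }

    // Classic double-checked locking so only one writable instance is ever created.
    public static GraphDatabaseService getInstance() {
        if (instance == null) {
            synchronized (GraphDb.class) {
                if (instance == null) {
                    instance = new GraphDatabaseFactory().newEmbeddedDatabase(STORE_DIR);
                    Runtime.getRuntime().addShutdownHook(new Thread() {
                        public void run() {
                            instance.shutdown(); // cleanly close the db when the JVM exits
                        }
                    });
                }
            }
        }
        return instance;
    }
}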
Then, using Crawler4j, I created a graph of all the URLs starting with my blog, their relationships to other URLs, and all the words and indexes of the words that those URLs contain.
After the data was collected, I could query it and perform the functions of a search engine. For this I decided to use Java futures, as it was another thing I had only read about and not yet implemented. In my day-to-day working environment we use WebLogic / CommonJ work managers within the application server to perform the same task.
I then went about creating a task for each of the following: counting the word frequency, document location, PageRank, and a neural network (with fake input / training data) to rank the pages returned based on the search criteria. All the code is in my public GitHub blog repo.
Disclaimer: the neural network task either didn't have enough data to be effective, or I implemented the data normalisation incorrectly, so it is currently not very useful. I'll return to it once I have completed the journey through the whole PCI book.
The one task worth sharing was the PageRank one. I quickly read some of the theory for it, decided I am not that clever, and went searching for a library that had it implemented. I discovered GraphStream, a wonderful open source project that does a WHOLE lot more than just PageRank; check out their video.
From that it was then simple to implement my PageRank task of this exercise.
In between all of this I found a great implementation of sorting a map by values on Stackoverflow.
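For reference, the usual shape of that answer, Java 7 friendly and sorting descending by value:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class MapSorter {

    // Returns a copy of the map whose iteration order is by descending value.
    public static <K, V extends Comparable<? super V>> Map<K, V> sortByValueDescending(Map<K, V> map) {
        List<Map.Entry<K, V>> entries = new ArrayList<Map.Entry<K, V>>(map.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<K, V>>() {
            public int compare(Map.Entry<K, V> a, Map.Entry<K, V> b) {
                return b.getValue().compareTo(a.getValue()); // highest score first
            }
        });
        Map<K, V> sorted = new LinkedHashMap<K, V>(); // LinkedHashMap keeps the sorted order
        for (Map.Entry<K, V> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }
}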
The Maven dependencies used to implement all of this:
Now to chapter 5 on PCI... Optimisation.
Monday, July 1, 2013
A couple useful Oracle XE admin commands
I struggled a bit trying to get my local Oracle XE up and running after a couple months of being dormant.
Firstly: Oracle XE 11g sets password expiry by default. Quite annoying...
So my system account was locked.
To unlock it I did the following on the Windows command prompt:
set ORACLE_SID=XE
set ORACLE_HOME= "ORACLE_PATH" (D:\OracleXe\app\oracle\product\11.2.0\server) in my case.
sqlplus / as sysdba
ALTER USER SYSTEM identified by password;
If the account is locked run:
ALTER USER system ACCOUNT UNLOCK;
Then, to ensure that it does not expire again:
ALTER PROFILE DEFAULT LIMIT
FAILED_LOGIN_ATTEMPTS UNLIMITED
PASSWORD_LIFE_TIME UNLIMITED;
One more thing I needed to change since I had installed a local Tomcat, is the default HTTP port for XE.
This can be done as follows, where 3010 is the new port:
Exec DBMS_XDB.SETHTTPPORT(3010)
Sunday, June 16, 2013
Blog Categorisation using Encog, ROME, JSoup and Google Guava
Continuing with Programming Collective Intelligence (PCI), the next exercise was using the distance scores to pigeonhole a list of blogs based on the words used within the relevant blog.
I had already found Encog as the framework for the AI / machine learning algorithms; for this exercise I needed an RSS reader and an HTML parser.
The 2 libraries I ended up using were:
ROME
JSoup
For general other utilities and collection manipulations I used:
Google Guava
I kept the list of blogs short and included some of the software bloggers I follow, just to make testing quick. I had to alter the percentages a little from the implementation in PCI, but still got the desired result.
Blogs Used:
http://blog.guykawasaki.com/index.rdf
http://blog.outer-court.com/rss.xml
http://flagrantdisregard.com/index.php/feed/
http://gizmodo.com/index.xml
http://googleblog.blogspot.com/rss.xml
http://radar.oreilly.com/index.rdf
http://www.wired.com/rss/index.xml
http://feeds.feedburner.com/codinghorror
http://feeds.feedburner.com/joelonsoftware
http://martinfowler.com/feed.atom
http://www.briandupreez.net/feeds/posts/default
For the implementation I just went with a main class and a reader class:
Main:
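The gists aren't embedded here; as a flavour of the moving parts, the reader side boils down to roughly this (assuming the ROME 1.0 com.sun.syndication API and Guava's Multiset for the word counts; class and method names are illustrative):

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;
import org.jsoup.Jsoup;

import java.net.URL;

public class FeedReader {

    // Count how often each word appears across all entries of one RSS/Atom feed.
    public Multiset<String> countWords(String feedUrl) throws Exception {
        SyndFeed feed = new SyndFeedInput().build(new XmlReader(new URL(feedUrl)));
        Multiset<String> words = HashMultiset.create();
        for (Object entryObject : feed.getEntries()) {
            SyndEntry entry = (SyndEntry) entryObject;
            if (entry.getDescription() == null) {
                continue;
            }
            // JSoup strips the HTML markup, leaving just the text.
            String text = Jsoup.parse(entry.getDescription().getValue()).text();
            for (String word : text.toLowerCase().split("[^a-z]+")) {
                if (word.length() > 2) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}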
The Results:
*** Cluster 1 ***
[http://www.briandupreez.net/feeds/posts/default]
*** Cluster 2 ***
[http://blog.guykawasaki.com/index.rdf]
[http://radar.oreilly.com/index.rdf]
[http://googleblog.blogspot.com/rss.xml]
[http://blog.outer-court.com/rss.xml]
[http://gizmodo.com/index.xml]
[http://flagrantdisregard.com/index.php/feed/]
[http://www.wired.com/rss/index.xml]
*** Cluster 3 ***
[http://feeds.feedburner.com/joelonsoftware]
[http://feeds.feedburner.com/codinghorror]
[http://martinfowler.com/feed.atom]
Wednesday, June 12, 2013
Regex POSIX expressions
I can't believe I only found out about these today; I obviously don't use regular expressions enough.
Posix Brackets
Quick Reference:
POSIX | Description | ASCII | Unicode | Shorthand | Java |
---|---|---|---|---|---|
[:alnum:] | Alphanumeric characters | [a-zA-Z0-9] | [\p{L&}\p{Nd}] | | \p{Alnum} |
[:alpha:] | Alphabetic characters | [a-zA-Z] | \p{L&} | | \p{Alpha} |
[:ascii:] | ASCII characters | [\x00-\x7F] | \p{InBasicLatin} | | \p{ASCII} |
[:blank:] | Space and tab | [ \t] | [\p{Zs}\t] | | \p{Blank} |
[:cntrl:] | Control characters | [\x00-\x1F\x7F] | \p{Cc} | | \p{Cntrl} |
[:digit:] | Digits | [0-9] | \p{Nd} | \d | \p{Digit} |
[:graph:] | Visible characters (i.e. anything except spaces, control characters, etc.) | [\x21-\x7E] | [^\p{Z}\p{C}] | | \p{Graph} |
[:lower:] | Lowercase letters | [a-z] | \p{Ll} | | \p{Lower} |
[:print:] | Visible characters and spaces (i.e. anything except control characters, etc.) | [\x20-\x7E] | \P{C} | | \p{Print} |
[:punct:] | Punctuation and symbols | [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~] | [\p{P}\p{S}] | | \p{Punct} |
[:space:] | All whitespace characters, including line breaks | [ \t\r\n\v\f] | [\p{Z}\t\r\n\v\f] | \s | \p{Space} |
[:upper:] | Uppercase letters | [A-Z] | \p{Lu} | | \p{Upper} |
[:word:] | Word characters (letters, numbers and underscores) | [A-Za-z0-9_] | [\p{L}\p{N}\p{Pc}] | \w | |
[:xdigit:] | Hexadecimal digits | [A-Fa-f0-9] | [A-Fa-f0-9] | | \p{XDigit} |
Sunday, May 19, 2013
Some Java based AI Frameworks : Encog, JavaML, Weka
While working through Programming Collective Intelligence I found myself spending a lot of time translating the Python code to Java. Being typically impatient at my slow progress, I went searching for alternatives.
I found 3:
Encog - Heaton Research
JavaML
Weka
This is by no means an in-depth investigation, I simply downloaded what the relevant projects had available and quickly compared what was available to me to learn and implement AI related samples / applications.
Encog
Advantages
- You Tube video tutorials
- E-Books available for both Java and .Net
- C# implementation
- Clojure wrapper
- Seems active
Disadvantages
- Quite a large code base to wrap your head around; this is probably due to the size of the domain we are looking at, but still much more intimidating to start off with vs. the Java ML library.
JavaML
Advantages
- Seems reasonably stable
- Well documented source code
- Well defined simple algorithm implementations
Disadvantages
- Lacks the tutorial support for an AI newbie like myself
Weka
Advantages
Disadvantages
- Could not install the Weka 3-7-9 dmg... it kept on giving me an "is damaged and can't be opened" error, so I left it there; as Sweet Brown says: "Ain't nobody got time for that".
So no surprise I went with Encog, and started on their video tutorials....
A couple of hours later: my first JUnit test understanding, training and testing a Hopfield neural network using the Encog libs.
Saturday, May 11, 2013
Similarity Score Algorithms
As per my previous post, I am working through Programming Collective Intelligence. The first couple of algorithms described in this book are about finding a similarity score; the methods they work through are Euclidean distance and the Pearson correlation coefficient. The Manhattan distance score is also mentioned, but from what I could find it seems to just be the sum of the absolute differences of the coordinates, instead of the squared (Math.pow(x, 2)) differences used in Euclidean distance.
I worked through this and wrote/found some Java equivalents for future use:
Euclidean Distance:
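A Java version, inverted so that identical items score 1 (a sketch along the lines of what is in the repo):

// Returns 1.0 for identical vectors, tending towards 0 as they grow further apart.
public static double euclideanSimilarity(double[] a, double[] b) {
    double sumOfSquares = 0.0;
    for (int i = 0; i < a.length; i++) {
        sumOfSquares += Math.pow(a[i] - b[i], 2);
    }
    return 1.0 / (1.0 + Math.sqrt(sumOfSquares));
}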
Pearson Correlation Coefficient:
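And the Pearson correlation coefficient, again as a plain Java sketch:

// Pearson correlation between two equally sized series: 1 is perfectly correlated,
// 0 is no correlation, -1 is perfectly inversely correlated.
public static double pearson(double[] x, double[] y) {
    int n = x.length;
    double sumX = 0, sumY = 0, sumXSq = 0, sumYSq = 0, sumProducts = 0;
    for (int i = 0; i < n; i++) {
        sumX += x[i];
        sumY += y[i];
        sumXSq += x[i] * x[i];
        sumYSq += y[i] * y[i];
        sumProducts += x[i] * y[i];
    }
    double numerator = sumProducts - (sumX * sumY / n);
    double denominator = Math.sqrt((sumXSq - sumX * sumX / n) * (sumYSq - sumY * sumY / n));
    return denominator == 0 ? 0 : numerator / denominator;
}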
Friday, May 3, 2013
Venture into AI, Machine Learning and all those algorithms that go with it.
It's been 4 months since my last blog entry; I took it easy for a little while, as we all need to do from time to time... but before long my brain got these nagging ideas and questions:
How hard can AI and Machine learning actually be?
How does it work?
I bet people are just over complicating it..
How are they currently trying to solve it?
Is it actually that difficult?
Could it be done differently?
So off I went searching the internet; some of the useful sites I came across:
http://www.ai-junkie.com
Machine-learning Stanford Video course
Genetic algorithm example
I also ended up buying 2 books on Amazon:
Firstly, from many different recommendations:
Programming Collective Intelligence
I will be "working" through this book. While reading I will be translating, implementing and blogging the algorithms defined (in Python) as well as any mentioned that I will research separately in Java. Mainly for my own understanding and for the benefit of reusing them later, and an excuse to play with Java v7.
However, since I want to practically work through that book, I needed another for some "light" reading before sleep, I found another book from an article on MIT technology review Deep Learning, a bit that caught my eye was:
For all the advances, not everyone thinks deep learning can move artificial intelligence toward something rivaling human intelligence. Some critics say deep learning and AI in general ignore too much of the brain’s biology in favor of brute-force computing.
One such critic is Jeff Hawkins, founder of Palm Computing, whose latest venture, Numenta, is developing a machine-learning system that is biologically inspired but does not use deep learning. Numenta’s system can help predict energy consumption patterns and the likelihood that a machine such as a windmill is about to fail. Hawkins, author of On Intelligence, a 2004 book on how the brain works and how it might provide a guide to building intelligent machines, says deep learning fails to account for the concept of time. Brains process streams of sensory data, he says, and human learning depends on our ability to recall sequences of patterns: when you watch a video of a cat doing something funny, it’s the motion that matters, not a series of still images like those Google used in its experiment. “Google’s attitude is: lots of data makes up for everything,” Hawkins says.
So the second book I purchased - On Intelligence
So far (only up to page 54) 2 things from this book have embedded themselves in my brain:
"Complexity is a symptom of confusion, not a cause" - so so common in the software development world.
&
"AI defenders also like to point out historical instances in which the engineering solution differs radically from natures version"
...
"Some philosophers of mind have taken a shine to the metaphor of the cognitive wheel, that is, an AI solution to some problem that although entirely different from how the brain does it is just as good"
Jeff himself believes we need to look deeper into the brain for a better understanding, but could it be possible to have a completely different approach to solving the "intelligence" problem?
Thursday, January 3, 2013
Weblogic JNDI & Security Contexts
Quite often, when using multiple services / EJBs from different internal teams, we have run into WebLogic context / security errors. We always deduced the issue was how WebLogic handles its contexts; I finally found WebLogic's explanation in their documentation:
Link to the source Weblogic Docs: Weblogic JNDI
JNDI Contexts and Threads
When you create a JNDI Context with a username and password, you associate a user with a thread. When the Context is created, the user is pushed onto the context stack associated with the thread. Before starting a new Context on the thread, you must close the first Context so that the first user is no longer associated with the thread. Otherwise, users are pushed down in the stack each time a new context is created. This is not an efficient use of resources and may result in the incorrect user being returned by ctx.lookup() calls. This scenario is illustrated by the following steps:
- Create a Context (with username and credential) called ctx1 for user1. In the process of creating the context, user1 is associated with the thread and pushed onto the stack associated with the thread. The current user is now user1.
- Create a second Context (with username and credential) called ctx2 for user2. At this point, the thread has a stack of users associated with it. User2 is at the top of the stack and user1 is below it in the stack, so user2 is used as the current user.
- If you do a ctx1.lookup("abc") call, user2 is used as the identity rather than user1, because user2 is at the top of the stack. To get the expected result, which is to have the ctx1.lookup("abc") call performed as user1, you need to do a ctx2.close() call. The ctx2.close() call removes user2 from the stack associated with the thread, so that a ctx1.lookup("abc") call now uses user1 as expected.
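A minimal sketch of that scenario (the provider URL, user names and JNDI name are illustrative):

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.InitialContext;

public class JndiContextExample {

    // Helper to create a WebLogic JNDI context for a given user.
    private static Context createContext(String user, String password) throws Exception {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
        env.put(Context.PROVIDER_URL, "t3://localhost:7001"); // illustrative URL
        env.put(Context.SECURITY_PRINCIPAL, user);
        env.put(Context.SECURITY_CREDENTIALS, password);
        return new InitialContext(env);
    }

    public static void main(String[] args) throws Exception {
        Context ctx1 = createContext("user1", "password1"); // user1 now on the thread's stack
        ctx1.lookup("some/jndi/name");                      // runs as user1
        ctx1.close();                                       // pop user1 before switching users

        Context ctx2 = createContext("user2", "password2"); // user2 is now the current user
        ctx2.lookup("some/jndi/name");                      // runs as user2
        ctx2.close();
    }
}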
Note: When the weblogic.jndi.enableDefaultUser flag is enabled, there are two situations where a close() call does not remove the current user from the stack and this can cause JNDI context problems. For information on how to avoid JNDI context problems, see How to Avoid Potential JNDI Context Problems.
How to Avoid Potential JNDI Context Problems
Issuing a close() call usually behaves as described in JNDI Contexts and Threads. However, the following is an exception to the expected behavior that occurs when the weblogic.jndi.enableDefaultUser flag is enabled:
Last Used
When using IIOP, an exception to expected behavior arises when there is one Context on the stack and that Context is removed by a close(). The identity of the last context removed from the stack determines the current identity of the user. This scenario is described in the following steps:
- Create a Context (with username and credential) called ctx1 for user1. In the process of creating the context, user1 is associated with the thread and stored in the stack, that is, the current identity is set to user1.
- Do a ctx1.close() call.
- Do a ctx1.lookup() call. The current identity is user1.
- Create a Context (with username and credential) called ctx2 for user2. In the process of creating the context, user2 is associated with the thread and stored in the stack, that is, the current identity is set to user2.
- Do a ctx2.close() call.
- Do a ctx2.lookup() call. The current identity is user2.