Continuing with Programming Collective Intelligence (PCI), the next exercise was using distance scores to pigeonhole a list of blogs based on the words used in each blog.
I had already settled on Encog as the framework for the AI / machine learning algorithms; for this exercise I also needed an RSS reader and an HTML parser.
The two libraries I ended up using were:
ROME
JSoup
For other general utilities and collection manipulation I used:
Google Guava
I kept the list of blogs short and included some of the software bloggers I follow, just to make testing quick. I had to alter the percentages a little from the implementation in PCI, but still got the desired result.
Blogs Used:
http://blog.guykawasaki.com/index.rdf
http://blog.outer-court.com/rss.xml
http://flagrantdisregard.com/index.php/feed/
http://gizmodo.com/index.xml
http://googleblog.blogspot.com/rss.xml
http://radar.oreilly.com/index.rdf
http://www.wired.com/rss/index.xml
http://feeds.feedburner.com/codinghorror
http://feeds.feedburner.com/joelonsoftware
http://martinfowler.com/feed.atom
http://www.briandupreez.net/feeds/posts/default
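The percentages mentioned above refer to the word filter in PCI, which keeps only words that appear in more than a lower fraction and fewer than an upper fraction of the blogs (10% and 50% in the book, as far as I recall). Below is a minimal sketch of that idea using only the standard library. The class name, method names, sample texts and the 0.3 / 0.8 bounds are all illustrative assumptions, not the values or code used in this post, and the input is assumed to be text that has already had its HTML stripped.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: not the reader class from this post.
class WordFilterSketch {

    /** Counts lowercase alphabetic words in one blog's (already HTML-stripped) text. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Keeps words found in more than minFraction and fewer than maxFraction of the blogs. */
    static Set<String> selectWords(Collection<Map<String, Integer>> blogs,
                                   double minFraction, double maxFraction) {
        // For each word, count how many blogs contain it at least once.
        Map<String, Integer> blogsContaining = new HashMap<>();
        for (Map<String, Integer> counts : blogs) {
            for (String word : counts.keySet()) {
                blogsContaining.merge(word, 1, Integer::sum);
            }
        }
        Set<String> selected = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : blogsContaining.entrySet()) {
            double fraction = (double) entry.getValue() / blogs.size();
            if (fraction > minFraction && fraction < maxFraction) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up sample texts standing in for four blogs.
        List<Map<String, Integer>> blogs = Arrays.asList(
                countWords("the java code"),
                countWords("the gadget news"),
                countWords("the java refactoring"),
                countWords("the search news"));
        // Assumed bounds: "the" is too common, single-blog words are too rare.
        System.out.println(selectWords(blogs, 0.3, 0.8)); // prints [java, news]
    }
}
```

With a short list of blogs, each blog is a large share of the total, so bounds tuned for a long list can easily filter out almost every word; that is one plausible reason the percentages need adjusting.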
For the implementation I just went with a main class and a reader class:
Main:
The Results:
*** Cluster 1 ***
[http://www.briandupreez.net/feeds/posts/default]
*** Cluster 2 ***
[http://blog.guykawasaki.com/index.rdf]
[http://radar.oreilly.com/index.rdf]
[http://googleblog.blogspot.com/rss.xml]
[http://blog.outer-court.com/rss.xml]
[http://gizmodo.com/index.xml]
[http://flagrantdisregard.com/index.php/feed/]
[http://www.wired.com/rss/index.xml]
*** Cluster 3 ***
[http://feeds.feedburner.com/joelonsoftware]
[http://feeds.feedburner.com/codinghorror]
[http://martinfowler.com/feed.atom]
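Clusters like these come from a distance score between the blogs' word-count vectors. PCI uses one minus the Pearson correlation for this chapter; the sketch below shows that score in plain Java. It is an illustration only: the class and method names are assumptions, it is not Encog's API, and the post does not show which distance measure the actual implementation relied on.

```java
// Hypothetical sketch of a Pearson-based distance between two word-count vectors.
class PearsonDistanceSketch {

    /** Returns 1 - Pearson correlation: 0 for perfectly correlated vectors, 2 for opposite ones. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        int n = a.length;
        double sumA = 0, sumB = 0, sumASq = 0, sumBSq = 0, sumAB = 0;
        for (int i = 0; i < n; i++) {
            sumA += a[i];
            sumB += b[i];
            sumASq += a[i] * a[i];
            sumBSq += b[i] * b[i];
            sumAB += a[i] * b[i];
        }
        double numerator = sumAB - (sumA * sumB / n);
        double denominator = Math.sqrt((sumASq - sumA * sumA / n) * (sumBSq - sumB * sumB / n));
        // A constant vector has no defined correlation; this sketch treats it as uncorrelated.
        if (!(denominator > 0)) {
            return 1.0;
        }
        return 1.0 - numerator / denominator;
    }

    public static void main(String[] args) {
        double[] blogA = {1, 2, 3};
        double[] blogB = {2, 4, 6}; // same word proportions as blogA
        double[] blogC = {3, 2, 1}; // opposite emphasis
        System.out.println(distance(blogA, blogB));
        System.out.println(distance(blogA, blogC));
    }
}
```

Because Pearson correlation ignores overall scale, a blog that simply posts more text is not pushed away from a shorter blog with the same word proportions.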
Sunday, June 16, 2013
Wednesday, June 12, 2013
Regex POSIX expressions
I can't believe I only found out about these today; I obviously don't use regular expressions enough.
Posix Brackets
Quick Reference:
POSIX | Description | ASCII | Unicode | Shorthand | Java |
---|---|---|---|---|---|
[:alnum:] | Alphanumeric characters | [a-zA-Z0-9] | [\p{L&}\p{Nd}] | | \p{Alnum} |
[:alpha:] | Alphabetic characters | [a-zA-Z] | \p{L&} | | \p{Alpha} |
[:ascii:] | ASCII characters | [\x00-\x7F] | \p{InBasicLatin} | | \p{ASCII} |
[:blank:] | Space and tab | [ \t] | [\p{Zs}\t] | | \p{Blank} |
[:cntrl:] | Control characters | [\x00-\x1F\x7F] | \p{Cc} | | \p{Cntrl} |
[:digit:] | Digits | [0-9] | \p{Nd} | \d | \p{Digit} |
[:graph:] | Visible characters (i.e. anything except spaces, control characters, etc.) | [\x21-\x7E] | [^\p{Z}\p{C}] | | \p{Graph} |
[:lower:] | Lowercase letters | [a-z] | \p{Ll} | | \p{Lower} |
[:print:] | Visible characters and spaces (i.e. anything except control characters, etc.) | [\x20-\x7E] | \P{C} | | \p{Print} |
[:punct:] | Punctuation and symbols | [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{\|}~] | [\p{P}\p{S}] | | \p{Punct} |
[:space:] | All whitespace characters, including line breaks | [ \t\r\n\v\f] | [\p{Z}\t\r\n\v\f] | \s | \p{Space} |
[:upper:] | Uppercase letters | [A-Z] | \p{Lu} | | \p{Upper} |
[:word:] | Word characters (letters, numbers and underscores) | [A-Za-z0-9_] | [\p{L}\p{N}\p{Pc}] | \w | |
[:xdigit:] | Hexadecimal digits | [A-Fa-f0-9] | [A-Fa-f0-9] | | \p{XDigit} |
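To try the Java column from the table, here is a small sketch using `java.util.regex`. Java does not accept the bracketed `[:alnum:]` spelling, so the `\p{...}` forms are used instead. The class and method names are illustrative assumptions. One detail worth knowing: by default these classes match ASCII only, and the `Pattern.UNICODE_CHARACTER_CLASS` flag switches them to their Unicode equivalents.

```java
import java.util.regex.Pattern;

// Hypothetical helper class demonstrating Java's POSIX character classes.
class PosixClassSketch {

    /** True if the whole string consists of ASCII letters and digits. */
    static boolean isAlnum(String s) {
        return Pattern.matches("\\p{Alnum}+", s);
    }

    /** Removes ASCII punctuation and symbol characters. */
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    /** Collapses runs of spaces and tabs into a single space. */
    static String collapseBlank(String s) {
        return s.replaceAll("\\p{Blank}+", " ");
    }

    /** True if the whole string consists of hexadecimal digits. */
    static boolean isHex(String s) {
        return Pattern.matches("\\p{XDigit}+", s);
    }

    /** Matches letters only; the flag decides between ASCII-only and Unicode behaviour. */
    static boolean isAlpha(String s, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\p{Alpha}+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAlnum("abc123"));              // true
        System.out.println(isAlnum("abc 123"));             // false (space is not alphanumeric)
        System.out.println(stripPunct("Hello, world!"));    // Hello world
        System.out.println(collapseBlank("a \t b"));        // a b
        System.out.println(isHex("CAFE01"));                // true
        System.out.println(isAlpha("caf\u00e9", false));    // false (ASCII-only by default)
        System.out.println(isAlpha("caf\u00e9", true));     // true
    }
}
```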