Chapter 6 of Programming Collective Intelligence (PCI) demonstrates how to classify documents based on their content.
I used one extra open source Java library for this chapter, and its implementation was completely painless.
What a pleasure: a simple Maven include, and that's it, you have a little file-based or in-memory SQL database in your code.
My full Java implementation of some of the topics is available on my GitHub repo, but here I will highlight the Fisher method. (Fisher's discriminant analysis, or LDA, is a separate technique by the same statistician, worth a look if you want to get a lot more technical.)
What has made PCI a good book is its ability to boil quite complex theoretical and mathematical concepts down to basics and code that we lowly developers can use in practice.
To quote:
"the Fisher method calculates the probability of a category for each feature of the document, then combines the probabilities and tests to see if the set of probabilities is more or less likely than a random set. This method also returns a probability for each category that can be compared to others."
During the writing of this post, I discovered the following blog:
Shape of data
It seems well worth the read; I will be spending the next couple of days on it before continuing with PCI chapter 7: Decision Trees.
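To make the quoted description of the Fisher method a little more concrete, here is a minimal sketch of just the combining step in Java. It assumes the per-feature category probabilities have already been calculated and weighted so that each lies strictly between 0 and 1. The class name FisherSketch, the method names, and the sample numbers are all made up for illustration and are not taken from the book or from the GitHub repo.

```java
class FisherSketch {

    // Inverse chi-square function: the probability that a chi-square variable
    // with an even number of degrees of freedom `df` is at least `chi`.
    static double invChi2(double chi, int df) {
        double m = chi / 2.0;
        double term = Math.exp(-m);
        double sum = term;
        for (int i = 1; i < df / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    // Combines per-feature probabilities for one category into a single score.
    // Assumption: every probability is in (0, 1], e.g. after weighting.
    static double fisherProb(double[] featureProbs) {
        double logSum = 0.0;
        for (double p : featureProbs) {
            if (p <= 0.0 || p > 1.0) {
                throw new IllegalArgumentException("probability must be in (0, 1]: " + p);
            }
            // Summing logs is equivalent to taking the log of the product.
            logSum += Math.log(p);
        }
        double score = -2.0 * logSum;
        // Two degrees of freedom per combined probability.
        return invChi2(score, featureProbs.length * 2);
    }

    public static void main(String[] args) {
        // Hypothetical feature probabilities for two categories of one document.
        double[] categoryA = {0.9, 0.8, 0.95};
        double[] categoryB = {0.1, 0.2, 0.05};
        System.out.println("category A: " + fisherProb(categoryA));
        System.out.println("category B: " + fisherProb(categoryB));
    }
}
```

Summing the logarithms rather than multiplying the raw probabilities is a deliberate choice in this sketch: with many features the product can underflow to zero, while the sum of logs stays well behaved. The scores returned for each category can then be compared against each other, as the quote describes.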