Sunday, February 16, 2014

Local Wikipedia with Solr and Spring Data

Continuing with my little AI / Machine Learning research project... I wanted a decent-sized repo of English text that was not a complete mess, like a large percentage of the data on the internet. I figured I would try Wikipedia, but what to do with about 40GB of XML? How do I work with and query all that data? Based on a recent implementation at work, where we load something like 200 000 000 records into a Solr cache, I figured Solr would be the way to go, so this is an example of my basic implementation.

Required for this example:

Wikipedia download (warning: it is a 9.9GB file that extracts to about 42GB)
Solr
Spring Data (Great Blog / Examples on Spring Data:  Petri Kainulainen's blog)

All the code and unit tests for this post are in my blog GitHub repo.

When setting up Solr from scratch, have a look at Solr's wiki or documentation; their documentation is pretty good. There is also an example of importing Wikipedia here. I started with that and made some minor modifications.

The Solr config needed for this example lives under /conf. The example, and the config files below, use these paths:
Solr home: /Development/Solr
Index / Data: /Development/Data/solr_data/wikipedia
Import File: /Development/Data/enwiki-latest-pages-articles.xml

The full import into Solr took about 48 hours on my old 2011 i5 iMac, and the index on my current setup is about 52GB.

Data Config for the import (data-config.xml):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="/Development/Data/enwiki-latest-pages-articles.xml"
            transformer="RegexTransformer,DateFormatTransformer"
            >
      <field column="id" xpath="/mediawiki/page/id" />
      <field column="title" xpath="/mediawiki/page/title" />
      <field column="revision" xpath="/mediawiki/page/revision/id" />
      <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
      <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
      <field column="text" xpath="/mediawiki/page/revision/text" />
      <!-- HH is the 24-hour field; the 12-hour hh would read 12:xx as 00:xx -->
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
      <!-- Skip redirect pages -->
      <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
    </entity>
  </document>
</dataConfig>
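To make the two field rules in the data config a little more concrete, here is a plain-Java sketch of roughly what they check: whether a page body starts with a redirect marker, and how a dump timestamp is parsed. The class and method names are made up for illustration, and this only approximates the behaviour; it is not how the DataImportHandler transformers are implemented. It also shows why the hour letter in the date pattern matters: with the 12-hour `hh`, a lenient SimpleDateFormat reads 12:30 as 00:30.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Pattern;

/** Hypothetical helper that approximates the two field rules of the data config. */
public class ImportRules {

    // Same expression as the regex attribute on the $skipDoc field.
    private static final Pattern REDIRECT = Pattern.compile("^#REDIRECT .*");

    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    /** True when the page text begins with a redirect marker. */
    public static boolean isRedirect(String text) {
        return text != null && REDIRECT.matcher(text).find();
    }

    /** Parses a timestamp with the given pattern and returns its hour of day (0-23) in UTC. */
    public static int hourOfDay(String pattern, String timestamp) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(UTC);
        Calendar calendar = Calendar.getInstance(UTC, Locale.ROOT);
        try {
            calendar.setTime(format.parse(timestamp));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + timestamp, e);
        }
        return calendar.get(Calendar.HOUR_OF_DAY);
    }

    public static void main(String[] args) {
        System.out.println(isRedirect("#REDIRECT [[Apache Solr]]"));
        System.out.println(isRedirect("Apache Solr is a search platform."));

        String halfPastNoon = "2014-02-16T12:30:00Z";
        // 24-hour pattern versus 12-hour pattern on the same input.
        System.out.println(hourOfDay("yyyy-MM-dd'T'HH:mm:ss'Z'", halfPastNoon));
        System.out.println(hourOfDay("yyyy-MM-dd'T'hh:mm:ss'Z'", halfPastNoon));
    }
}
```
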
Schema (schema.xml):

<?xml version="1.0" ?>
<schema name="wikipediaCore" version="1.1">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="pint" class="solr.IntField"/>
    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="string" indexed="true" stored="true"/>
    <field name="revision" type="pint" indexed="false" stored="false"/>
    <field name="user" type="string" indexed="false" stored="true"/>
    <field name="userId" type="pint" indexed="false" stored="true"/>
    <field name="text" type="text_en" indexed="true" stored="true"/>
    <field name="timestamp" type="date" indexed="false" stored="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>title</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
</schema>
Solr Config (solrconfig.xml):

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>4.6</luceneMatchVersion>
  <lib dir="/Development/Solr/lib" regex="solr-dataimporthandler-.*\.jar" />
  <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
  <dataDir>${solr.wikipedia.data.dir:/Development/Data/solr_data/wikipedia}</dataDir>
  <schemaFactory class="ClassicIndexSchemaFactory"/>
  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.wikipedia.data.dir:}</str>
    </updateLog>
  </updateHandler>
  <requestHandler name="/get" class="solr.RealTimeGetHandler">
    <lst name="defaults">
      <str name="omitHeader">true</str>
    </lst>
  </requestHandler>
  <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />
  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048" />
  </requestDispatcher>
  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
  <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">solrpingquery</str>
    </lst>
    <lst name="defaults">
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
  <admin>
    <defaultQuery>*:*</defaultQuery>
  </admin>
  <unlockOnStartup>true</unlockOnStartup>
</config>
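With the /dataimport handler registered in the Solr config, an import is normally kicked off by requesting that handler with a command parameter such as full-import, and progress can be polled with status. The small sketch below only builds those URLs; the host, port, and core name (localhost:8983, wikipedia) are assumptions about a local setup rather than values from this post, and the helper class is hypothetical.

```java
import java.net.URI;

/** Hypothetical helper that builds DataImportHandler request URLs for a core. */
public class DataImportUrls {

    /** Returns the /dataimport URL for a command such as "full-import" or "status". */
    public static URI dataImport(String coreUrl, String command) {
        // Tolerate a trailing slash on the core URL.
        String base = coreUrl.endsWith("/")
                ? coreUrl.substring(0, coreUrl.length() - 1)
                : coreUrl;
        return URI.create(base + "/dataimport?command=" + command);
    }

    public static void main(String[] args) {
        // Assumed local core URL; adjust to your own Solr instance.
        String coreUrl = "http://localhost:8983/solr/wikipedia";
        System.out.println(dataImport(coreUrl, "full-import"));
        System.out.println(dataImport(coreUrl, "status"));
    }
}
```

Opening the first URL in a browser (or with any HTTP client) starts the import, and the second reports how far along it is.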
The code for this ended up being quite clean. Spring Data Solr provides the SolrCrudRepository interface, and the project adds its own small SolrIndexService interface (net.briandupreez.solr.SolrIndexService). You simply extend / implement these two, wrap that in a single service interface, autowire from a Spring Java context, and you're good to go.

Repository:

package net.briandupreez.solr.wikipedia;

import net.briandupreez.solr.documents.WikipediaDocument;
import org.springframework.data.solr.repository.Query;
import org.springframework.data.solr.repository.SolrCrudRepository;
import org.springframework.stereotype.Repository;

import java.util.Collection;

/**
 * Wikipedia repo.
 * Created by Brian on 2014/01/26.
 */
@Repository
public interface WikipediaDocumentRepository extends SolrCrudRepository<WikipediaDocument, String> {

    // ?0 is replaced with the first method argument.
    @Query("title:*?0*")
    Collection<WikipediaDocument> findByTitleContains(final String title);

    // Prefix match only (no leading wildcard), despite the method name.
    @Query("text:?0*")
    Collection<WikipediaDocument> findByTextContains(final String text);

    @Query("title:*?0* OR text:?0*")
    Collection<WikipediaDocument> findByAllContains(final String text);
}
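To see what those @Query templates turn into, here is a toy sketch of positional placeholder substitution. Spring Data Solr does the real binding (including any escaping) internally, so the QueryTemplates class below is purely hypothetical and only illustrates the query strings that result for a given argument.

```java
/** Toy illustration of positional ?N placeholders; not Spring Data Solr's implementation. */
public class QueryTemplates {

    /** Replaces ?0, ?1, ... in the template with the matching argument, without escaping. */
    public static String bind(String template, String... args) {
        String query = template;
        // Go from the highest index down so "?1" never clobbers part of "?10".
        for (int i = args.length - 1; i >= 0; i--) {
            query = query.replace("?" + i, args[i]);
        }
        return query;
    }

    public static void main(String[] args) {
        System.out.println(bind("title:*?0*", "Solr"));
        System.out.println(bind("text:?0*", "search"));
        System.out.println(bind("title:*?0* OR text:?0*", "lucene"));
    }
}
```

Note the leading wildcard in the title queries: on an index this size, queries that start with `*` tend to be noticeably slower than plain prefix queries.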
IndexService:

package net.briandupreez.solr.wikipedia;

import net.briandupreez.solr.SolrIndexService;
import net.briandupreez.solr.documents.WikipediaDocument;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import javax.annotation.Resource;

/**
 * Wikipedia index
 * Created by Brian on 2014/01/26.
 */
@Service
public class WikipediaIndexService implements SolrIndexService<WikipediaDocument, String> {

    private transient final Log logger = LogFactory.getLog(this.getClass());

    @Resource
    private WikipediaDocumentRepository repository;

    @Transactional
    @Override
    public WikipediaDocument add(final WikipediaDocument entry) {
        final WikipediaDocument saved = repository.save(entry);
        logger.debug("Saved: " + saved);
        return saved;
    }

    @Transactional
    @Override
    public void delete(final String id) {
        repository.delete(id);
        logger.debug("Deleted ID: " + id);
    }
}
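The SolrIndexService interface implemented above lives in the GitHub repo and is not listed in this post. Judging only from the two overridden methods, it probably looks something like the sketch below; the generic parameter names and the in-memory implementation are my own assumptions, added so the snippet runs without a Solr server (for example as a stand-in in unit tests).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of an index-service contract plus an in-memory stand-in (both hypothetical). */
public class IndexServiceSketch {

    /** Assumed shape of SolrIndexService, inferred from the add/delete overrides. */
    public interface SolrIndexService<T, ID> {
        T add(T entry);

        void delete(ID id);
    }

    /** Keeps entries in a map instead of sending them to Solr. */
    public static class InMemoryIndexService<T, ID> implements SolrIndexService<T, ID> {

        private final Map<ID, T> store = new HashMap<>();
        private final Function<T, ID> idOf;

        /** idOf extracts the unique key from an entry. */
        public InMemoryIndexService(Function<T, ID> idOf) {
            this.idOf = idOf;
        }

        @Override
        public T add(T entry) {
            store.put(idOf.apply(entry), entry);
            return entry;
        }

        @Override
        public void delete(ID id) {
            store.remove(id);
        }

        public int size() {
            return store.size();
        }
    }

    public static void main(String[] args) {
        // Use the lower-cased title as the id, purely for demonstration.
        InMemoryIndexService<String, String> service =
                new InMemoryIndexService<>(String::toLowerCase);
        service.add("Solr");
        System.out.println(service.size());
        service.delete("solr");
        System.out.println(service.size());
    }
}
```
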
SolrService:


package net.briandupreez.solr.wikipedia;

import net.briandupreez.solr.documents.WikipediaDocument;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import javax.annotation.Resource;
import java.util.Collection;
import java.util.Date;

/**
 * Solr Service.
 * Created by Brian on 2014/01/26.
 */
@Service
public class WikipediaSolrServiceImpl implements WikipediaSolrService {

    private transient final Log logger = LogFactory.getLog(this.getClass());

    @Resource
    private WikipediaIndexService indexService;

    @Resource
    private WikipediaDocumentRepository repository;

    @Transactional
    @Override
    public WikipediaDocument add(final String id, final String title, final String user, final String userId, final String text, final Date timestamp) {
        final WikipediaDocument wikipediaDocument = new WikipediaDocument();
        wikipediaDocument.setId(id);
        wikipediaDocument.setTitle(title);
        wikipediaDocument.setText(text);
        wikipediaDocument.setUserId(userId);
        wikipediaDocument.setUser(user);
        wikipediaDocument.setTimestamp(timestamp);
        wikipediaDocument.setAll(wikipediaDocument.toString());
        return indexService.add(wikipediaDocument);
    }

    @Transactional
    @Override
    public void deleteById(final String id) {
        indexService.delete(id);
    }

    @Transactional(readOnly = true)
    @Override
    public WikipediaDocument findById(final String id) {
        final WikipediaDocument wikipediaDocument = repository.findOne(id);
        logger.debug("FOUND: " + wikipediaDocument);
        return wikipediaDocument;
    }

    @Transactional(readOnly = true)
    @Override
    public Collection<WikipediaDocument> findByTitleContains(final String title) {
        return repository.findByTitleContains(title);
    }

    @Transactional(readOnly = true)
    @Override
    public Collection<WikipediaDocument> findByTextContains(final String text) {
        return repository.findByTextContains(text);
    }

    // Read-only, like the other finders.
    @Transactional(readOnly = true)
    @Override
    public Collection<WikipediaDocument> findByAllContains(final String text) {
        return repository.findByAllContains(text);
    }
}
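The WikipediaDocument class that the service fills in is also in the GitHub repo rather than in this post. As a rough sketch inferred only from the setters called above, it is a simple bean along these lines. The class name WikipediaDocumentSketch, the getters, and the toString format are my own; the real class presumably also carries the SolrJ field-mapping annotations, which are left out here so the snippet compiles with just the JDK.

```java
import java.util.Date;

/** Annotation-free sketch of the document bean; field names follow the setters used by the service. */
public class WikipediaDocumentSketch {

    private String id;
    private String title;
    private String user;
    private String userId;
    private String text;
    private Date timestamp;
    private String all;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public void setUser(String user) { this.user = user; }
    public void setUserId(String userId) { this.userId = userId; }
    public void setText(String text) { this.text = text; }

    // Copy the date so later changes by the caller do not leak into the document.
    public void setTimestamp(Date timestamp) {
        this.timestamp = timestamp == null ? null : new Date(timestamp.getTime());
    }

    public String getAll() { return all; }
    public void setAll(String all) { this.all = all; }

    @Override
    public String toString() {
        return "WikipediaDocument{id='" + id + "', title='" + title + "', user='" + user
                + "', userId='" + userId + "', timestamp=" + timestamp + "}";
    }

    public static void main(String[] args) {
        WikipediaDocumentSketch document = new WikipediaDocumentSketch();
        document.setId("12");
        document.setTitle("Anarchism");
        // Mirrors the service: the catch-all field is filled from toString().
        document.setAll(document.toString());
        System.out.println(document.getAll());
    }
}
```
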
SpringContext:

package net.briandupreez.solr;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.PropertySource;
import org.springframework.core.env.Environment;
import org.springframework.data.solr.core.SolrTemplate;
import org.springframework.data.solr.repository.config.EnableSolrRepositories;
import org.springframework.data.solr.server.support.HttpSolrServerFactoryBean;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.jta.JtaTransactionManager;

import javax.annotation.Resource;

/**
 * Solr Context
 * Created by Brian on 2014/01/26.
 */
@Configuration
@EnableSolrRepositories(basePackages = "net.briandupreez.solr.wikipedia")
@ComponentScan(basePackages = "net.briandupreez.solr")
@PropertySource("classpath:solr.properties")
public class SolrContext {

    @Resource
    private Environment environment;

    /**
     * Solr Factory bean
     * @return factory bean
     */
    @Bean
    public HttpSolrServerFactoryBean solrServerFactoryBean() {
        final HttpSolrServerFactoryBean factory = new HttpSolrServerFactoryBean();
        factory.setUrl(environment.getRequiredProperty("solr.server.url.wiki"));
        return factory;
    }

    /**
     * The Solr Template... used in WikipediaDocumentRepository.
     * @return created template
     * @throws Exception error.
     */
    @Bean
    public SolrTemplate solrTemplate() throws Exception {
        return new SolrTemplate(solrServerFactoryBean().getObject());
    }

    @Bean
    public PlatformTransactionManager transactionManager() throws Exception {
        return new JtaTransactionManager();
    }
}
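The context reads solr.server.url.wiki from solr.properties on the classpath, and that file is not shown here. The sketch below shows what a minimal version might contain and mimics the fail-fast lookup with plain java.util.Properties. The URL value is an assumption about a local Solr core, and the helper class is hypothetical; in the real context Spring's Environment does this work.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical stand-in showing the one key the Spring context expects. */
public class SolrPropertiesSketch {

    // Assumed content of solr.properties; only the key name comes from the context class.
    static final String SAMPLE = "solr.server.url.wiki=http://localhost:8983/solr/wikipedia\n";

    /** Parses properties-file content from a string. */
    public static Properties load(String content) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** Fails fast when the key is missing, similar in spirit to a required-property lookup. */
    public static String requiredProperty(Properties properties, String key) {
        String value = properties.getProperty(key);
        if (value == null) {
            throw new IllegalStateException("Required key '" + key + "' not found");
        }
        return value;
    }

    public static void main(String[] args) {
        Properties properties = load(SAMPLE);
        System.out.println(requiredProperty(properties, "solr.server.url.wiki"));
    }
}
```
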
The next thing for me to look at for sourcing data is Spring Social.
