Required for this example:
Wikipedia download (warning: it is a 9.9 GB file that extracts to about 42 GB)
Solr
Spring Data (for a great blog with examples on Spring Data, see Petri Kainulainen's blog)
All the code and unit tests for this post are in my blog GitHub repo.
When setting up Solr from scratch, have a look at Solr's wiki or documentation, which is pretty good. There is also an example of importing Wikipedia here; I started with that and made some minor modifications.
For this specific example the Solr config needed (
For this example (and in the config files below), I used the following paths:
Solr home: /Development/Solr
Index / Data: /Development/Data/solr_data/wikipedia
Import File: /Development/Data/enwiki-latest-pages-articles.xml
The full import into Solr took about 48 hours on my old 2011 i5 iMac, and the index on my current setup is about 52 GB.
Data Config for the import:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="/Development/Data/enwiki-latest-pages-articles.xml"
            transformer="RegexTransformer,DateFormatTransformer">
      <field column="id" xpath="/mediawiki/page/id" />
      <field column="title" xpath="/mediawiki/page/title" />
      <field column="revision" xpath="/mediawiki/page/revision/id" />
      <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
      <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
      <field column="text" xpath="/mediawiki/page/revision/text" />
      <!-- HH = hour of day (0-23), matching the dump's 24-hour UTC timestamps -->
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
      <!-- Skip pages whose text is only a redirect -->
      <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
    </entity>
  </document>
</dataConfig>
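To make the `$skipDoc` rule a little more concrete, here is a minimal sketch in plain Java of what that field amounts to: the regex is applied to the page text, a match is replaced with `true`, and the result is read as a boolean. This only approximates what the RegexTransformer does and is not its actual code; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the $skipDoc rule from the data config.
final class RedirectSkipSketch {

    // Same expression as the regex attribute in the config (case-sensitive).
    private static final Pattern REDIRECT = Pattern.compile("^#REDIRECT .*");

    // Replaces a match with "true" and reads the result as a boolean flag.
    static boolean wouldSkip(String text) {
        String replaced = REDIRECT.matcher(text).replaceAll("true");
        return Boolean.parseBoolean(replaced);
    }

    public static void main(String[] args) {
        System.out.println(wouldSkip("#REDIRECT [[Apache Solr]]"));
        System.out.println(wouldSkip("Apache Solr is a search platform."));
    }
}
```

Note that the expression is case-sensitive, so in this sketch a page starting with a lowercase `#redirect` would not be flagged.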
Schema:
<?xml version="1.0" ?>
<schema name="wikipediaCore" version="1.1">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="pint" class="solr.IntField"/>
    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="string" indexed="true" stored="true"/>
    <field name="revision" type="pint" indexed="false" stored="false"/>
    <field name="user" type="string" indexed="false" stored="true"/>
    <field name="userId" type="pint" indexed="false" stored="true"/>
    <field name="text" type="text_en" indexed="true" stored="true"/>
    <field name="timestamp" type="date" indexed="false" stored="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>title</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
</schema>
Solr config:
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>4.6</luceneMatchVersion>
  <lib dir="/Development/Solr/lib" regex="solr-dataimporthandler-.*\.jar" />
  <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
  <dataDir>${solr.wikipedia.data.dir:/Development/Data/solr_data/wikipedia}</dataDir>
  <schemaFactory class="ClassicIndexSchemaFactory"/>
  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.wikipedia.data.dir:}</str>
    </updateLog>
  </updateHandler>
  <requestHandler name="/get" class="solr.RealTimeGetHandler">
    <lst name="defaults">
      <str name="omitHeader">true</str>
    </lst>
  </requestHandler>
  <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />
  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048" />
  </requestDispatcher>
  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
  <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">solrpingquery</str>
    </lst>
    <lst name="defaults">
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
  <admin>
    <defaultQuery>*:*</defaultQuery>
  </admin>
  <unlockOnStartup>true</unlockOnStartup>
</config>
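With the `/dataimport` handler registered, the import is driven over HTTP using the Data Import Handler's `full-import` and `status` commands. The sketch below only builds the request URLs so it can be run without a Solr instance; the base URL `http://localhost:8983/solr` and the core name `wikipedia` are assumptions about the local setup, and the class name is made up.

```java
import java.net.URI;

// Hypothetical helper that builds Data Import Handler request URLs.
final class DataImportUrlSketch {

    // Builds <base>/<core>/dataimport?command=<command>.
    static URI commandUri(String solrBaseUrl, String core, String command) {
        return URI.create(solrBaseUrl + "/" + core + "/dataimport?command=" + command);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr"; // assumption: default local Solr
        String core = "wikipedia";                  // assumption: core name

        // Opening the first URL starts the import; the second reports progress.
        System.out.println(commandUri(base, core, "full-import"));
        System.out.println(commandUri(base, core, "status"));
    }
}
```

For an import that runs for many hours, polling the `status` URL now and then is probably the easiest way to see how far it has got.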
Repository:
package net.briandupreez.solr.wikipedia;

import net.briandupreez.solr.documents.WikipediaDocument;
import org.springframework.data.solr.repository.Query;
import org.springframework.data.solr.repository.SolrCrudRepository;
import org.springframework.stereotype.Repository;

import java.util.Collection;

/**
 * Wikipedia repo.
 * Created by Brian on 2014/01/26.
 */
@Repository
public interface WikipediaDocumentRepository extends SolrCrudRepository<WikipediaDocument, String> {

    // Wildcards on both sides: titles that contain the argument.
    @Query("title:*?0*")
    Collection<WikipediaDocument> findByTitleContains(final String title);

    // Trailing wildcard only: a prefix match on the text field.
    @Query("text:?0*")
    Collection<WikipediaDocument> findByTextContains(final String text);

    // Either of the two queries above.
    @Query("title:*?0* OR text:?0*")
    Collection<WikipediaDocument> findByAllContains(final String text);
}
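The `?0` in each `@Query` is a positional placeholder for the first method argument. As a rough illustration of the query strings that end up being sent to Solr, here is a naive substitution in plain Java. It is only a sketch: Spring Data Solr does its own parameter binding, and the class and method names here are made up.

```java
// Hypothetical illustration of positional placeholder expansion.
final class QueryPlaceholderSketch {

    // Replaces ?0, ?1, ... with the corresponding arguments (no escaping).
    static String expand(String template, String... args) {
        String result = template;
        for (int i = 0; i < args.length; i++) {
            result = result.replace("?" + i, args[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("title:*?0*", "solr"));
        System.out.println(expand("title:*?0* OR text:?0*", "solr"));
    }
}
```

Seen this way, the difference between the two fields is clear: the title query has wildcards on both sides, while the text query only has a trailing wildcard, so it matches on a prefix.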
Index service:
package net.briandupreez.solr.wikipedia;

import net.briandupreez.solr.SolrIndexService;
import net.briandupreez.solr.documents.WikipediaDocument;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import javax.annotation.Resource;

/**
 * Wikipedia index
 * Created by Brian on 2014/01/26.
 */
@Service
public class WikipediaIndexService implements SolrIndexService<WikipediaDocument, String> {

    private transient final Log logger = LogFactory.getLog(this.getClass());

    @Resource
    private WikipediaDocumentRepository repository;

    @Transactional
    @Override
    public WikipediaDocument add(final WikipediaDocument entry) {
        final WikipediaDocument saved = repository.save(entry);
        logger.debug("Saved: " + saved);
        return saved;
    }

    @Transactional
    @Override
    public void delete(final String id) {
        repository.delete(id);
        logger.debug("Deleted ID: " + id);
    }
}
Solr service:
package net.briandupreez.solr.wikipedia;

import net.briandupreez.solr.documents.WikipediaDocument;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import javax.annotation.Resource;
import java.util.Collection;
import java.util.Date;

/**
 * Solr Service.
 * Created by Brian on 2014/01/26.
 */
@Service
public class WikipediaSolrServiceImpl implements WikipediaSolrService {

    private transient final Log logger = LogFactory.getLog(this.getClass());

    @Resource
    private WikipediaIndexService indexService;

    @Resource
    private WikipediaDocumentRepository repository;

    @Transactional
    @Override
    public WikipediaDocument add(final String id, final String title, final String user, final String userId, final String text, final Date timestamp) {
        final WikipediaDocument wikipediaDocument = new WikipediaDocument();
        wikipediaDocument.setId(id);
        wikipediaDocument.setTitle(title);
        wikipediaDocument.setText(text);
        wikipediaDocument.setUserId(userId);
        wikipediaDocument.setUser(user);
        wikipediaDocument.setTimestamp(timestamp);
        wikipediaDocument.setAll(wikipediaDocument.toString());
        return indexService.add(wikipediaDocument);
    }

    @Transactional
    @Override
    public void deleteById(final String id) {
        indexService.delete(id);
    }

    @Transactional(readOnly = true)
    @Override
    public WikipediaDocument findById(final String id) {
        final WikipediaDocument wikipediaDocument = repository.findOne(id);
        logger.debug("FOUND: " + wikipediaDocument);
        return wikipediaDocument;
    }

    @Transactional(readOnly = true)
    @Override
    public Collection<WikipediaDocument> findByTitleContains(final String title) {
        return repository.findByTitleContains(title);
    }

    @Transactional(readOnly = true)
    @Override
    public Collection<WikipediaDocument> findByTextContains(final String text) {
        return repository.findByTextContains(text);
    }

    // Read-only like the other finders.
    @Transactional(readOnly = true)
    @Override
    public Collection<WikipediaDocument> findByAllContains(final String text) {
        return repository.findByAllContains(text);
    }
}
Solr context:
package net.briandupreez.solr;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.PropertySource;
import org.springframework.core.env.Environment;
import org.springframework.data.solr.core.SolrTemplate;
import org.springframework.data.solr.repository.config.EnableSolrRepositories;
import org.springframework.data.solr.server.support.HttpSolrServerFactoryBean;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.jta.JtaTransactionManager;

import javax.annotation.Resource;

/**
 * Solr Context
 * Created by Brian on 2014/01/26.
 */
@Configuration
@EnableSolrRepositories(basePackages = "net.briandupreez.solr.wikipedia")
@ComponentScan(basePackages = "net.briandupreez.solr")
@PropertySource("classpath:solr.properties")
public class SolrContext {

    @Resource
    private Environment environment;

    /**
     * Solr Factory bean
     * @return factory bean
     */
    @Bean
    public HttpSolrServerFactoryBean solrServerFactoryBean() {
        final HttpSolrServerFactoryBean factory = new HttpSolrServerFactoryBean();
        factory.setUrl(environment.getRequiredProperty("solr.server.url.wiki"));
        return factory;
    }

    /**
     * The Solr Template... used in WikipediaDocumentRepository.
     * @return created template
     * @throws Exception error.
     */
    @Bean
    public SolrTemplate solrTemplate() throws Exception {
        return new SolrTemplate(solrServerFactoryBean().getObject());
    }

    @Bean
    public PlatformTransactionManager transactionManager() throws Exception {
        return new JtaTransactionManager();
    }
}
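The context above expects a `solr.properties` file on the classpath that defines `solr.server.url.wiki`, and `getRequiredProperty` throws an `IllegalStateException` when the key is missing. The sketch below mimics that lookup with plain `java.util.Properties` so it runs without Spring. The URL value is an assumption about a local Solr core named `wikipedia`, and the class and method names are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical stand-in for the solr.properties lookup done in SolrContext.
final class SolrPropertiesSketch {

    // Assumed file content; only the key name comes from SolrContext.
    static final String EXAMPLE_FILE =
            "solr.server.url.wiki=http://localhost:8983/solr/wikipedia\n";

    // Parses properties-file content held in a string.
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new IllegalStateException("Could not parse properties", e);
        }
        return props;
    }

    // Fails fast when the key is missing, like Environment.getRequiredProperty.
    static String requiredProperty(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalStateException("Required key '" + key + "' not found");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(requiredProperty(parse(EXAMPLE_FILE), "solr.server.url.wiki"));
    }
}
```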