<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>William John Bert</title>
	<atom:link href="http://williamjohnbert.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://williamjohnbert.com</link>
	<description>Likes house music and sometimes being alone.</description>
	<lastBuildDate>Thu, 17 May 2012 00:39:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>(Relatively) quick and easy Gensim example code</title>
		<link>http://williamjohnbert.com/2012/05/relatively-quick-and-easy-gensim-example-code/</link>
		<comments>http://williamjohnbert.com/2012/05/relatively-quick-and-easy-gensim-example-code/#comments</comments>
		<pubDate>Fri, 04 May 2012 12:12:23 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Cool Stuff]]></category>
		<category><![CDATA[Interests]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[example-code]]></category>
		<category><![CDATA[gensim]]></category>
		<category><![CDATA[lsa]]></category>
		<category><![CDATA[lsi]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=480</guid>
		<description><![CDATA[Here&#8217;s some sample code that shows the basic steps necessary to use gensim to create a corpus, train models (log entropy and latent semantic analysis), and perform semantic similarity comparisons and queries. gensim has an excellent tutorial, and this does not replace reading and understanding it. Nonetheless, this may be helpful for those interested in [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s some sample code that shows the basic steps necessary to use gensim to create a corpus, train models (log entropy and latent semantic analysis), and perform semantic similarity comparisons and queries.</p>

<p><a href="http://radimrehurek.com/gensim/">gensim</a> has an excellent tutorial, and this does not replace reading and understanding it. Nonetheless, this may be helpful for those interested in doing some quick experimentation and getting their hands dirty fast. It takes you from training corpus to index and queries in about 100 lines of code, much of which is documentation.</p>

<p>Note that this code <strong>will not work out of the box</strong>. To train the models, you need to provide your own background corpus (a collection of documents, where a document can range from one sentence up to multiple pages of text). Choosing a good corpus is an art; generally, you want tens of thousands of documents that are representative of your problem domain. Like the gensim tutorial, this code also shows how to build a corpus from Wikipedia for experimentation, though note that doing so requires a lot of computing time. You could potentially <a href="http://williamjohnbert.com/2012/03/how-to-install-accelerated-blas-into-a-python-virtualenv/">save hours by installing accelerated BLAS on your system</a>.</p>
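<p>To make "corpus" concrete before the gensim code, here is a minimal pure-Python sketch of the idea: a corpus is anything that yields one bag-of-words vector per document when iterated. The names <code>tokenize</code> and <code>TinyCorpus</code> are hypothetical stand-ins for illustration, not part of gensim, and the tokenizer is far cruder than gensim's.</p>

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


class TinyCorpus(object):
    """Hypothetical toy corpus: streams one bag-of-words vector per document."""

    def __init__(self, documents):
        self.documents = documents
        # Map each distinct token to an integer id, in order of first appearance.
        self.token2id = {}
        for doc in documents:
            for token in tokenize(doc):
                self.token2id.setdefault(token, len(self.token2id))

    def __iter__(self):
        # Yield sorted (token_id, count) pairs, one document at a time.
        for doc in self.documents:
            counts = Counter(self.token2id[t] for t in tokenize(doc))
            yield sorted(counts.items())


docs = ["A bear walked in the dark forest.",
        "The forest was dark, dark, dark."]
corpus = TinyCorpus(docs)
bow_vectors = list(corpus)
print(bow_vectors[1])
```

<p>Each document becomes a sparse list of (token id, count) pairs, which is the same shape of data that gensim's dictionary and corpus classes produce at a much larger scale.</p>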

<div class="highlight-wrapper python">
<pre class="raw"><code lang="python">
import logging, sys, pprint

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

### Generating a training/background corpus from your own source of documents
from gensim.corpora import TextCorpus, MmCorpus, Dictionary

# gensim docs: "Provide a filename or a file-like object as input and TextCorpus will be initialized with a
# dictionary in `self.dictionary` and will support the `iter` corpus method. For other kinds of corpora, you only
# need to override `get_texts` and provide your own implementation."
background_corpus = TextCorpus(input=YOUR_CORPUS)

# Important -- save the dictionary generated by the corpus, or future operations will not be able to map results
# back to original words.
background_corpus.dictionary.save(
    "my_dict.dict")

MmCorpus.serialize("background_corpus.mm",
    background_corpus)  #  Persists the background corpus in Matrix Market format. File may be several GBs for a large corpus.

### Generating a large training/background corpus using Wikipedia
from gensim.corpora import WikiCorpus, wikicorpus

articles = "enwiki-latest-pages-articles.xml.bz2"  # available from http://en.wikipedia.org/wiki/Wikipedia:Database_download

# This will take many hours! Output is Wikipedia in bag-of-words (BOW) sparse matrix.
wiki_corpus = WikiCorpus(articles)
wiki_corpus.dictionary.save("wiki_dict.dict")

MmCorpus.serialize("wiki_corpus.mm", wiki_corpus)  #  File will be several GBs.

### Working with persisted corpus and dictionary
bow_corpus = MmCorpus("wiki_corpus.mm")  # Revive a corpus

dictionary = Dictionary.load("wiki_dict.dict")  # Load a dictionary

### Transformations among vector spaces
from gensim.models import LsiModel, LogEntropyModel

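# Train on the persisted bow_corpus; iterating over wiki_corpus again would re-parse the whole Wikipedia dump.
logent_transformation = LogEntropyModel(bow_corpus,
    id2word=dictionary)  # Log Entropy weights frequencies of all document features in the corpus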

tokenize_func = wikicorpus.tokenize  # The tokenizer used to create the Wikipedia corpus
document = "Some text to be transformed."
# First, tokenize document using the same tokenization as was used on the background corpus, and then convert it to
# BOW representation using the dictionary created when generating the background corpus.
bow_document = dictionary.doc2bow(tokenize_func(
    document))
# Converts a single document to log entropy representation. document must be in the same vector space as corpus.
logent_document = logent_transformation[bow_document]

# Transform arbitrary documents by getting them into the same BOW vector space created by your training corpus
documents = ["Some iterable", "containing multiple", "documents", "..."]
bow_documents = (dictionary.doc2bow(
    tokenize_func(document)) for document in documents)  # use a generator expression because...
logent_documents = logent_transformation[
                   bow_documents]  # ...transformation is done during iteration of documents using generators, so this uses constant memory

### Chained transformations
# This builds a new corpus from iterating over documents of bow_corpus as transformed to log entropy representation.
# Will also take many hours if bow_corpus is the Wikipedia corpus created above.
# MmCorpus reads from a file, so serialize the transformed corpus to disk first and then load it back.
MmCorpus.serialize("logent_corpus.mm", logent_transformation[bow_corpus])
logent_corpus = MmCorpus("logent_corpus.mm")

# Creates LSI transformation model from log entropy corpus representation. Takes several hours with Wikipedia corpus.
# LsiModel takes the number of latent dimensions as num_topics.
lsi_transformation = LsiModel(corpus=logent_corpus, id2word=dictionary,
    num_topics=400)

# Alternative way of performing same operation as above, but with implicit chaining
# lsi_transformation = LsiModel(corpus=logent_transformation[bow_corpus], id2word=dictionary,
#    num_topics=400)

# Can persist transformation models, too.
logent_transformation.save("logent.model")
lsi_transformation.save("lsi.model")

### Similarities (the best part)
from gensim.similarities import Similarity

# This index corpus consists of what you want to compare future queries against
index_documents = ["A bear walked in the dark forest.",
             "Tall trees have many more leaves than short bushes.",
             "A starship may someday travel across vast reaches of space to other stars.",
             "Difference is the concept of how two or more entities are not the same."]
# A corpus can be anything, as long as iterating over it produces a representation of the corpus documents as vectors.
corpus = (dictionary.doc2bow(tokenize_func(document)) for document in index_documents)

index = Similarity(corpus=lsi_transformation[logent_transformation[corpus]], num_features=400, output_prefix="shard")

print("Index corpus:")
pprint.pprint(index_documents)

print("Similarities of index corpus documents to one another:")
pprint.pprint([s for s in index])

query = "In the face of ambiguity, refuse the temptation to guess."
sims_to_query = index[lsi_transformation[logent_transformation[dictionary.doc2bow(tokenize_func(query))]]]
print("Similarities of index corpus documents to '%s'" % query)
pprint.pprint(sims_to_query)

best_score = max(sims_to_query)
# Use a separate name so the Similarity index above is not overwritten.
best_index = sims_to_query.tolist().index(best_score)
most_similar_doc = index_documents[best_index]
print("The document most similar to the query is '%s' with a score of %.2f." % (most_similar_doc, best_score))
</code></pre>
</div>
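<p>For readers curious about what the log entropy step and the similarity comparison compute, the following pure-Python sketch may help. It uses the textbook log-entropy formula (local weight log(1 + tf), global weight 1 + sum(p log p) / log(N)); this is an assumption for illustration, and gensim's own implementation may differ in details such as normalization. The function names are hypothetical and the tiny corpus is made up.</p>

```python
import math


def log_entropy_weights(bow_corpus):
    """Return {term_id: global weight} using the textbook log-entropy formula."""
    n_docs = len(bow_corpus)
    # Total count of each term across the whole corpus.
    global_freq = {}
    for doc in bow_corpus:
        for term_id, count in doc:
            global_freq[term_id] = global_freq.get(term_id, 0) + count
    # Sum of p * log(p) per term, where p is the share of the term's
    # occurrences that fall in a given document.
    entropy = dict.fromkeys(global_freq, 0.0)
    for doc in bow_corpus:
        for term_id, count in doc:
            p = count / float(global_freq[term_id])
            entropy[term_id] += p * math.log(p)
    return {t: 1.0 + e / math.log(n_docs) for t, e in entropy.items()}


def transform(doc, weights):
    """Apply local log weighting times the global entropy weight."""
    return [(t, math.log(1 + c) * weights[t]) for t, c in doc if t in weights]


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse (term_id, weight) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up corpus: term 1 occurs once in every document, the others in one only.
bow_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 3)], [(1, 1), (3, 1)]]
weights = log_entropy_weights(bow_corpus)
doc0 = transform(bow_corpus[0], weights)
doc1 = transform(bow_corpus[1], weights)
print(weights)
print(cosine(doc0, doc1))
```

<p>A term spread evenly across every document ends up with a weight near zero, while a term concentrated in one document keeps a weight of one. That is why two documents that share only a ubiquitous term come out as dissimilar after the transformation.</p>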
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2012/05/relatively-quick-and-easy-gensim-example-code/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>An Introduction to gensim: &#8220;Topic Modelling for Humans&#8221;</title>
		<link>http://williamjohnbert.com/2012/05/an-introduction-to-gensim-topic-modelling-for-humans/</link>
		<comments>http://williamjohnbert.com/2012/05/an-introduction-to-gensim-topic-modelling-for-humans/#comments</comments>
		<pubDate>Thu, 03 May 2012 18:06:02 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Cool Stuff]]></category>
		<category><![CDATA[Interests]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[dc-python]]></category>
		<category><![CDATA[gensim]]></category>
		<category><![CDATA[lsa]]></category>
		<category><![CDATA[lsi]]></category>
		<category><![CDATA[presentation]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[similarity]]></category>
		<category><![CDATA[visularity]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=472</guid>
		<description><![CDATA[On Tuesday, I presented at the monthly DC Python meetup. My talk was an introduction to gensim, a free Python framework for topic modelling and semantic similarity using LSA/LSI and other statistical techniques. I&#8217;ve been using gensim on and off for several months at work, and I really appreciate its performance, clean API design, documentation, [...]]]></description>
			<content:encoded><![CDATA[<p>On Tuesday, I presented at the monthly DC Python meetup. My talk was an introduction to gensim, a free Python framework for topic modelling and semantic similarity using LSA/LSI and other statistical techniques. I&#8217;ve been using gensim on and off for several months at work, and I really appreciate its performance, clean API design, documentation, and community. (All of this is due to its creator, Radim Rehurek, whom I interviewed recently.)</p>

<p>The presentation slides are <a href="http://www.slideshare.net/sandinmyjoints/an-introduction-to-gensim-topic-modelling-for-humans">available here</a>. I also wrote some <a href="http://williamjohnbert.com/2012/05/relatively-quick-and-easy-gensim-example-code/">quick gensim example code</a> that walks through creating a corpus, generating and transforming models, and using models to do semantic similarity. The code and slides are both also available on my <a href="https://github.com/sandinmyjoints/gensimtalk">github account</a>.</p>

<p>Finally, I also developed a <a href="http://github.com/sandinmyjoints/visularity">demo app to visualize semantic similarity queries</a>. It&#8217;s a Flask web app, with gensim generating data on the backend that is clustered by scipy and scikit-learn and visualized by d3.js as agglomerative and hierarchical clusters as well as a simple table and dendrogram. To make it all work in realtime, I used threading and hookbox. I call it Visularity, and it&#8217;s <a href="http://github.com/sandinmyjoints/visularity">available on github</a>. You need to provide your own model and dictionary data to use&#8211;check out my presentation and visit <a href="http://radimrehurek.com/gensim">radimrehurek.com/gensim/</a> to learn how. Comments and feedback welcome!</p>
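<p>As a rough illustration of the clustering step, here is a small pure-Python sketch of single-linkage agglomerative clustering over a similarity matrix. It is not Visularity's code (which relies on scipy and scikit-learn); the function name, the threshold parameter, and the similarity values are all made up for the example.</p>

```python
def single_linkage(similarity, threshold):
    """Repeatedly merge the two most similar clusters until none reach `threshold`."""
    clusters = [[i] for i in range(len(similarity))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster similarity is that of the closest pair.
                score = max(similarity[i][j]
                            for i in clusters[a] for j in clusters[b])
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        if score < threshold:
            break
        merges.append((clusters[a], clusters[b], score))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# Made-up pairwise similarities between four documents.
similarity = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
clusters, merges = single_linkage(similarity, threshold=0.5)
print(clusters)
```

<p>The recorded merges, in order, are the information a dendrogram draws: which groups were joined and at what similarity.</p>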
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2012/05/an-introduction-to-gensim-topic-modelling-for-humans/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Interview with Radim Rehurek, creator of gensim</title>
		<link>http://williamjohnbert.com/2012/04/interview-with-radim-rehurek-creator-of-gensim/</link>
		<comments>http://williamjohnbert.com/2012/04/interview-with-radim-rehurek-creator-of-gensim/#comments</comments>
		<pubDate>Mon, 30 Apr 2012 16:58:42 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Cool Stuff]]></category>
		<category><![CDATA[Interests]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[gensim]]></category>
		<category><![CDATA[lda]]></category>
		<category><![CDATA[lsi]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[semantic-similarity]]></category>
		<category><![CDATA[similarity]]></category>
		<category><![CDATA[topic-modeling]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=465</guid>
		<description><![CDATA[Tomorrow at the May 2012 DC Python meetup, I&#8217;m giving a talk on gensim, a Python framework for topic modeling that I use at work and on my own for semantic similarity comparisons. (I&#8217;ll post the slides and example code for the talk soon.) I&#8217;ve found gensim to be a useful and well-designed tool, and [...]]]></description>
			<content:encoded><![CDATA[<p>Tomorrow at the <a href="http://meetup.dcpython.org/events/23832731/">May 2012 DC Python meetup</a>, I&#8217;m giving a talk on <a href="http://radimrehurek.com/gensim/">gensim</a>, a Python framework for topic modeling that I use at work and on my own for semantic similarity comparisons. (I&#8217;ll post the slides and example code for the talk soon.) I&#8217;ve found gensim to be a useful and well-designed tool, and pretty much all credit for it goes to its creator, Radim Rehurek. Radim was kind enough to answer a few questions I sent him about gensim&#8217;s history and goals, and about his background and interests.</p>

<p><strong>WB: Why did you create gensim?</strong></p>

<p>RR: Consulting gig for a digital library project (Czech Digital
Mathematics Library, dml.cz), some 3 years ago. It started off as a
few loosely connected Python scripts to support the &#8220;show similar
articles&#8221; functionality. We wanted to use some of the statistical
methods, like latent semantic analysis. Originally, gensim only
contained wrappers around existing Fortran libraries for SVD, like
Propack and Svdpack.</p>

<p>But there were issues with that, and it scaled badly (all documents in
RAM), so I started looking for more scalable, online algorithms.
Running these popular methods shouldn&#8217;t be so hard, I thought!</p>

<p>In the end, I developed new algorithms for these methods for gensim.
The theoretical part of this research later turned into a part of my
PhD thesis.</p>

<p><strong>Who is using gensim (as far as you know)&#8211;academics, hobbyists, commercial entities, a mixture? Any particularly interesting uses?</strong></p>

<p>Yes, I&#8217;ve heard from many academic as well as commercial
organizations, both on the mailing list and off. Off the top of my
head: ravn.co.uk, roistr.com, sportsauthority.com, larkc.eu; TU of
Denmark, U of Stuttgart, Masaryk U, U of Ghent, some people used it in
the Yahoo! KD cup competition&#8230; But what they all did with gensim, or
whether they still use it, I don&#8217;t know. The gensim license (LGPL) is
pretty liberal in that respect.</p>

<p>Unfortunately, all this use rarely translates into any feedback or
contributions. I guess I&#8217;m just not very good at the
bring-new-developers-and-grow-open-source stuff <img src='http://williamjohnbert.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </p>

<p><strong>Roughly how much of the current codebase was written by you, and how much by contributors?</strong></p>

<p>Almost everything by me, but I am very grateful for bug fixes and
patches. I try to put every contribution from other people into the
changelog: https://github.com/piskvorky/gensim/blob/develop/CHANGELOG.txt
. I made some wiki pages to make contributing easier:
https://github.com/piskvorky/gensim/wiki . I also try to answer
general questions on the mailing list.</p>

<p><strong>What are your favorite features, or parts of the code that you&#8217;re most proud of?</strong></p>

<p>I don&#8217;t have emotional attachments to parts of the code &#8212; if it&#8217;s
bad, it needs to go. I guess the most proven parts are the ones that
had been around for the longest &#8212; LSA etc. Things that were
contributed recently by other people, like the new HDP (hierarchical
dirichlet process) code, or the <code>gensim.parsing</code> subpackage, are the
most rough around the edges.</p>

<p>The best feature is the memory independence for sure. Most
implementations of the statistical semantics methods assume the
training data resides in RAM, which limits their use to small/medium
corpora. Also they work in batch mode, needing a full re-train when
new training data arrives. The LSA/LDA algos are online though (can be
updated with new data, incrementally).</p>

<p><strong>What&#8217;s your background? Academic, software engineering, both?</strong></p>

<p>I finished my PhD, but I feel more like a software engineer than a
pure researcher. Even during my academic years, I was working in IT
commerce. I wouldn&#8217;t like to stay in academia professionally.</p>

<p><strong>What are you working on next for gensim? What about outside of gensim?</strong></p>

<p>Small things like adding the &#8220;hashing trick&#8221; etc:
https://github.com/piskvorky/gensim/issues . Basically things that
gensim users have been asking for. Some issues keep coming back on the
mailing list, and while not technically bugs, they hint at minor
redesigns and improvements.</p>

<p>One big thing that is missing is a basic visual style for gensim. I
have no clue how to do that and it&#8217;s really pathetic gensim doesn&#8217;t
even have a logo yet!</p>

<p>Outside of gensim, I am busy doing consulting (scaling up text
processing: fulltext search, semantic search, ad targeting etc &#8211;
backend stuff). I&#8217;m planning to do a startup that offers semantic
search and similarity as a service. A kind of easy-to-use black box
tool, something like searchify or myrrix. But it&#8217;s hard to find good
people to work with&#8230; and hard to give up/interrupt a well-paying
career <img src='http://williamjohnbert.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  I applied for YC last month, alone, but they turned me down.</p>
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2012/04/interview-with-radim-rehurek-creator-of-gensim/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ExtJS TreeStore trouble with nested nodes</title>
		<link>http://williamjohnbert.com/2012/04/extjs-treestore-trouble-with-nested-nodes/</link>
		<comments>http://williamjohnbert.com/2012/04/extjs-treestore-trouble-with-nested-nodes/#comments</comments>
		<pubDate>Thu, 19 Apr 2012 14:23:07 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Web Development]]></category>
		<category><![CDATA[javascript extjs workaround]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=455</guid>
		<description><![CDATA[At work, we&#8217;re building an app to edit objects in a database&#8211;a classic CRUD application. For now, we&#8217;re trying out ExtJS as the client-side UI framework. One of the use cases is selecting and editing nested objects, represented in our relational database with foreign keys. Let&#8217;s call the root object a Task, which consists of [...]]]></description>
			<content:encoded><![CDATA[<p>At work, we&#8217;re building an app to edit objects in a database&#8211;a classic CRUD application. For now, we&#8217;re trying out ExtJS as the client-side UI framework. One of the use cases is selecting and editing nested objects, represented in our relational database with foreign keys. Let&#8217;s call the root object a Task, which consists of nested Goals, which have Steps. Each of those is defined by a model on the backend that is more or less mimicked by an Ext.data.Model on the client-side, and each model has a proxy to a RESTful endpoint on the backend for create/retrieve/update/delete operations. We want to use an Ext.tree.TreePanel for the UI, so we hold the data in an Ext.data.TreeStore. So far so good. </p>

<p>We coded up our prototype, but when a user selects a Task, Ext JS throws this error: <code>Uncaught TypeError: Cannot read property 'internalId' of undefined</code>. Hmm. Everything seems to be working. Our models are loading the correct data. No obvious bugs. A lot of inspecting and googling and reading documentation later, I discover <a href="http://www.sencha.com/forum/archive/index.php/t-160068.html?s=03fb3a67ebf1e1ef856bc5f277ad12e8" title="this thread">this thread</a>. The key quote: </p>

<blockquote>It doesn&#8217;t matter if the [model] ids are unique within the JSON [or any data]. It must be unique within the tree. 

If you add the first json to the tree with for example the id 4_1 and you add the second json with again a node 4_1 then you have two nodes with the same id. </blockquote>

<p>In other words, TreeStore doesn&#8217;t distinguish the types of roots and their children (or children&#8217;s children, etc). To TreeStore, they are <strong>all</strong> nodes, and ids must be unique across all nodes. If you have an instance of a Task model with id=1 and it has a foreign key to a Goal that also has id=1, TreeStore has a problem with that. Apparently it doesn&#8217;t introspect the objects enough to see that, say, one is a Task and its children are Goals, despite the Task model having a <code>hasMany</code> field that defines its relation to the Goal model. That seems counterintuitive to me, maybe even misleading. Perhaps that&#8217;s why we&#8217;re not the only ones who&#8217;ve <a href="http://www.sencha.com/forum/showthread.php?129524-CLOSED-Selection-of-Association-in-DataView">had</a> <a href="http://www.sencha.com/forum/showthread.php?135285-TreeStore-Model-and-quot-id-quot-field">this</a> <a href="http://www.sencha.com/forum/showthread.php?196396-How-to-add-children-tree-nodes-dynamically">problem</a>. </p>

<p>My quick fix was to write a <code>stringify_id()</code> function to wrap ids passed to the TreeStore with a prefix unique to each type, so the id of Task id=1 becomes &#8220;task-1&#8221;. <code>destringify_id()</code> unwraps the ids that come back through the proxy. </p>
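
<p>The real helpers lived in our client-side code and aren&#8217;t shown here, but the idea is small enough to sketch. The version below is in Python purely for illustration; the function signatures, the lowercase prefix, and the assumption that raw ids are integers are all mine, not anything Ext JS requires:</p>

```python
def stringify_id(type_name, raw_id):
    """Wrap a database id with a per-type prefix so it is unique tree-wide."""
    return "%s-%s" % (type_name.lower(), raw_id)


def destringify_id(node_id):
    """Undo stringify_id(): split the prefix off and restore the integer id."""
    # rpartition splits on the LAST hyphen, so a prefix may itself contain one.
    type_name, _, raw_id = node_id.rpartition("-")
    return type_name, int(raw_id)


# A Task and a Goal that share id=1 no longer collide inside the tree.
print(stringify_id("Task", 1))
print(stringify_id("Goal", 1))
print(destringify_id("task-1"))
```

<p>Any scheme works as long as the prefix is unique per model type and the wrapping can be reversed before the id goes back to the server.</p>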

<p>TreeStore&#8217;s <a href="http://docs.sencha.com/ext-js/4-0/#!/api/Ext.data.TreeStore">docs</a> do not mention this restriction, as far as I can tell. Maybe if you purchase Ext JS, you get better docs, I&#8217;m not sure. We may be doing just that, so I could have a chance to find out. One of the complaints you sometimes hear about open source is that the docs aren&#8217;t that great, so I&#8217;m curious to see how a for-profit company&#8217;s docs stack up against the documentation culture of the communities I&#8217;m most familiar with (Python and Django), which tend to be pretty solid.</p>
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2012/04/extjs-treestore-trouble-with-nested-nodes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fake bio for Steve</title>
		<link>http://williamjohnbert.com/2012/04/fake-bio-for-steve/</link>
		<comments>http://williamjohnbert.com/2012/04/fake-bio-for-steve/#comments</comments>
		<pubDate>Fri, 06 Apr 2012 17:53:07 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Cool Stuff]]></category>
		<category><![CDATA[Interests]]></category>
		<category><![CDATA[826dc lowercase plagiarism wikipedia writing]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=444</guid>
		<description><![CDATA[My good friend Steve has hosted the lowercase, the monthly reading series associated with 826DC, for three years. Steve has a charming habit of introducing his readers with made-up bios, so in his honor, I asked some lowercase regulars to write fake bios of him and share them at the third anniversary reading on April [...]]]></description>
			<content:encoded><![CDATA[<p>My good friend Steve has hosted <a href="http://826dc.org/?p=3336" title="the lowercase">the lowercase</a>, the monthly reading series associated with <a href="http://826dc.org/" title="826DC">826DC</a>, for three years. Steve has a charming habit of introducing his readers with made-up bios, so in his honor, I asked some lowercase regulars to write fake bios of him and share them at the third anniversary reading on April 4. The results were highly entertaining; thanks to everyone who wrote one! </p>

<p>Here&#8217;s mine:</p>

<blockquote>Steve Souryal is a group of 15 small islets and rocks in the central equatorial Atlantic Ocean. He lies in the Intertropical Convergence Zone, a region of severe storms. Steve exposes serpentinized abyssal mantle peridotite and kaersutite-bearing ultramafic mylonite on the top of the second-largest megamullion in the world (after the Parece Vela megamullion under Okinotoshima in the Pacific). He is the only location in the Atlantic Ocean where the abyssal mantle is exposed above sea level! In 1986, Steve was designated an environmentally protected area, and since 1998, the Danish Navy has maintained a permanently manned research facility in him. His main economic activity is tuna fishing, and we are incredibly lucky to have him with us tonight.
</blockquote>

<p>Apologies to Wikipedia. But somehow, it just feels right.</p>
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2012/04/fake-bio-for-steve/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to install accelerated BLAS into a Python virtualenv</title>
		<link>http://williamjohnbert.com/2012/03/how-to-install-accelerated-blas-into-a-python-virtualenv/</link>
		<comments>http://williamjohnbert.com/2012/03/how-to-install-accelerated-blas-into-a-python-virtualenv/#comments</comments>
		<pubDate>Sat, 24 Mar 2012 00:43:33 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Web Development]]></category>
		<category><![CDATA[python virtualenv blas ubuntu]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=426</guid>
		<description><![CDATA[Background Some mathematically intense operations that use Numpy/Scipy can run faster with accelerated basic linear algebra subroutine (BLAS) libraries installed on your system (e.g., gensim&#8217;s corpus processing). To see what BLAS libraries you are using, do: python -c 'import numpy; numpy.show_config()' If none of them are installed, you probably want to install one or more. [...]]]></description>
			<content:encoded><![CDATA[<h1>Background</h1>

<p>Some mathematically intense operations that use Numpy/Scipy can run faster with accelerated basic linear algebra subroutine (BLAS) libraries installed on your system (e.g., <a href="http://radimrehurek.com/gensim/">gensim&#8217;s</a> corpus processing). </p>

<p>To see what BLAS libraries you are using, do:</p>

<pre><code>python -c 'import numpy; numpy.show_config()'
</code></pre>

<p>If none of them are installed, you probably want to install one or
more. <a href="http://math-atlas.sourceforge.net/">ATLAS</a> is always a good bet, since it&#8217;s portable and
self-optimizing. There are others out there targeted at particular CPU architectures.</p>

<p>Unfortunately, the <a href="http://docs.scipy.org/doc/numpy/user/install.html">Scipy docs</a> are out of date regarding installing accelerated BLAS libraries on Ubuntu. The instructions below work for Ubuntu 10.04, the current LTS (long-term support) version. I haven&#8217;t tried them on more recent versions, but they may work there as well.</p>

<h1>Prereqs</h1>

<p>On Ubuntu 10.04, and possibly other versions, you need liblapack-dev and gfortran (yes, fortran):</p>

<pre><code>sudo apt-get install liblapack-dev
sudo apt-get install gfortran
</code></pre>

<h1>Instructions</h1>

<p>Install the accelerated linear algebra libraries (ATLAS/LAPACK) in your virtualenv on Ubuntu:</p>

<pre><code>#!/bin/bash
workon [envname]
pip uninstall numpy # only if numpy is already installed
pip uninstall scipy # only if scipy is already installed
export LAPACK=/usr/lib/liblapack.so
export ATLAS=/usr/lib/libatlas.so
export BLAS=/usr/lib/libblas.so
</code></pre>
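
<p>Before building, it may be worth confirming that the three exported paths actually exist on your machine, since a wrong path can be easy to miss. Here is a minimal sketch using only the standard library; the helper name <code>check_blas_env</code> is made up for this post, and it only checks that the files exist, not that numpy will end up linking against them:</p>

```python
import os


def check_blas_env(environ=None, names=("LAPACK", "ATLAS", "BLAS")):
    """Return {name: (path, exists)} for each library environment variable."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in names:
        path = environ.get(name)  # None when the variable is not exported
        report[name] = (path, bool(path) and os.path.exists(path))
    return report


if __name__ == "__main__":
    # Run this in the same shell session where the exports were made.
    for name, (path, found) in check_blas_env().items():
        print("%-6s %-30s %s" % (name, path, "ok" if found else "MISSING"))
```

<p>Once numpy is installed, <code>numpy.show_config()</code> from the Background section remains the real test of which libraries it picked up.</p>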

<p>Now you can install numpy and scipy into the same virtualenv and be confident they will perform operations using the accelerated BLAS routines:</p>

<pre><code>pip install numpy
pip install scipy
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2012/03/how-to-install-accelerated-blas-into-a-python-virtualenv/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Novelties &amp; Traditions</title>
		<link>http://williamjohnbert.com/2011/11/novelties-traditions/</link>
		<comments>http://williamjohnbert.com/2011/11/novelties-traditions/#comments</comments>
		<pubDate>Sat, 19 Nov 2011 15:28:21 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Cool Stuff]]></category>
		<category><![CDATA[Interests]]></category>
		<category><![CDATA[Publications]]></category>
		<category><![CDATA[Teaching]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=394</guid>
		<description><![CDATA[Today&#8217;s the third annual Friendsgiving, a Thanksgiving-like pre-Thanksgiving event for a bunch of people who like each other; hence, Friendsgiving. Thanksgiving&#8217;s always been my favorite holiday so I&#8217;m more than happy to celebrate it twice a year. The first two Friendsgivings took place at my house, but because in the spring I traded my room [...]]]></description>
			<content:encoded><![CDATA[<p>Today&#8217;s the third annual Friendsgiving, a Thanksgiving-like pre-Thanksgiving event for a bunch of people who like each other; hence, Friendsgiving. Thanksgiving&#8217;s always been my favorite holiday so I&#8217;m more than happy to celebrate it twice a year. The first two Friendsgivings took place at my house, but because in the spring I traded my room in a cavernous and amply chandeliered group rowhouse for cozier and warmer digs, the honor of hosting this year falls to two friends who&#8217;re renting an entire lovely house for themselves up in Pleasant Plains. Sweet.</p>

<p>So much for traditions; recent novelties include starting a new job, about which more another time, but basically, I love it; and getting a lesson plan published in <a href="http://www.amzn.com/111802432X">Don&#8217;t Forget to Write</a>, the second volume of lesson plans from <a href="http://www.826national.org/">826</a>. The lesson plan, &#8220;Busted,&#8221; aims to make storytellers out of middle schoolers by having them write about a time they got caught doing something they shouldn&#8217;t have been doing&#8211;a theme first cooked up by the folks who led the <a href="http://826dc.org/?p=510">Get Used to the Seats</a> book project. I <a href="http://williamjohnbert.com/2010/11/caught-in-the-act-part-3/">wrote about leading the workshops that ultimately became the &#8220;Busted&#8221; lesson plan</a> more than a year ago&#8211;right around the previous Friendsgiving. Hard to believe it&#8217;s been that long!</p>
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2011/11/novelties-traditions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Gender, programming, and the power of language</title>
		<link>http://williamjohnbert.com/2011/08/gender-programming-and-the-power-of-language/</link>
		<comments>http://williamjohnbert.com/2011/08/gender-programming-and-the-power-of-language/#comments</comments>
		<pubDate>Sun, 28 Aug 2011 20:15:02 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Cool Stuff]]></category>
		<category><![CDATA[Other]]></category>
		<category><![CDATA[gender]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=385</guid>
		<description><![CDATA[An interlude from the recent trend of hardcore Django action: When I spoke with a female intern this summer, she recounted how, in 2006, the GNOME Project, a free and open source software project, received almost 200 Google Summer of Code applicants. All of them were male. When GNOME advertised an [...]]]></description>
			<content:encoded><![CDATA[<p>An interlude from the recent trend of hardcore Django action:</p>

<blockquote>When I spoke with a female intern this summer, she recounted how, in 2006, the 
GNOME Project, a free and open source software project, received almost 200 Google 
Summer of Code applicants. All of them were male. When GNOME advertised an 
identical program for women, emphasizing opportunities for learning and mentorship 
instead of tough competition, they received more than 100 highly qualified female 
applicants for the three spots they were able to fund. What amazed me even more was 
when she suggested that our own company slogan — “We Help the World’s Best 
Developers Make Better Software” — might alienate prospective female candidates. 
That had never occurred to me. But according to our intern, in the world of 
computer science, “when you hear the phrase ‘the world’s best developers,’ you see 
a guy.”</blockquote>

<p>From <a href="http://www.washingtonpost.com/opinions/when-computer-programming-was-womens-work/2011/08/24/gIQAdixGgJ_print.html">When computer programming was ‘women’s work’</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2011/08/gender-programming-and-the-power-of-language/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>django-social-auth: Installing and troubleshooting</title>
		<link>http://williamjohnbert.com/2011/08/django-social-auth-installing-and-troubleshooting/</link>
		<comments>http://williamjohnbert.com/2011/08/django-social-auth-installing-and-troubleshooting/#comments</comments>
		<pubDate>Fri, 26 Aug 2011 15:16:14 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Cool Stuff]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Web Development]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=374</guid>
		<description><![CDATA[Thanks to django-registration, I was able to build a working account registration/login system pretty easily. But I wanted to give users the ability to use their existing accounts through popular services such as Facebook, Twitter, etc., rather than have to create yet another account. Here&#8217;s how I did it. Sorting Through the Choices There are [...]]]></description>
			<content:encoded><![CDATA[<p>Thanks to <code>django-registration</code>, I was able to build a working account registration/login system pretty easily. But I wanted to give users the ability to use their existing accounts through popular services such as Facebook, Twitter, etc., rather than have to create yet another account. Here&#8217;s how I did it.</p>

<h1>Sorting Through the Choices</h1>

<p>There are a number of reusable Django apps out there to help with registration/login from social media sites. I found this <a href="http://hackerluddite.wordpress.com/2011/05/17/review-of-4-django-social-auth-apps/">Review of 4 Django Social Auth apps</a> very helpful in sorting out the options. After reading it, I was left to choose between <a href="https://github.com/omab/django-social-auth"><code>django-social-auth</code></a> (I originally linked to the wrong app here, but this link is correct) and <a href="https://github.com/pennersr/django-allauth"><code>django-allauth</code></a>. In the end, I went with <code>django-social-auth</code> (not to be confused with <code>django-socialauth</code>) because a friend had recommended it and because I&#8217;d already installed it before I read this article. However, the article&#8217;s conclusion that <code>django-allauth</code> is best out of the box also seems valid.</p>

<h1>Installation</h1>

<p>The instructions in <a href="http://django-social-auth.readthedocs.org/en/latest/"><code>django-social-auth</code>&#8217;s docs</a> are helpful in walking you through available settings and options. </p>

<p>I also found the included example app useful. To use this app, I cloned <code>django-social-auth</code>&#8217;s git repo, created a virtualenv called <code>django-social-auth</code>, ran <code>pip install -r requirements.txt</code> inside this virtualenv to install all the required apps, ran <code>manage.py syncdb</code>, and finally ran <code>manage.py runserver</code>. Voila, the example app is up and running at 127.0.0.1, showing a simple screen with options to log in through about a dozen different services.</p>

<h1>API Keys</h1>

<p>The first service I tested was Twitter. I use it more than any others, and I already had the API keys for it. I threw my API key and secret key into the example <code>local_settings.py</code> file provided with <code>django-social-auth</code> and tried to log in via the example app. Boom: <code>401 Unauthorized</code>. I double-checked all my settings and installation and whatnot. Seemed fine. </p>

<p>I turned my attention to the API keys. The ones I had were generated for <a href="http://www.readsrs.com">Readsr</a>, i.e., I entered readsrs.com as the domain when I generated them at dev.twitter.com. But now I was running on localhost, 127.0.0.1, so I suspected the readsrs.com keys wouldn&#8217;t be valid. I wasn&#8217;t sure whether Twitter would hand over a new consumer key for 127.0.0.1, or baulk at the request. (It seemed like it should hand one over, but I hadn&#8217;t seen any instructions anywhere that said to get a key for your development machine.) Turns out Twitter will happily give you a key for 127.0.0.1. Once I plugged the new keys in, I was able to log in with my Twitter credentials, and just as it should, <code>django-social-auth</code> automatically created an <code>auth.user</code> for this account. </p>

<h1>Integrating with Readsr</h1>

<p>I followed the instructions again to configure my own app, Readsr. To add a login option using Twitter credentials, I added a link to the reversed view that begins the <code>django-social-auth</code> login process for Twitter, i.e., <code>{% url socialauth_begin "twitter" %}</code>, to my login template. And it worked.</p>

<p>I still need to fix a few oddities. For example, Twitter returns my first and last names together in <code>first_name</code> (or else <code>django-social-auth</code> is concatenating them into that column), and doesn&#8217;t supply any email address. But the basic functionality is there, and was relatively easy to achieve.</p>

<h1>Postscript</h1>

<p>The author of the article I linked above had an error using OpenID when using <code>django-social-auth</code>, which is why he preferred <code>django-allauth</code>. He filed a bug for the error he got, and I notice that <a href="https://github.com/omab/django-social-auth/issues/67">it was closed</a> 15 hours ago (though if you read the comments, it seems it was actually fixed back in mid-July). Good timing.</p>
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2011/08/django-social-auth-installing-and-troubleshooting/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>How to: Unit testing in Django with mocking and patching</title>
		<link>http://williamjohnbert.com/2011/07/how-to-unit-testing-in-django-with-mocking-and-patching/</link>
		<comments>http://williamjohnbert.com/2011/07/how-to-unit-testing-in-django-with-mocking-and-patching/#comments</comments>
		<pubDate>Fri, 08 Jul 2011 14:56:29 +0000</pubDate>
		<dc:creator>William</dc:creator>
				<category><![CDATA[Web Development]]></category>
		<category><![CDATA[datetime]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[mocking]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[unit testing]]></category>

		<guid isPermaLink="false">http://williamjohnbert.com/?p=346</guid>
		<description><![CDATA[Background For Readsr, I need to track events that recur on a particular day of the week (e.g., first Sunday of the month, third Friday of the month). I created a DayOfWeek model to store any particular event&#8217;s day of the week. It contains a method next&#95;day&#95;of&#95;week() to return a datetime.date object set to the [...]]]></description>
			<content:encoded><![CDATA[<h3>Background</h3>

<p>For <a href="http://www.readsrs.com">Readsr</a>, I need to track events that recur on a particular day of the week (e.g., first Sunday of the month, third Friday of the month). I created a DayOfWeek model to store any particular event&#8217;s day of the week. It contains a method next&#95;day&#95;of&#95;week() to return a datetime.date object set to the next occurrence of whatever weekday a given event instance is set to (this helps with figuring out when the next occurrence of an event is).</p>

<p>It&#8217;s easier to show through an example. On Sunday 7/3/2011:</p>

<ul>
<li>For an object with DayOfWeek set to Sunday, next&#95;day&#95;of&#95;week() would return 7/3/2011 (current day).</li>
<li>For DayOfWeek set to Monday, it would return 7/4/2011 (first subsequent Monday).</li>
<li>For DayOfWeek set to Saturday, it would return 7/9/2011 (first subsequent Saturday).</li>
</ul>

<p>Sounds simple enough. It seemed like this would be a good place to do my first unit tests.</p>
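
<p>The model code itself isn&#8217;t shown in this post, but the three examples above can be reproduced with a few lines of date arithmetic. This is only a sketch under my own assumptions: it is a standalone function rather than a model method, weekdays are numbered the way <code>date.weekday()</code> numbers them (Monday is 0, Sunday is 6), and the optional <code>today</code> argument exists only so the examples can be pinned to a fixed date:</p>

```python
from datetime import date, timedelta


def next_day_of_week(weekday, today=None):
    """Return the next date falling on `weekday`, counting today itself."""
    today = date.today() if today is None else today
    # Days until the target weekday; 0 when today already matches.
    days_ahead = (weekday - today.weekday()) % 7
    return today + timedelta(days=days_ahead)


sunday = date(2011, 7, 3)
print(next_day_of_week(6, sunday))  # target: Sunday
print(next_day_of_week(0, sunday))  # target: Monday
print(next_day_of_week(5, sunday))  # target: Saturday
```
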

<h3>Unit Testing</h3>

<p>To do unit testing, the typical method is to first write test cases and then write code. In this case, I&#8217;d already written my code, so I went back and wrote test cases, trying to forget how my code worked. </p>

<p>To write test cases, you have to detail requirements for each method you want to test: input and expected (correct) output. The list of examples for 
next&#95;day&#95;of&#95;week() I wrote above works for this purpose. But there&#8217;s a catch: next&#95;day&#95;of&#95;week() calculates the next day of the week relative to the current date, by calling datetime.date.today(). So if I write expected output for 7/3/2011, it will no longer be the correct output on 7/4/2011 or any following day. I needed a way to make datetime.date.today() always spit out my input date when I run tests, yet still continue to function normally outside of testing. Enter mocking.</p>

<h3>Mocking</h3>

<p>The solution was to mock out the method—to replace the real datetime.date.today() with a fake one that produces the same output no matter what day it is. To accomplish this, I used the powerful <a href="http://www.voidspace.org.uk/python/mock/">Mock library</a>. Specifically, I needed to use the patch decorator. This decorator makes it really easy to replace one particular object within the scope of a particular method. </p>

<p>Before I could patch the today() method, I needed to create my own fake method. It would look like this:</p>


<div class="my_syntax_box"><span class="my_syntax_Bar">Code:</span><div class="my_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">datetime</span> <span style="color: #ff7700;font-weight:bold;">import</span> date

<span style="color: #ff7700;font-weight:bold;">def</span> faketoday<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> date<span style="color: black;">&#40;</span><span style="color: #ff4500;">2011</span><span style="color: #66cc66;">,</span> <span style="color: #ff4500;">7</span><span style="color: #66cc66;">,</span> <span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span></pre></td></tr></table></div></div>


<p>There&#8217;s a problem, though, when I try to patch (or mock out) the method:</p>


<div class="my_syntax_box"><span class="my_syntax_Bar">Code:</span><div class="my_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">import</span> mock
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">def</span> faketoday<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
...     <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #dc143c;">datetime</span>.<span style="color: black;">date</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">2011</span><span style="color: #66cc66;">,</span> <span style="color: #ff4500;">7</span><span style="color: #66cc66;">,</span> <span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
... 
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #66cc66;">@</span>mock.<span style="color: black;">patch</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;datetime.date.today&quot;</span><span style="color: #66cc66;">,</span> faketoday<span style="color: black;">&#41;</span>
... <span style="color: #ff7700;font-weight:bold;">def</span> testfunc<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
...     <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #dc143c;">datetime</span>.<span style="color: black;">date</span>.<span style="color: black;">today</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
... 
<span style="color: #66cc66;">&gt;&gt;&gt;</span> testfunc<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
Traceback <span style="color: black;">&#40;</span>most recent call last<span style="color: black;">&#41;</span>:
  File <span style="color: #483d8b;">&quot;&lt;console&gt;&quot;</span><span style="color: #66cc66;">,</span> line <span style="color: #ff4500;">1</span><span style="color: #66cc66;">,</span> <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #66cc66;">&lt;</span>module<span style="color: #66cc66;">&gt;</span>
  File <span style="color: #483d8b;">&quot;/Users/wbert/.virtualenvs/readsr_env/lib/python2.6/site-packages/mock.py&quot;</span><span style="color: #66cc66;">,</span> line <span style="color: #ff4500;">561</span><span style="color: #66cc66;">,</span> <span style="color: #ff7700;font-weight:bold;">in</span> patched
    arg <span style="color: #66cc66;">=</span> patching.__enter__<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
  File <span style="color: #483d8b;">&quot;/Users/wbert/.virtualenvs/readsr_env/lib/python2.6/site-packages/mock.py&quot;</span><span style="color: #66cc66;">,</span> line <span style="color: #ff4500;">623</span><span style="color: #66cc66;">,</span> <span style="color: #ff7700;font-weight:bold;">in</span> __enter__
    <span style="color: #008000;">setattr</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">target</span><span style="color: #66cc66;">,</span> <span style="color: #008000;">self</span>.<span style="color: black;">attribute</span><span style="color: #66cc66;">,</span> new_attr<span style="color: black;">&#41;</span>
<span style="color: #008000;">TypeError</span>: can<span style="color: #483d8b;">'t set attributes of built-in/extension type '</span><span style="color: #dc143c;">datetime</span>.<span style="color: black;">date</span><span style="color: #483d8b;">'
&gt;&gt;&gt;</span></pre></td></tr></table></div></div>


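<p>datetime.date is a built-in/extension type (it&#8217;s implemented in C), so Python won&#8217;t let you set attributes on it, and that is exactly what patching today() tries to do.</p>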

<h3>Modifying a Class That Can&#8217;t Be Modified</h3>

<p>The trick is to write a child class that can be modified, and thus faked:</p>


<div class="my_syntax_box"><span class="my_syntax_Bar">Code:</span><div class="my_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">class</span> FakeDate<span style="color: black;">&#40;</span>date<span style="color: black;">&#41;</span>:
	<span style="color: #483d8b;">&quot;A fake replacement for date that can be mocked for testing.&quot;</span>
	<span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__new__</span><span style="color: black;">&#40;</span>cls<span style="color: #66cc66;">,</span> *args<span style="color: #66cc66;">,</span> **kwargs<span style="color: black;">&#41;</span>:
		<span style="color: #ff7700;font-weight:bold;">return</span> date.<span style="color: #0000cd;">__new__</span><span style="color: black;">&#40;</span>date<span style="color: #66cc66;">,</span> *args<span style="color: #66cc66;">,</span> **kwargs<span style="color: black;">&#41;</span></pre></td></tr></table></div></div>


<p>All this does is create a class whose constructor returns an instance of its parent class, date. Usually, this would be pointless, but it&#8217;s useful here because the new class isn&#8217;t a built-in, so its attributes can be set and it can be mocked.</p>
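
<p>As a quick sanity check, here is a minimal stand-alone sketch (plain Python, no Django or mock needed) of the difference between the two classes. The variable names builtin_is_settable and fixed_today are mine, just for illustration:</p>

```python
from datetime import date


class FakeDate(date):
    """A subclass of date; unlike date itself, its attributes can be replaced."""

    def __new__(cls, *args, **kwargs):
        # Build a plain date, so values compare and print like real dates.
        return date.__new__(date, *args, **kwargs)


# Assigning an attribute on the built-in type raises TypeError...
try:
    date.today = classmethod(lambda cls: date(2011, 7, 3))
    builtin_is_settable = True
except TypeError:
    builtin_is_settable = False

# ...but the same assignment on the subclass is allowed.
FakeDate.today = classmethod(lambda cls: date(2011, 7, 3))
fixed_today = FakeDate.today()
```

<p>Note that FakeDate(2011, 7, 3) gives back a plain date, so values built through the fake still compare equal to real dates.</p>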

<p>To use it, we decorate any test method that exercises code calling datetime.date.today() with a patch that replaces datetime.date with FakeDate, and we give FakeDate a fake today() method that always returns the particular date we want to use for testing:</p>


<div class="my_syntax_box"><span class="my_syntax_Bar">Code:</span><div class="my_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">class</span> TestDayOfWeek<span style="color: black;">&#40;</span>TestCase<span style="color: black;">&#41;</span>:
	<span style="color: #483d8b;">&quot;&quot;&quot;Test the day of the week functions.&quot;&quot;&quot;</span>
&nbsp;
	<span style="color: #66cc66;">@</span>mock.<span style="color: black;">patch</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'series.models.date'</span><span style="color: #66cc66;">,</span> FakeDate<span style="color: black;">&#41;</span>
	<span style="color: #ff7700;font-weight:bold;">def</span> test_valid_my_next_day_of_week_sameday<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
		<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">datetime</span> <span style="color: #ff7700;font-weight:bold;">import</span> date
		FakeDate.<span style="color: black;">today</span> <span style="color: #66cc66;">=</span> <span style="color: #008000;">classmethod</span><span style="color: black;">&#40;</span><span style="color: #ff7700;font-weight:bold;">lambda</span> cls: date<span style="color: black;">&#40;</span><span style="color: #ff4500;">2011</span><span style="color: #66cc66;">,</span> <span style="color: #ff4500;">7</span><span style="color: #66cc66;">,</span> <span style="color: #ff4500;">3</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;"># July 3, 2011 is a Sunday</span>
		new_day_of_week <span style="color: #66cc66;">=</span> DayOfWeek.<span style="color: black;">objects</span>.<span style="color: black;">create</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
		new_day_of_week.<span style="color: black;">day</span> <span style="color: #66cc66;">=</span> <span style="color: #483d8b;">&quot;SU&quot;</span>
		<span style="color: #008000;">self</span>.<span style="color: black;">assertEqual</span><span style="color: black;">&#40;</span>new_day_of_week.<span style="color: black;">my_next_day_of_week</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> date<span style="color: black;">&#40;</span><span style="color: #ff4500;">2011</span><span style="color: #66cc66;">,</span> <span style="color: #ff4500;">7</span><span style="color: #66cc66;">,</span> <span style="color: #ff4500;">3</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></pre></td></tr></table></div></div>


<p>A couple of things to note: the patch applies only within the decorated method, so every test method that needs the patch must be decorated. Also, the real datetime.date is imported inside the method so we can use it in the fake today() method. We could define this fake today() method inside FakeDate itself, but assigning a lambda (anonymous function) wrapped in classmethod inside the test gives us the flexibility to set a particular date for each test case.</p>
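
<p>Both points can be seen in a small stand-alone sketch. It makes a few assumptions that are not in my project: it uses unittest.mock from the standard library (where the mock package now lives in Python 3), a hypothetical models namespace object standing in for series.models so that it runs without Django, and a made-up is_sunday_today() function as the code under test:</p>

```python
import types
from datetime import date
from unittest import mock


class FakeDate(date):
    """A fake replacement for date that can be mocked for testing."""

    def __new__(cls, *args, **kwargs):
        return date.__new__(date, *args, **kwargs)


# Hypothetical stand-in for series.models: an object holding a `date` name.
models = types.SimpleNamespace(date=date)


def is_sunday_today():
    # Looks up `date` through the namespace at call time, as module code does.
    return models.date.today().weekday() == 6


@mock.patch.object(models, "date", FakeDate)
def check_on_a_sunday():
    FakeDate.today = classmethod(lambda cls: date(2011, 7, 3))  # a Sunday
    return models.date is FakeDate, is_sunday_today()


@mock.patch.object(models, "date", FakeDate)
def check_on_a_monday():
    FakeDate.today = classmethod(lambda cls: date(2011, 7, 4))  # a Monday
    return models.date is FakeDate, is_sunday_today()


sunday_result = check_on_a_sunday()
monday_result = check_on_a_monday()
# Once each decorated function returns, the patch is undone.
restored = models.date is date
```

<p>One caveat with this approach: the today() assigned to FakeDate stays in place after the test finishes, so each test should set its own date rather than rely on a previous one.</p>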

<h3>Namespacing</h3>

<p>You may be wondering why the patch decorator takes &#8220;series.models.date&#8221; as the name to replace instead of &#8220;datetime.date&#8221;. That was how I tried it at first, and I was confused when it didn&#8217;t work. It seemed as if the patch hadn&#8217;t taken effect.</p>

<p>Well, it hadn&#8217;t. That&#8217;s because within the module being tested (models.py in the series app, or series.models in Python dotted notation), date has been imported like so:</p>


<div class="my_syntax_box"><span class="my_syntax_Bar">Code:</span><div class="my_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">datetime</span> <span style="color: #ff7700;font-weight:bold;">import</span> date</pre></td></tr></table></div></div>


<p>This means that within series.models, date is now available as series.models.date. That&#8217;s the name the code under test actually looks up, so that&#8217;s the name that needs to be mocked out. For more on namespacing when mocking, check out <a href="http://www.voidspace.org.uk/python/mock/patch.html#id2">Mock&#8217;s Where to patch documentation</a>.</p>
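
<p>Here is a rough, self-contained sketch of the same effect outside Django. It assumes unittest.mock from the standard library, and it builds a throwaway module with the hypothetical name models_demo that imports date the same way series.models does:</p>

```python
import sys
import types
from datetime import date
from unittest import mock


class FakeDate(date):
    """A fake replacement for date that can be mocked for testing."""

    def __new__(cls, *args, **kwargs):
        return date.__new__(date, *args, **kwargs)


FakeDate.today = classmethod(lambda cls: date(2011, 7, 3))

# Build a throwaway module (hypothetical name) that does its own
# "from datetime import date", then register it so patch() can find it.
demo = types.ModuleType("models_demo")
exec(
    "from datetime import date\n"
    "def today_in_module():\n"
    "    return date.today()\n",
    demo.__dict__,
)
sys.modules["models_demo"] = demo

# Patching the name in datetime leaves the module's own `date` name alone.
with mock.patch("datetime.date", FakeDate):
    seen_after_patching_datetime = demo.date

# Patching the name where it is looked up is what takes effect.
with mock.patch("models_demo.date", FakeDate):
    seen_after_patching_module = demo.date
    patched_today = demo.today_in_module()
```

<p>In the first with block the module still sees the real date class, which matches what I ran into when I patched &#8220;datetime.date&#8221;.</p>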

<p>Now we can supply our unit tests with any date we want, so we know what the results should be and can test against them.</p>

<h3>References</h3>

<p>While learning how to do this, I posted <a href="http://stackoverflow.com/questions/6575687/how-do-i-use-mocking-to-test-a-next-day-of-week-function">my first question at Stack Overflow</a> (I ended up answering it myself). I also learned about using a fake class <a href="http://stackoverflow.com/questions/4481954/python-trying-to-mock-datetime-date-today-but-not-working">from this question</a>. Finally, the <a href="http://www.voidspace.org.uk/python/mock/index.html">Mock documentation</a> was very helpful as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://williamjohnbert.com/2011/07/how-to-unit-testing-in-django-with-mocking-and-patching/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

