  <div class="section" id="metaprogramming">
<span id="metaprog"></span><h1>Metaprogramming<a class="headerlink" href="#metaprogramming" title="Permalink to this headline">¶</a></h1>
In 'conventional' programming, one writes a program that accomplishes a task. In *metaprogramming*, one writes a program *that writes a program* that accomplishes a task.

That sounds pretty complicated, so first of all we'll look at why it may be a good idea nonetheless.
<div class="section" id="why-metaprogramming">
<h2>Why Metaprogramming?<a class="headerlink" href="#why-metaprogramming" title="Permalink to this headline">¶</a></h2>
<div class="section" id="automated-tuning">
<h3>Automated Tuning<a class="headerlink" href="#automated-tuning" title="Permalink to this headline">¶</a></h3>
<p>A sizable part of a CUDA programmer&#8217;s time is typically spent tuning code.
This tuning answers questions like:</p>
- What's the optimal number of threads per block?
- How much data should I work on at once?
- What data should be loaded into shared memory, and how big should the corresponding blocks be?
If you are lucky, you'll be able to find a pattern in the execution time of your code and come up with a heuristic that will allow you to reliably pick the fastest version. Unfortunately, this heuristic may become unreliable or even fail entirely with new hardware generations. The solution to this problem that PyCUDA tries to promote is:

> Forget heuristics. Benchmark at run time and use whatever works fastest.

This is an important advantage of PyCUDA over the CUDA runtime API: it lets you make these decisions *while your code is running*. A number of prominent computing packages make use of a similar technique, among them ATLAS and FFTW. And while those require rather complicated optimization driver routines, you can drive PyCUDA from the comfort of Python.
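
As a concrete illustration, here is a minimal sketch of run-time tuning; the kernel, candidate sizes, and timing loop are illustrative assumptions, not PyCUDA's own tuning machinery:

```python
# Compile one kernel variant per candidate block size, time each with
# CUDA events, and keep whichever runs fastest on the hardware at hand.
# (A warm-up launch before timing would make this more robust.)
import numpy
import pycuda.autoinit  # note: initializes the CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

n = 1 << 20
a = numpy.random.randn(n).astype(numpy.float32)
a_gpu = drv.mem_alloc(a.nbytes)
drv.memcpy_htod(a_gpu, a)
out_gpu = drv.mem_alloc(a.nbytes)

best = None
for block_size in [64, 128, 256, 512]:
    mod = SourceModule("""
        __global__ void double_it(float *tgt, float *src)
        {
          int i = threadIdx.x + %d * blockIdx.x;
          tgt[i] = 2 * src[i];
        }""" % block_size)
    kernel = mod.get_function("double_it")

    start, stop = drv.Event(), drv.Event()
    start.record()
    kernel(out_gpu, a_gpu,
           block=(block_size, 1, 1), grid=(n // block_size, 1))
    stop.record()
    stop.synchronize()

    ms = start.time_till(stop)
    if best is None or ms < best[0]:
        best = (ms, block_size)

print("fastest block size: %d (%g ms)" % (best[1], best[0]))
```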
### Data Types
Your code may have to deal with different data types at run time. It may, for example, have to work on both single and double precision floating point numbers. You could just precompile versions for both, but why? Just generate whatever code is needed right *when* it is needed.
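
A minimal sketch of this, assuming the inputs are only ever float32 or float64 (double precision additionally requires hardware support, compute capability 1.3 or later):

```python
# Generate the same kernel for whichever precision the data arrives in,
# compiling each variant only on first use.
import numpy
import pycuda.autoinit  # note: initializes the CUDA context
from pycuda.compiler import SourceModule

_kernel_cache = {}

def get_add_kernel(dtype):
    dtype = numpy.dtype(dtype)
    if dtype not in _kernel_cache:
        type_name = {numpy.float32: "float",
                     numpy.float64: "double"}[dtype.type]
        mod = SourceModule("""
            __global__ void add(%(t)s *tgt, %(t)s *op1, %(t)s *op2)
            {
              int i = threadIdx.x + blockDim.x * blockIdx.x;
              tgt[i] = op1[i] + op2[i];
            }""" % {"t": type_name})
        _kernel_cache[dtype] = mod.get_function("add")
    return _kernel_cache[dtype]
```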
### Specialize Code for the Given Problem
If you are writing a library, then your users will ask your library to perform a number of tasks. Imagine how liberating it would be if you could generate code purpose-built for the problem you're being asked to solve, instead of having to keep code unnecessarily generic and thereby slow. PyCUDA makes this a reality.
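
As a sketch of what that can look like (illustrative names, not PyCUDA's built-in machinery, though its `pycuda.elementwise` module is built on the same idea): a library routine that compiles a kernel for precisely the elementwise expression it was asked for:

```python
# Build a kernel for exactly the expression the caller requests, instead
# of shipping one generic, branch-heavy kernel.
import pycuda.autoinit  # note: initializes the CUDA context
from pycuda.compiler import SourceModule

def make_elementwise(expr):
    """Compile 'tgt[i] = <expr>', where expr may use op1[i] and op2[i]."""
    mod = SourceModule("""
        __global__ void apply(float *tgt, float *op1, float *op2)
        {
          int i = threadIdx.x + blockDim.x * blockIdx.x;
          tgt[i] = %s;
        }""" % expr)
    return mod.get_function("apply")

# The user asked for a fused multiply-add; that is all this kernel does.
fma = make_elementwise("op1[i] * op2[i] + tgt[i]")
```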
### Constants are Faster than Variables
If your problem sizes vary from run to run, but you perform a large number of kernel invocations on data of identical size, you may want to consider compiling the data size into your code as a constant. This can have significant performance benefits, resulting mainly from decreased fetch times and reduced register pressure. In particular, multiplications by constants are carried out much more efficiently than general variable-variable multiplications.
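
As a sketch of the difference (illustrative kernels, assuming the size `n` stays fixed across many launches):

```python
# The same kernel with the size passed as an argument versus baked in
# as a compile-time constant.
import pycuda.autoinit  # note: initializes the CUDA context
from pycuda.compiler import SourceModule

n = 1 << 20

# Variable size: n travels as a kernel argument and occupies a register.
mod_var = SourceModule("""
    __global__ void scale(float *x, float a, int n)
    {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      if (i < n) x[i] *= a;
    }""")

# Constant size: the bound is an immediate the compiler can fold, and
# one argument (and the register holding it) disappears.
mod_const = SourceModule("""
    __global__ void scale(float *x, float a)
    {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      if (i < %d) x[i] *= a;
    }""" % n)
```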
### Loop Unrolling
The CUDA programming guide says great things about **nvcc** and how it will unroll loops for you. As of version 2.1, that's simply not true; in my experience, `#pragma unroll` is a no-op. With metaprogramming, you can dynamically unroll your loops to the needed size in Python.
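
Even without a templating engine, plain string formatting is enough for this (a minimal sketch; the names mirror the template example in the next section):

```python
# Emit block_size copies of the loop body from Python, with each offset
# already folded in as a constant.
thread_block_size = 256  # threads per block (illustrative)
block_size = 4           # unroll factor: elements handled per thread

unrolled = "\n".join(
    "  tgt[idx + %(o)d] = op1[idx + %(o)d] + op2[idx + %(o)d];"
    % {"o": i * thread_block_size}
    for i in range(block_size))

source = """
__global__ void add(float *tgt, float *op1, float *op2)
{
  int idx = threadIdx.x + %d * blockIdx.x;
%s
}""" % (thread_block_size * block_size, unrolled)

print(source)  # hand to pycuda.compiler.SourceModule to compile
```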
## Metaprogramming using a Templating Engine
If your metaprogramming needs are rather simple, perhaps the easiest way to generate code at run time is through a templating engine. Many templating engines exist for Python; two of the most prominent are [Jinja 2](http://jinja.pocoo.org/) and [Cheetah](http://www.cheetahtemplate.org/).

The following is a simple metaprogram that performs vector addition with a configurable block size. It illustrates the templating-based metaprogramming technique:
<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">jinja2</span> <span class="kn">import</span> <span class="n">Template</span>

<span class="n">tpl</span> <span class="o">=</span> <span class="n">Template</span><span class="p">(</span><span class="s">&quot;&quot;&quot;</span>
<span class="s">    __global__ void add(</span>
<span class="s">            {{ type_name }} *tgt,</span>
<span class="s">            {{ type_name }} *op1,</span>
<span class="s">            {{ type_name }} *op2)</span>
<span class="s">    {</span>
<span class="s">      int idx = threadIdx.x +</span>
<span class="s">        {{ thread_block_size }} * {{block_size}}</span>
<span class="s">        * blockIdx.x;</span>

<span class="s">      {</span><span class="si">% f</span><span class="s">or i in range(block_size) %}</span>
<span class="s">          {</span><span class="si">% s</span><span class="s">et offset = i*thread_block_size %}</span>
<span class="s">          tgt[idx + {{ offset }}] =</span>
<span class="s">            op1[idx + {{ offset }}]</span>
<span class="s">            + op2[idx + {{ offset }}];</span>
<span class="s">      {</span><span class="si">% e</span><span class="s">ndfor %}</span>
<span class="s">    }&quot;&quot;&quot;</span><span class="p">)</span>

<span class="n">rendered_tpl</span> <span class="o">=</span> <span class="n">tpl</span><span class="o">.</span><span class="n">render</span><span class="p">(</span>
    <span class="n">type_name</span><span class="o">=</span><span class="s">&quot;float&quot;</span><span class="p">,</span> <span class="n">block_size</span><span class="o">=</span><span class="n">block_size</span><span class="p">,</span>
    <span class="n">thread_block_size</span><span class="o">=</span><span class="n">thread_block_size</span><span class="p">)</span>

<span class="n">mod</span> <span class="o">=</span> <span class="n">SourceModule</span><span class="p">(</span><span class="n">rendered_tpl</span><span class="p">)</span>
</pre></div>
This snippet in a working context can be found in `examples/demo_meta_template.py`.
You can also find an example of matrix multiplication optimization using template metaprogramming with Cheetah in `demo_meta_matrixmul_cheetah.py` and `demo_meta_matrixmul_cheetah.template.cu`.
## Metaprogramming using `codepy`
For more complicated metaprograms, it may be desirable to have more programmatic control over the assembly of the source code than a templating engine can provide. The [codepy](http://documen.tician.de/codepy/index.html) package provides a means of generating CUDA source code from a Python data structure.

The following example demonstrates the use of `codepy` for metaprogramming. It accomplishes exactly the same thing as the above program:
<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">codepy.cgen</span> <span class="kn">import</span> <span class="n">FunctionBody</span><span class="p">,</span> \
        <span class="n">FunctionDeclaration</span><span class="p">,</span> <span class="n">Typedef</span><span class="p">,</span> <span class="n">POD</span><span class="p">,</span> <span class="n">Value</span><span class="p">,</span> \
        <span class="n">Pointer</span><span class="p">,</span> <span class="n">Module</span><span class="p">,</span> <span class="n">Block</span><span class="p">,</span> <span class="n">Initializer</span><span class="p">,</span> <span class="n">Assign</span>
<span class="kn">from</span> <span class="nn">codepy.cgen.cuda</span> <span class="kn">import</span> <span class="n">CudaGlobal</span>

<span class="n">mod</span> <span class="o">=</span> <span class="n">Module</span><span class="p">([</span>
    <span class="n">FunctionBody</span><span class="p">(</span>
        <span class="n">CudaGlobal</span><span class="p">(</span><span class="n">FunctionDeclaration</span><span class="p">(</span>
            <span class="n">Value</span><span class="p">(</span><span class="s">&quot;void&quot;</span><span class="p">,</span> <span class="s">&quot;add&quot;</span><span class="p">),</span>
            <span class="n">arg_decls</span><span class="o">=</span><span class="p">[</span><span class="n">Pointer</span><span class="p">(</span><span class="n">POD</span><span class="p">(</span><span class="n">dtype</span><span class="p">,</span> <span class="n">name</span><span class="p">))</span>
                <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s">&quot;tgt&quot;</span><span class="p">,</span> <span class="s">&quot;op1&quot;</span><span class="p">,</span> <span class="s">&quot;op2&quot;</span><span class="p">]])),</span>
        <span class="n">Block</span><span class="p">([</span>
            <span class="n">Initializer</span><span class="p">(</span>
                <span class="n">POD</span><span class="p">(</span><span class="n">numpy</span><span class="o">.</span><span class="n">int32</span><span class="p">,</span> <span class="s">&quot;idx&quot;</span><span class="p">),</span>
                <span class="s">&quot;threadIdx.x + </span><span class="si">%d</span><span class="s">*blockIdx.x&quot;</span>
                <span class="o">%</span> <span class="p">(</span><span class="n">thread_block_size</span><span class="o">*</span><span class="n">block_size</span><span class="p">)),</span>
            <span class="p">]</span><span class="o">+</span><span class="p">[</span>
            <span class="n">Assign</span><span class="p">(</span>
                <span class="s">&quot;tgt[idx+</span><span class="si">%d</span><span class="s">]&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">o</span><span class="o">*</span><span class="n">thread_block_size</span><span class="p">),</span>
                <span class="s">&quot;op1[idx+</span><span class="si">%d</span><span class="s">] + op2[idx+</span><span class="si">%d</span><span class="s">]&quot;</span> <span class="o">%</span> <span class="p">(</span>
                    <span class="n">o</span><span class="o">*</span><span class="n">thread_block_size</span><span class="p">,</span>
                    <span class="n">o</span><span class="o">*</span><span class="n">thread_block_size</span><span class="p">))</span>
            <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">block_size</span><span class="p">)]))])</span>

<span class="n">mod</span> <span class="o">=</span> <span class="n">SourceModule</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span>
</pre></div>
This snippet in a working context can be found in `examples/demo_meta_codepy.py`.