Sophie: xml-commons-apis-manual-1.4.01-18.mga5 noarch

xml-commons-apis-manual-1.4.01-18.mga5.noarch.rpm

<?xml version="1.0" encoding='UTF-8'?>
<!-- $Id: i18nfunctions.xml 225913 2001-06-01 11:15:37Z dims $ -->
<!--
 *************************************************************************
 * BEGINNING OF DOM I18N                                                 * 
 *************************************************************************
-->
<div1 id="i18n"> 
  <head>Accessing code point boundaries</head> 
  <orglist> 
    <member> 
      <name>Mark Davis</name> 
      <affiliation>IBM</affiliation> 
    </member> 
    <member> 
      <name>Lauren Wood</name> 
      <affiliation>SoftQuad Software Inc.</affiliation> 
    </member> 
  </orglist><?GENERATE-MINI-TOC?>
  <div2 id="i18n-introduction"> 
    <head>Introduction</head>
    <p>
      This appendix is an informative, not a normative, part of the Level 2 DOM
      specification.
    </p>

    <p>
      Characters are represented in Unicode by numbers called <i>code
      points</i> (also called <i>scalar values</i>). These numbers can range
      from 0 up to 1,114,111 = 10FFFF<sub>16</sub> (although some of these values are
      illegal). Each code point can be directly encoded with a 32-bit code unit. 
      This encoding is termed UCS-4 (or UTF-32). 
      The DOM specification, however, uses UTF-16, in which the most frequent 
      characters (which have values less than FFFF<sub>16</sub>) are represented 
      by a single 16-bit code unit, while characters above FFFF<sub>16</sub>
      use a special pair of code units called a <i>surrogate pair</i>. For more information, 
      see <bibref ref="Unicode"/> or the Unicode Web site.
    </p>

    <p>
      While indexing by code points as opposed to code units is not
      common in programs, some specifications such as XPath (and therefore XSLT
      and XPointer) use code point
      indices.  For interfacing with such formats it is recommended
      that the programming language provide string processing methods for
      converting code point indices to code unit indices and back. Some
      languages do not provide these functions natively; for these it is
      recommended that the native <code>String</code> type that is bound to
      <code>DOMString</code> be extended to enable this conversion. An example
      of how such an API might look is supplied below.
    </p>
    <note>
      <p>
	Since these methods are supplied as an illustrative example of the type
	of functionality that is required, the names of the methods,
	exceptions, and interface may differ from those given here.
      </p>
    </note>

  </div2> 
  <div2 id="i18n-methods"> 
    <head>Methods</head> 
    <definitions> 
      <interface id="i18n-methods-StringExtend" name="StringExtend">
	<descr>
	  <p>Extensions to a language's native String class or interface</p>
	</descr>
	<method id="i18n-methods-StringExtend-findOffset16" name="findOffset16">
	  <descr>
	    <p>Returns the UTF-16 offset that corresponds to a UTF-32 offset.
	      Used for random access.</p>
		<note>
		  <p>
		    You can always round-trip from a UTF-32 offset to a UTF-16
		    offset and back. You can round-trip from a UTF-16 offset to
		    a UTF-32 offset and back if and only if the offset16 is not
		    in the middle of a surrogate pair. Unmatched surrogates
		    count as a single UTF-16 value.
		  </p>
		</note>
	  </descr>
	  <parameters>
	    <param name="offset32" type="int" attr="in">
	      <descr> 
		<p>
		  UTF-32 offset. 
		</p>
	      </descr>
	    </param>
	  </parameters>
	  <returns type="int">
	    <descr>
	      <p>UTF-16 offset</p>
	    </descr>
	  </returns>
	  <raises>
	    <exception name="StringIndexOutOfBoundsException">
	      <descr>
		<p>
		  if <code>offset32</code> is out of bounds.
		</p>
	      </descr>
	    </exception>
	  </raises>
	</method>
	<method id="i18n-methods-StringExtend-findOffset32" name="findOffset32">
	  <descr>
	    <p>
	      Returns the UTF-32 offset corresponding to a UTF-16 offset. Used
	      for random access. To find the UTF-32 length of a string, use:
	      <eg>len32 = findOffset32(source, source.length());</eg>
	    </p>
	    <note>
	      <p>
		If the UTF-16 offset is into the middle of a surrogate pair,
		then the UTF-32 offset of the <emph>end</emph> of the pair is
		returned; that is, the index of the char after the end of the
		pair. You can always round-trip from a UTF-32 offset to a UTF-16
		offset and back. You can round-trip from a UTF-16 offset to a
		UTF-32 offset and back if and only if the offset16 is not in
		the middle of a surrogate pair. Unmatched surrogates count as a
		single UTF-16 value.
	      </p>
	    </note>
	  </descr>
	  <parameters>
	    <param attr="in" type="int" name="offset16">
	      <descr>
		<p>UTF-16 offset</p>
	      </descr>
	    </param>
	  </parameters>
	  <returns type="int">
	    <descr>
	      <p>UTF-32 offset</p>
	    </descr>
	  </returns>
	  <raises>
	    <exception name="StringIndexOutOfBoundsException">
	      <descr>
		<p>if offset16 is out of bounds.</p>
	      </descr>
	    </exception>
	  </raises>
	</method>
      </interface>
    </definitions> 
  </div2>
</div1>