libstemmer

  • class SnowballStemmerException: object.Exception;

  • struct SnowballStemmer;

    Encapsulates a stemmer, providing a safe interface to it.

    This struct is non-copyable. If this makes you unhappy, allocate it with new or safeRefCounted.

    • enum Encoding: string;

      • utf8
        iso8859_1
        iso8859_2
        koi8r

    • static pure nothrow @nogc @trusted immutable(string)[] algorithms();

      Get a list of supported stemming algorithms (i.e., languages).

      Only the canonical name of each algorithm is returned: "english\0" is there, but "en\0" is not. See modules.txt to get an impression of what this list may look like.
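
      A minimal sketch of querying this list. The import name libstemmer is an assumption taken from the title above; adjust it to however the binding is packaged in your project.

      ```d
      import libstemmer; // assumed module name
      import std.stdio : writeln;

      void main()
      {
          // Walk the canonical algorithm names reported by the library.
          foreach (name; SnowballStemmer.algorithms())
              writeln(name);
      }
      ```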

    • pure @safe this(scope const(char)[] algorithm, scope Encoding encoding = Encoding.utf8) scope;

      Construct a stemmer with the specified algorithm and input encoding.

      algorithm and encoding are case-sensitive and must be zero-terminated (e.g., you have to pass "en\0", not "en"); that is asserted.

      This constructor is unavailable in betterC mode.

      Throws

      SnowballStemmerException on an unknown algorithm or encoding or an unsupported combination of those.
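
      The sketch below illustrates both outcomes. It assumes the module is imported as libstemmer, that Encoding is reachable as SnowballStemmer.Encoding, and that the underlying library is a stock build in which Russian supports KOI8-R; "klingon" is a made-up algorithm name chosen to trigger the exception.

      ```d
      import libstemmer; // assumed module name
      import std.stdio : writeln;

      void main()
      {
          // A supported combination in a stock build: Russian text in KOI8-R.
          auto ru = SnowballStemmer("russian\0", SnowballStemmer.Encoding.koi8r);

          try
          {
              // "klingon" is a made-up name, so construction should throw.
              auto bad = SnowballStemmer("klingon\0");
          }
          catch (SnowballStemmerException e)
          {
              writeln("could not create stemmer: ", e.msg);
          }
      }
      ```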

    • pure nothrow @nogc @trusted _Bool reset(scope const(char)[] algorithm, scope Encoding encoding = Encoding.utf8) scope;

      Try to change the algorithm and encoding used by this stemmer.

      algorithm and encoding are case-sensitive and must be zero-terminated (e.g., you have to pass "en\0", not "en"); that is asserted.

      The return type of this method implicitly converts to bool. If algorithm or encoding is unknown, or their combination is unsupported, then false is returned and no changes are made.
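
      A short sketch of checking the result, assuming the module is imported as libstemmer and that the underlying library includes the German stemmer; "klingon" is a made-up algorithm name.

      ```d
      import libstemmer; // assumed module name

      void main()
      {
          auto st = SnowballStemmer("en\0");

          // The result converts implicitly to bool.
          const bool ok = st.reset("klingon\0"); // made-up algorithm name
          assert(!ok);

          // The failed reset made no changes, so the stemmer still stems English.
          assert(st.stemUtf8("minify") == "minifi");

          // A successful switch to another algorithm.
          const bool switched = st.reset("german\0");
          assert(switched);
      }
      ```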

    • pure nothrow @nogc @safe bool isNull() const scope;

    • pure nothrow @nogc @system this(sb_stemmer* handle) scope;

      Acquire ownership over a low-level stemmer.

      The handle will be deleted automatically, hence @system.

    • pure nothrow @nogc @system inout(sb_stemmer)* handle() inout scope;

      Get the low-level stemmer.

      Manipulating it directly may interfere with SnowballStemmer, hence @system.

    • pure nothrow @nogc @trusted sb_stemmer* release() scope;

      Extract the low-level stemmer.

      From now on, you are responsible for deleting it.
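
      One way to use release together with the handle-taking constructor is to move a stemmer from one wrapper to another. This is only a sketch: the import name libstemmer is an assumption, and it relies on the second wrapper deleting the handle when it goes out of scope.

      ```d
      import libstemmer; // assumed module name

      void main()
      {
          auto first = SnowballStemmer("en\0");

          // Take the raw handle out; `first` no longer owns it.
          auto raw = first.release();

          // Hand the handle to a new wrapper, which takes over ownership.
          // That constructor is @system, so this call cannot appear in @safe code.
          auto second = SnowballStemmer(raw);

          assert(second.stemUtf8("minify") == "minifi");
      }
      ```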

  • @safe auto stemUtf8(alias callback)(ref scope SnowballStemmer st, scope const(char)[] word);
    @safe auto stem(alias callback)(ref scope SnowballStemmer st, scope const(ubyte)[] word);

    Determine the stem of the given word.

    The stem is passed to callback, which must not let it escape. (If you compile with -dip1000, the compiler will enforce that.) Also, callback has to be @safe or @trusted. Whatever it returns will be passed back to the caller.

    During callback invocation, you cannot stem another word with the same stemmer. (Doing so will result in an assertion failure.)

    stemUtf8 does not actually require the stemmer to be created with Encoding.utf8; it is merely a convenience function that inserts char[] <-> ubyte[] casts. It can be used interchangeably with stem, but there is a convention in the D community that char[] contains UTF-8 and ubyte[] holds arbitrary binary data.

    Note that these are not member functions (to avoid deprecations about dual context). Thanks to UFCS, most of the time there is no difference.

    Examples

    1. auto st = SnowballStemmer("en\0");
      st.stemUtf8!((stem) {
          assert(stem == "minifi");
      })("minify");
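
    A further sketch of how the callback's return value travels back to the caller, and of the ubyte[] variant. The import name libstemmer is an assumption; the expected stem "minifi" is taken from the example above.

    ```d
    import libstemmer; // assumed module name
    import std.string : representation;

    void main()
    {
        auto st = SnowballStemmer("en\0");

        // The callback's return value is passed back to the caller.
        // Returning the length does not let the stem itself escape.
        const len = st.stemUtf8!((s) => s.length)("minify");
        assert(len == 6); // "minifi" has six characters

        // Keeping the stem requires a copy made inside the callback.
        string copy = st.stemUtf8!((s) => s.idup)("minify");
        assert(copy == "minifi");

        // The ubyte[] variant does the same work on raw bytes.
        const same = st.stem!((s) => s.length)("minify".representation);
        assert(same == 6);
    }
    ```

    Copying with idup allocates from the GC heap, which is essentially what the convenience overload does; the callback form is most useful when the stem can be consumed in place.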
      

  • pure nothrow @safe string stemUtf8(ref scope SnowballStemmer st, scope const(char)[] word);
    pure nothrow @safe immutable(ubyte)[] stem(ref scope SnowballStemmer st, scope const(ubyte)[] word);

    Determine the stem of the given word; allocate from the GC heap.

    Only provided for convenience. When possible, you are encouraged to use the other overload, which does not allocate; please refer to it for detailed documentation.

    Examples

    1. auto st = SnowballStemmer("en\0");
      assert(st.stemUtf8("minify") == "minifi");
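
    Because each call returns freshly allocated memory, the results can be stored without further copying. A small sketch, assuming the module is imported as libstemmer; the word list is arbitrary:

    ```d
    import libstemmer; // assumed module name
    import std.stdio : writeln;

    void main()
    {
        auto st = SnowballStemmer("en\0");

        // Collect the stems of several words using one stemmer.
        string[] stems;
        foreach (word; ["minify", "minified", "minifies"])
            stems ~= st.stemUtf8(word);

        writeln(stems);
    }
    ```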