Yoichi Kawasaki's Web

PatternReplaceCharFilter in Lucene API

February 11, 2014

Overview

  • Abstract: Code Sample for PatternReplaceCharFilter in Lucene core API
  • Language(s): Java
  • Prerequisites
    • Lucene-4 API or greater (note: Lucene-4.6 is used for the code sample)
    • Apache Lucene runs on Java 6 or greater

Sample Code

PatternReplaceCharFilter is a subclass of CharFilter that lets you apply a regular expression to the input string before the tokenizer processes it. The regular expression is defined by the pattern parameter, and the replacement string by the replacement parameter. In the following example, a PatternReplaceCharFilter instance matches email addresses in the input string and replaces them with the empty string "", and the tokenizer and token filters then follow.

package samples.lucene.analysis;

import java.util.regex.Pattern;
import java.io.StringReader;
import java.io.IOException;

import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public final class PatternReplaceCharFilterDemo {

    private static void displayTokens(TokenStream ts) throws IOException {
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            String token = termAttr.toString();
            System.out.print("[" + token + "] ");
        }
        System.out.println();
        ts.end();
        ts.close();
    }

    public static void main(String[] args) throws Exception {
        String testinput =
        "ID1 alias@ ID2 yoichi@foo.com ID3 090-1234-5678 ID4 kawasaki@bar.com";
        Version ver=Version.LUCENE_46;

        Pattern pattern =
            Pattern.compile("[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]+",
                            Pattern.CASE_INSENSITIVE);
        String replacement ="";

        CharFilter cs =
            new PatternReplaceCharFilter(
                    pattern, replacement, new StringReader(testinput));
        WhitespaceTokenizer tokenizer =
             new WhitespaceTokenizer(ver, cs);
        TokenStream ts = new LowerCaseFilter(ver, tokenizer);
        ts = new StopFilter(ver, ts,
                    StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        displayTokens(ts);
    }
}

Running the code

INPUT

ID1 alias@ ID2 yoichi@foo.com ID3 090-1234-5678 ID4 kawasaki@bar.com

OUTPUT without PatternReplaceCharFilter

[id1] [alias@] [id2] [yoichi@foo.com] [id3] [090-1234-5678] [id4] [kawasaki@bar.com]

OUTPUT with PatternReplaceCharFilter

[id1] [alias@] [id2] [id3] [090-1234-5678] [id4]

Alternative Ways

PatternReplaceCharFilterFactory is a factory for PatternReplaceCharFilter. In the following sample, a PatternReplaceCharFilter instance created by the factory performs the character manipulation. The factory takes a map in which the regular expression and the replacement string are set as the values for the “pattern” and “replacement” keys respectively.
[note] There is no parameter to control the case sensitivity of the regular expression, which means the pattern provided here is always matched case-sensitively. Therefore both upper- and lower-case characters should be included in the regular expression if you want both to match.
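If you do need case-insensitive matching with the factory, one possible workaround (an assumption on my part, based on the fact that the pattern string is ultimately handled by Java's regex engine) is to embed the inline (?i) flag in the pattern itself. This standalone snippet, with no Lucene dependency, shows that the inline flag behaves the same as Pattern.CASE_INSENSITIVE:

```java
import java.util.regex.Pattern;

public class InlineFlagDemo {
    public static void main(String[] args) {
        // (?i) switches on case-insensitive matching from inside the pattern
        // string, so the same string could also be passed as the "pattern"
        // value of the factory's args map, where no flags argument exists.
        String email = "(?i)[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]+";
        System.out.println(Pattern.matches(email, "YOICHI@FOO.COM")); // true
        System.out.println(Pattern.matches(email, "yoichi@foo.com")); // true
    }
}
```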

String testinput =
"ID1 alias@ ID2 yoichi@foo.com ID3 090-1234-5678 ID4 kawasaki@bar.com";
Version ver=Version.LUCENE_46;

Map<String,String> filterargs=new HashMap<String, String>();
filterargs.put("luceneMatchVersion", ver.toString());
filterargs.put("pattern", "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]+");
filterargs.put("replacement", "");

PatternReplaceCharFilterFactory factory =
        new PatternReplaceCharFilterFactory(filterargs);
Reader cs = factory.create(new StringReader(testinput) );
WhitespaceTokenizer tokenizer =
        new WhitespaceTokenizer(ver, cs);
TokenStream ts = new LowerCaseFilter(ver, tokenizer);
ts = new StopFilter(ver, ts,
        StopAnalyzer.ENGLISH_STOP_WORDS_SET);

In Solr, the Map of arguments (“filterargs”) specified in the code above can be defined in Solr’s schema.xml like the following:

<fieldType name="text_ptnreplace" class="solr.TextField"
                                    positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
           pattern="[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]+"
           replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>


HTMLStripCharFilter in Lucene API

Overview

  • Abstract: Code Sample for HTMLStripCharFilter in Lucene core API
  • Language(s): Java
  • Prerequisites
    • Lucene-4 API or greater (note: Lucene-4.6 is used for the code sample)
    • Apache Lucene runs on Java 6 or greater

Sample Code

HTMLStripCharFilter is a subclass of CharFilter that strips HTML elements from the input text before the tokenizer processes it. By default, HTMLStripCharFilter strips all predefined HTML elements (see here for all the HTML stripping features of this filter), but you can provide a set of tags that will not be stripped. In the following example, an HTMLStripCharFilter instance strips all HTML elements except the title and h1 tags set in escapedTags from the input text, and the tokenizer and token filter processes then follow.

HTMLStripCharFilterDemo.java

package samples.lucene.analysis;

import java.util.Set;
import java.util.HashSet;
import java.util.Map;
import java.util.HashMap;
import java.io.Reader;
import java.io.StringReader;
import java.io.IOException;

import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public final class HTMLStripCharFilterDemo {

    private static void displayTokens(TokenStream ts) throws IOException {
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            String token = termAttr.toString();
            System.out.print("[" + token + "] ");
        }
        System.out.println();
        ts.end();
        ts.close();
    }

    public static void main(String[] args) throws Exception {
        String testinput =
         "<h1>HTMLStripCharFilter</h1> "
         + "<p><em><strong>strips html tags</strong></em></p>";
        Version ver=Version.LUCENE_46;

        Set<String> escapedTags=new HashSet<String>();
        escapedTags.add("title");
        escapedTags.add("h1");

        Reader reader =
            new HTMLStripCharFilter(new StringReader(testinput), escapedTags);
        WhitespaceTokenizer tokenizer =
            new WhitespaceTokenizer(ver, reader);
        TokenStream ts = new LowerCaseFilter(ver, tokenizer);
        ts = new StopFilter(ver, ts,
                    StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        displayTokens(ts);
    }
}

Running the code

INPUT

<h1>HTMLStripCharFilter</h1> <p><em><strong>strips html tags</strong></em></p>

OUTPUT without HTMLStripCharFilter

[<h1>htmlstripcharfilter</h1>] [<p><em><strong>strips] [html] [tags</strong></em></p>]

OUTPUT with HTMLStripCharFilter

[<h1>htmlstripcharfilter</h1>] [strips] [html] [tags]

Alternative Ways

HTMLStripCharFilterFactory is a factory for HTMLStripCharFilter. In the following sample, an HTMLStripCharFilter instance created by the factory performs the character manipulation. The factory takes a map in which the set of escaped tags is provided as the value for the “escapedTags” key.

String testinput =
   "<h1>HTMLStripCharFilter</h1> "
   + "<p><em><strong>strips html tags</strong></em></p>";
Version ver=Version.LUCENE_46;

Map<String,String> filterargs=new HashMap<String, String>();
filterargs.put("luceneMatchVersion", ver.toString());
filterargs.put("escapedTags","title,h1");

HTMLStripCharFilterFactory factory =
      new HTMLStripCharFilterFactory(filterargs);
Reader cs = factory.create(new StringReader(testinput));
WhitespaceTokenizer tokenizer =
      new WhitespaceTokenizer(ver, cs);
TokenStream ts = new LowerCaseFilter(ver, tokenizer);
ts = new StopFilter(ver, ts,
      StopAnalyzer.ENGLISH_STOP_WORDS_SET);

In Solr, the Map of arguments (“filterargs”) specified in the code above can be defined in Solr’s schema.xml like the following:

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" escapedTags="title,h1" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>


Changing My Site Domain Name

This is an announcement: as you may know, I’ve changed my site domain name from yk55.com to copious.me. It was just impulsive and for no special reason; I simply felt like dumping the old domain all of a sudden. But I’ll probably keep the old one for a while, maybe until the traffic migration is mostly done.

Here are the DNS zone and redirection configurations that I’ve done for the migration:

DNS Zone Configuration for the new domain

@       NS      ns50.domaincontrol.com.
        NS      ns49.domaincontrol.com.
        A       49.212.213.110
www     CNAME   @
feed    CNAME   ytmyxs.feedproxy.ghs.google.com.

Currently both yk55.com (the old domain) and copious.me (the new domain) point to my Sakura VPS address, 49.212.213.110:

$ dig @ns49.domaincontrol.com copious.me
;; ANSWER SECTION:
copious.me.             600     IN      A       49.212.213.110
$  dig @ns1.dns.ne.jp yk55.com
;; ANSWER SECTION:
yk55.com.               3600    IN      A       49.212.213.110

301 Redirect to the new domain from the old one

<VirtualHost  *>
    ServerName yk55.com
    ...
    RewriteEngine On
    RewriteCond %{REQUEST_URI}  ^/blog
    RewriteRule ^/blog/(.*)    http://copious.me/posts/$1 [R=301,L]
    RewriteCond %{REQUEST_URI}  !^/blog
    RewriteRule ^/(.*)         http://copious.me/$1 [R=301,L]
</VirtualHost>

The point is using a 301 redirect to permanently redirect all pages on my old domain to the new domain. According to the Google Webmaster Tools help:

This (301 redirect) tells search engines and users that your site has permanently moved. We recommend that you move and redirect a section or directory first, and then test to make sure that your redirects are working correctly before moving all your content.


MappingCharFilter in Lucene API

February 7, 2014

Overview

  • Abstract: Code Sample for MappingCharFilter in Lucene API
  • Language(s): Java
  • Prerequisites
    • Lucene-4 API or greater (note: Lucene-4.6 is used for the code sample)
    • Apache Lucene runs on Java 6 or greater

Sample Code

The MappingCharFilter is a subclass of CharFilter that normalizes characters before the tokenizer processes them. It takes a NormalizeCharMap, to which mapping entries are added, and applies the mappings contained in the NormalizeCharMap to the character stream. MappingCharFilterFactory is a factory that creates MappingCharFilter instances; internally it loads the mapping entries into a NormalizeCharMap from a file of character-mapping definitions. In the following example, a MappingCharFilter instance created by MappingCharFilterFactory applies the mappings loaded from mapping.txt to the character stream, and the tokenization and token filter processes then follow.

MappingCharFilterDemo.java

package samples.lucene.analysis;

import java.util.Map;
import java.util.HashMap;
import java.io.Reader;
import java.io.StringReader;
import java.io.IOException;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.charfilter.MappingCharFilterFactory;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilesystemResourceLoader;
import org.apache.lucene.util.Version;

public final class MappingCharFilterDemo {

    private static void displayTokens(TokenStream ts) throws IOException {
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            String token = termAttr.toString();
            System.out.print("[" + token + "] ");
        }
        System.out.println();
        ts.end();
        ts.close();
    }

    public static void main(String[] args) throws Exception {
        String testinput = "ØØ ÅßÇ to ABC ×× C++ to cplusplus ×× 惡 to 悪 ØØ";
        Version ver=Version.LUCENE_46;

        String mapfile="/path/mapping.txt";
        Map<String,String> filterargs=new HashMap<String, String>();
        filterargs.put("luceneMatchVersion", ver.toString());
        filterargs.put("mapping", mapfile);

        MappingCharFilterFactory factory
                = new MappingCharFilterFactory(filterargs);
        factory.inform(new FilesystemResourceLoader());

        Reader cs = factory.create(new StringReader(testinput));
        WhitespaceTokenizer tokenizer =
                new WhitespaceTokenizer(ver, cs);
        TokenStream ts = new LowerCaseFilter(ver, tokenizer);
        ts = new StopFilter(ver, ts,
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        displayTokens(ts);
    }
}

In Solr, the Map of arguments (“filterargs”) specified in the code above can be defined in Solr’s schema.xml like the following:

<fieldType name="text_map" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   </analyzer>
</fieldType>

mapping.txt

## blank lines and lines starting with '#' are comments

## mappings of accented chars to unaccented ones
# Å => A
"\u00C5" => "A"
# ß => B
"\u00DF" => "B"
# Ç => C
"\u00C7" => "C"

## mapping from a UTF-8 string to another UTF-8 string
"c++" => "cplusplus"
"惡" => "悪"

## mapping a special mark to empty (removal)
# "×" => "" (empty)
"\u00D7" => ""
# "Ø" => "" (empty)
"\u00D8" => ""

Please see mapping-FoldToASCII.txt for a sample mapping configuration in Solr. Also see the UTF-8 encoding table and the Unicode character charts for a Unicode reference.


Running the code

INPUT

ØØ ÅßÇ to ABC ×× C++ to cplusplus ×× 惡 to 悪 ØØ

OUTPUT without MappingCharFilter

[øø] [åßç] [abc] [××] [c++] [cplusplus] [××] [惡] [悪] [øø]

OUTPUT with MappingCharFilter

[abc] [abc] [c++] [cplusplus] [悪] [悪]

Alternative Ways

Instead of using MappingCharFilterFactory, you can populate a NormalizeCharMap directly, one entry at a time, like this:

NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
builder.add("\u00C5", "A");
builder.add("\u00DF", "B");
builder.add("\u00C7", "C");
builder.add("c++", "cplusplus");
builder.add("惡", "悪");
builder.add("\u00D7", "");
builder.add("\u00D8", "");
NormalizeCharMap mapping = builder.build();
String testinput = "ØØ ÅßÇ to ABC ×× C++ to cplusplus ×× 惡 to 悪 ØØ";
Version ver=Version.LUCENE_46;
CharFilter cs = new MappingCharFilter(mapping, new StringReader(testinput));
WhitespaceTokenizer tokenizer =
        new WhitespaceTokenizer(ver, cs);
TokenStream ts = new LowerCaseFilter(ver, tokenizer);

Or you can read all the mapping entries and populate the NormalizeCharMap yourself, like the following, but of course it is much easier and less time-consuming to use MappingCharFilterFactory.

static Pattern pattern = Pattern.compile( "\"(.*)\"\\s*=>\\s*\"(.*)\"\\s*$" );
String mappingfile="/path/mapping.txt";
BufferedReader buffreader = new BufferedReader(
                    new InputStreamReader(
                        new FileInputStream(
                            new File(mappingfile)
                        )
                    )
                );

NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
String line;
while((line=buffreader.readLine())!=null){
    Matcher m = pattern.matcher( line );
    if( m.find() ){
        builder.add(  m.group( 1 ),  m.group( 2 ) );
    }
}
buffreader.close();
NormalizeCharMap mapping = builder.build();
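To see what the regular expression in the loop actually captures, here is a standalone check with no Lucene dependency: group(1) and group(2) are the quoted left and right sides of a mapping line. One caveat worth noting (my own observation, not from the Lucene docs): unlike MappingCharFilterFactory, this simple regex does not decode escape sequences, so a line such as "\u00C5" => "A" in the file would register the literal six-character string \u00C5 rather than Å.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MappingLineDemo {
    public static void main(String[] args) {
        // Same pattern as the manual parser above
        Pattern p = Pattern.compile("\"(.*)\"\\s*=>\\s*\"(.*)\"\\s*$");
        Matcher m = p.matcher("\"c++\" => \"cplusplus\"");
        if (m.find()) {
            System.out.println(m.group(1)); // c++
            System.out.println(m.group(2)); // cplusplus
        }
    }
}
```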


SynonymFilter in Lucene API

February 5, 2014

Overview

  • Abstract: Code Sample for SynonymFilter in Lucene API
  • Language(s): Java
  • Prerequisites
    • Lucene-4 API or greater (note: Lucene-4.6 is used for the code sample)
    • Apache Lucene runs on Java 6 or greater

Sample Code

The use of synonyms may improve search recall, and SynonymFilter makes it easy to handle synonyms during Lucene’s analysis process.
SynonymFilter takes a SynonymMap, a map of synonyms whose keys and values are phrases. SynonymFilterFactory is a factory class that creates SynonymFilter instances; internally it loads the synonym sets into the SynonymMap from a file that contains synonym mapping entries. In the following example, a SynonymFilter instance created by SynonymFilterFactory applies synonym operations to a token stream that has previously been processed by other token filters.

SynonymFilterDemo.java

package samples.lucene.analysis;

import java.util.Map;
import java.util.HashMap;
import java.io.StringReader;
import java.io.IOException;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilesystemResourceLoader;
import org.apache.lucene.util.Version;
import org.apache.lucene.util.CharsRef;

public final class SynonymFilterDemo {

    private static void displayTokens(TokenStream ts) throws IOException {
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            String token = termAttr.toString();
            System.out.print("[" + token + "] ");
        }
        System.out.println();
        ts.end();
        ts.close();
    }

    public static void main(String[] args) throws Exception {

        String testinput = "I am going to buy i-Pod, Xbox 360 and TV today";
        Version ver=Version.LUCENE_46;

        String synfile="/path/synonyms.txt";
        Map<String,String> filterargs=new HashMap<String, String>();
        filterargs.put("luceneMatchVersion", ver.toString());
        filterargs.put("synonyms", synfile);
        filterargs.put("ignoreCase", "false");
        filterargs.put("format", "solr");
        filterargs.put("expand", "true");

        SynonymFilterFactory factory  = new SynonymFilterFactory(filterargs);
        factory.inform(new FilesystemResourceLoader());

        StandardTokenizer tokenizer =
            new StandardTokenizer(ver, new StringReader(testinput));
        TokenStream ts = new LowerCaseFilter(ver,tokenizer);
        ts = new StopFilter(ver, ts,
                    StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        ts = factory.create(ts);
        displayTokens(ts);
    }
}

In Solr, the Map of arguments (“filterargs”) specified in the code above can be defined in Solr’s schema.xml like the following:

 <fieldType name="text_synonym" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            format="solr" ignoreCase="false" expand="true"
            tokenizerFactory="solr.StandardTokenizerFactory"
            [optional tokenizer factory parameters]/>
   </analyzer>
 </fieldType>

synonyms.txt
Please check the Lucene API doc for the Solr synonyms format.

# blank lines and lines starting with '#' are comments
# explicit mappings match and replace
foo => bar
buy => purchase
i Pod,i-Pod => iPod
Xbox one,Xbox live,Xbox 360 => Xbox
# synonym groups
Television, Televisions, TV, TVs
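As a rough illustration of the format (a simplified sketch, not the actual Lucene parser): an explicit-mapping line splits on => into comma-separated input phrases and a replacement, while a line without => defines a group of mutually equivalent terms:

```java
import java.util.Arrays;
import java.util.List;

public class SynonymLineDemo {
    public static void main(String[] args) {
        String line = "Xbox one,Xbox live,Xbox 360 => Xbox";
        // Left side: the phrases to match; right side: what they map to
        String[] sides = line.split("=>");
        List<String> inputs = Arrays.asList(sides[0].trim().split("\\s*,\\s*"));
        String output = sides[1].trim();
        System.out.println(inputs); // [Xbox one, Xbox live, Xbox 360]
        System.out.println(output); // Xbox
    }
}
```

Note that multi-word phrases such as "Xbox 360" are allowed on either side; the real parser tokenizes each phrase with the configured analyzer.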

Running the code

INPUT

I am going to buy i-Pod, Xbox 360 and TV today

OUTPUT without SynonymFilter

 [i] [am] [going] [buy] [i] [pod] [xbox] [360] [tv] [today]

OUTPUT with SynonymFilter

 [i] [am] [going] [purchase] [ipod] [xbox] [television] [televisions] [tv] [tvs] [today]

Alternative Ways

Instead of using SynonymFilterFactory, you can add synonym entries to a SynonymMap directly, one at a time, like this:

// buy => purchase
String base1 = "buy";
String synonym1 = "purchase";
// i-Pod => ipod
String base2 = "i-Pod";
String synonym2 = "iPod";
SynonymMap.Builder sb = new SynonymMap.Builder(true);
sb.add(new CharsRef(base1), new CharsRef(synonym1), true);
sb.add(new CharsRef(base2), new CharsRef(synonym2), true);
SynonymMap synonyms =  sb.build();
String testinput = "I am going to buy i-Pod, Xbox 360 and TV today";
StandardTokenizer tokenizer =
                new StandardTokenizer(Version.LUCENE_46,
                                new StringReader(testinput));
TokenStream ts = new SynonymFilter(tokenizer, synonyms, false);

In addition, you can use SolrSynonymParser, which can load the synonym sets from synonyms.txt into a SynonymMap:

String synfile="/path/synonyms.txt";
BufferedReader buffreader = new BufferedReader(
                    new InputStreamReader(
                        new FileInputStream(
                            new File(synfile)
                        )
                    )
                );

SolrSynonymParser parser = new SolrSynonymParser(true, true,
                          new SimpleAnalyzer(Version.LUCENE_46));
parser.parse(buffreader);
SynonymMap synonyms = parser.build();
String testinput = "I am going to buy i-Pod, Xbox 360 and TV today";
StandardTokenizer tokenizer =
                new StandardTokenizer(Version.LUCENE_46,
                                new StringReader(testinput));
TokenStream ts = new SynonymFilter(tokenizer, synonyms, false);
