Index: 3rdParty_sources/lucene/org/apache/lucene/LucenePackage.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/LucenePackage.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/LucenePackage.java 17 Aug 2012 14:55:14 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/LucenePackage.java 16 Dec 2014 11:32:21 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/Analyzer.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/Analyzer.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/Analyzer.java 17 Aug 2012 14:55:08 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/Analyzer.java 16 Dec 2014 11:31:58 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,65 +17,458 @@
* limitations under the License.
*/
-import java.io.Reader;
+import org.apache.lucene.store.AlreadyClosedException;
+import org.apache.lucene.util.CloseableThreadLocal;
+import org.apache.lucene.util.Version;
+
+import java.io.Closeable;
import java.io.IOException;
+import java.io.Reader;
+import java.util.HashMap;
+import java.util.Map;
-/** An Analyzer builds TokenStreams, which analyze text. It thus represents a
- * policy for extracting index terms from text.
- *
- * Typical implementations first build a Tokenizer, which breaks the stream of
- * characters from the Reader into raw Tokens. One or more TokenFilters may
- * then be applied to the output of the Tokenizer.
+/**
+ * An Analyzer builds TokenStreams, which analyze text. It thus represents a
+ * policy for extracting index terms from text.
+ *
+ * In order to define what analysis is done, subclasses must define their
+ * {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String, Reader)}.
+ * The components are then reused in each call to {@link #tokenStream(String, Reader)}.
+ *
+ * For some concrete implementations bundled with Lucene, look in the analysis modules:
+ *
+ */
+public abstract class Analyzer implements Closeable {
+
+  private final ReuseStrategy reuseStrategy;
+  private Version version = Version.LUCENE_CURRENT;
+
+  /* Non-final because close() nulls it; package-private so ReuseStrategy's
+     final helper methods can reach it. */
+  CloseableThreadLocal<Object> storedValue = new CloseableThreadLocal<>();
+
+ /**
+ * Create a new Analyzer, reusing the same set of components per-thread
+ * across calls to {@link #tokenStream(String, Reader)}.
*/
- public abstract TokenStream tokenStream(String fieldName, Reader reader);
+ public Analyzer() {
+ this(GLOBAL_REUSE_STRATEGY);
+ }
- /** Creates a TokenStream that is allowed to be re-used
- * from the previous time that the same thread called
- * this method. Callers that do not need to use more
- * than one TokenStream at the same time from this
- * analyzer should use this method for better
- * performance.
+ /**
+ * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
+ *
+ * NOTE: if you just want to reuse on a per-field basis, it's easier to
+ * use a subclass of {@link AnalyzerWrapper} such as
+ * {@link PerFieldAnalyzerWrapper} instead.
*/
- public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
- return tokenStream(fieldName, reader);
+ public Analyzer(ReuseStrategy reuseStrategy) {
+ this.reuseStrategy = reuseStrategy;
}
- private ThreadLocal tokenStreams = new ThreadLocal();
+ /**
+ * Creates a new {@link TokenStreamComponents} instance for this analyzer.
+ *
+ * @param fieldName
+ * the name of the field's content passed to the
+ * {@link TokenStreamComponents} sink as a reader
+ * @param reader
+ * the reader passed to the {@link Tokenizer} constructor
+ * @return the {@link TokenStreamComponents} for this analyzer.
+ */
+ protected abstract TokenStreamComponents createComponents(String fieldName,
+ Reader reader);
- /** Used by Analyzers that implement reusableTokenStream
- * to retrieve previously saved TokenStreams for re-use
- * by the same thread. */
- protected Object getPreviousTokenStream() {
- return tokenStreams.get();
+ /**
+ * Returns a TokenStream suitable for fieldName, tokenizing
+ * the contents of reader.
+ *
+ * This method uses {@link #createComponents(String, Reader)} to obtain an
+ * instance of {@link TokenStreamComponents}. It returns the sink of the
+ * components and stores the components internally. Subsequent calls to this
+ * method will reuse the previously stored components after resetting them
+ * through {@link TokenStreamComponents#setReader(Reader)}.
+ *
+ * NOTE: After calling this method, the consumer must follow the
+ * workflow described in {@link TokenStream} to properly consume its contents.
+ * See the {@link org.apache.lucene.analysis Analysis package documentation} for
+ * some examples demonstrating this.
+ *
+ * NOTE: If your data is available as a {@code String}, use
+ * {@link #tokenStream(String, String)} which reuses a {@code StringReader}-like
+ * instance internally.
+ *
+ * @param fieldName the name of the field the created TokenStream is used for
+ * @param reader the reader the stream's source reads from
+ * @return TokenStream for iterating the analyzed content of reader
+ * @throws AlreadyClosedException if the Analyzer is closed.
+ * @throws IOException if an i/o error occurs.
+ * @see #tokenStream(String, String)
+ */
+ public final TokenStream tokenStream(final String fieldName,
+ final Reader reader) throws IOException {
+ TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
+ final Reader r = initReader(fieldName, reader);
+ if (components == null) {
+ components = createComponents(fieldName, r);
+ reuseStrategy.setReusableComponents(this, fieldName, components);
+ } else {
+ components.setReader(r);
+ }
+ return components.getTokenStream();
}
-
- /** Used by Analyzers that implement reusableTokenStream
- * to save a TokenStream for later re-use by the same
- * thread. */
- protected void setPreviousTokenStream(Object obj) {
- tokenStreams.set(obj);
+
+ /**
+ * Returns a TokenStream suitable for fieldName, tokenizing
+ * the contents of text.
+ *
+ * This method uses {@link #createComponents(String, Reader)} to obtain an
+ * instance of {@link TokenStreamComponents}. It returns the sink of the
+ * components and stores the components internally. Subsequent calls to this
+ * method will reuse the previously stored components after resetting them
+ * through {@link TokenStreamComponents#setReader(Reader)}.
+ *
+ * NOTE: After calling this method, the consumer must follow the
+ * workflow described in {@link TokenStream} to properly consume its contents.
+ * See the {@link org.apache.lucene.analysis Analysis package documentation} for
+ * some examples demonstrating this.
+ *
+ * @param fieldName the name of the field the created TokenStream is used for
+ * @param text the String the stream's source reads from
+ * @return TokenStream for iterating the analyzed content of text
+ * @throws AlreadyClosedException if the Analyzer is closed.
+ * @throws IOException if an i/o error occurs (may rarely happen for strings).
+ * @see #tokenStream(String, Reader)
+ */
+ public final TokenStream tokenStream(final String fieldName, final String text) throws IOException {
+ TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
+ @SuppressWarnings("resource") final ReusableStringReader strReader =
+ (components == null || components.reusableStringReader == null) ?
+ new ReusableStringReader() : components.reusableStringReader;
+ strReader.setValue(text);
+ final Reader r = initReader(fieldName, strReader);
+ if (components == null) {
+ components = createComponents(fieldName, r);
+ reuseStrategy.setReusableComponents(this, fieldName, components);
+ } else {
+ components.setReader(r);
+ }
+ components.reusableStringReader = strReader;
+ return components.getTokenStream();
}
+
+ /**
+ * Override this if you want to add a CharFilter chain.
+ *
+ * The default implementation returns reader
+ * unchanged.
+ *
+ * @param fieldName IndexableField name being indexed
+ * @param reader original Reader
+ * @return reader, optionally decorated with CharFilter(s)
+ */
+ protected Reader initReader(String fieldName, Reader reader) {
+ return reader;
+ }
-
/**
- * Invoked before indexing a Fieldable instance if
+ * Invoked before indexing an IndexableField instance if
* terms have already been added to that field. This allows custom
* analyzers to place an automatic position increment gap between
- * Fieldable instances using the same field name. The default value
+ * IndexableField instances using the same field name. The default value
* position increment gap is 0. With a 0 position increment gap and
* the typical default token position increment of 1, all terms in a field,
- * including across Fieldable instances, are in successive positions, allowing
- * exact PhraseQuery matches, for instance, across Fieldable instance boundaries.
+ * including across IndexableField instances, are in successive positions, allowing
+ * exact PhraseQuery matches, for instance, across IndexableField instance boundaries.
*
- * @param fieldName Fieldable name being indexed.
- * @return position increment gap, added to the next token emitted from {@link #tokenStream(String,Reader)}
+ * @param fieldName IndexableField name being indexed.
+ * @return position increment gap, added to the next token emitted from {@link #tokenStream(String,Reader)}.
+ * This value must be {@code >= 0}.
*/
- public int getPositionIncrementGap(String fieldName)
- {
+ public int getPositionIncrementGap(String fieldName) {
return 0;
}
+
+ /**
+ * Just like {@link #getPositionIncrementGap}, except for
+ * Token offsets instead. By default this returns 1.
+ * This method is only called if the field
+ * produced at least one token for indexing.
+ *
+ * @param fieldName the field just indexed
+ * @return offset gap, added to the next token emitted from {@link #tokenStream(String,Reader)}.
+ * This value must be {@code >= 0}.
+ */
+ public int getOffsetGap(String fieldName) {
+ return 1;
+ }
+
+ /**
+ * Returns the used {@link ReuseStrategy}.
+ */
+ public final ReuseStrategy getReuseStrategy() {
+ return reuseStrategy;
+ }
+
+ /**
+ * Set the version of Lucene whose behavior this analyzer should mimic for analysis.
+ */
+ public void setVersion(Version v) {
+ version = v; // TODO: make write once?
+ }
+
+ /**
+ * Return the version of Lucene whose behavior this analyzer will mimic for analysis.
+ */
+ public Version getVersion() {
+ return version;
+ }
+
+ /** Frees persistent resources used by this Analyzer */
+ @Override
+ public void close() {
+ if (storedValue != null) {
+ storedValue.close();
+ storedValue = null;
+ }
+ }
+
+ /**
+ * This class encapsulates the outer components of a token stream. It provides
+ * access to the source ({@link Tokenizer}) and the outer end (sink), an
+ * instance of {@link TokenFilter} which also serves as the
+ * {@link TokenStream} returned by
+ * {@link Analyzer#tokenStream(String, Reader)}.
+ */
+ public static class TokenStreamComponents {
+ /**
+ * Original source of the tokens.
+ */
+ protected final Tokenizer source;
+ /**
+ * Sink tokenstream, such as the outer tokenfilter decorating
+ * the chain. This can be the source if there are no filters.
+ */
+ protected final TokenStream sink;
+
+ /** Internal cache only used by {@link Analyzer#tokenStream(String, String)}. */
+ transient ReusableStringReader reusableStringReader;
+
+ /**
+ * Creates a new {@link TokenStreamComponents} instance.
+ *
+ * @param source
+ * the analyzer's tokenizer
+ * @param result
+ * the analyzer's resulting token stream
+ */
+ public TokenStreamComponents(final Tokenizer source,
+ final TokenStream result) {
+ this.source = source;
+ this.sink = result;
+ }
+
+ /**
+ * Creates a new {@link TokenStreamComponents} instance.
+ *
+ * @param source
+ * the analyzer's tokenizer
+ */
+ public TokenStreamComponents(final Tokenizer source) {
+ this.source = source;
+ this.sink = source;
+ }
+
+ /**
+ * Resets the encapsulated components with the given reader. If the components
+ * cannot be reset, an Exception should be thrown.
+ *
+ * @param reader
+ * a reader to reset the source component
+ * @throws IOException
+ * if the component's reset method throws an {@link IOException}
+ */
+ protected void setReader(final Reader reader) throws IOException {
+ source.setReader(reader);
+ }
+
+ /**
+ * Returns the sink {@link TokenStream}
+ *
+ * @return the sink {@link TokenStream}
+ */
+ public TokenStream getTokenStream() {
+ return sink;
+ }
+
+ /**
+ * Returns the component's {@link Tokenizer}
+ *
+ * @return Component's {@link Tokenizer}
+ */
+ public Tokenizer getTokenizer() {
+ return source;
+ }
+ }
+
+ /**
+ * Strategy defining how TokenStreamComponents are reused per call to
+ * {@link Analyzer#tokenStream(String, java.io.Reader)}.
+ */
+ public static abstract class ReuseStrategy {
+
+ /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
+ public ReuseStrategy() {}
+
+ /**
+ * Gets the reusable TokenStreamComponents for the field with the given name.
+ *
+ * @param analyzer Analyzer from which to get the reused components. Use
+ * {@link #getStoredValue(Analyzer)} and {@link #setStoredValue(Analyzer, Object)}
+ * to access the data on the Analyzer.
+ * @param fieldName Name of the field whose reusable TokenStreamComponents
+ * are to be retrieved
+ * @return Reusable TokenStreamComponents for the field, or {@code null}
+ * if there were no previous components for the field
+ */
+ public abstract TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName);
+
+ /**
+ * Stores the given TokenStreamComponents as the reusable components for the
+ * field with the given name.
+ *
+ * @param analyzer Analyzer on which the reusable TokenStreamComponents are being stored
+ * @param fieldName Name of the field whose TokenStreamComponents are being set
+ * @param components TokenStreamComponents which are to be reused for the field
+ */
+ public abstract void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components);
+
+ /**
+ * Returns the currently stored value.
+ *
+ * @return Currently stored value or {@code null} if no value is stored
+ * @throws AlreadyClosedException if the Analyzer is closed.
+ */
+ protected final Object getStoredValue(Analyzer analyzer) {
+ if (analyzer.storedValue == null) {
+ throw new AlreadyClosedException("this Analyzer is closed");
+ }
+ return analyzer.storedValue.get();
+ }
+
+ /**
+ * Sets the stored value.
+ *
+ * @param storedValue Value to store
+ * @throws AlreadyClosedException if the Analyzer is closed.
+ */
+ protected final void setStoredValue(Analyzer analyzer, Object storedValue) {
+ if (analyzer.storedValue == null) {
+ throw new AlreadyClosedException("this Analyzer is closed");
+ }
+ analyzer.storedValue.set(storedValue);
+ }
+
+ }
+
+ /**
+ * A predefined {@link ReuseStrategy} that reuses the same components for
+ * every field.
+ */
+ public static final ReuseStrategy GLOBAL_REUSE_STRATEGY = new GlobalReuseStrategy();
+
+ /**
+ * Implementation of {@link ReuseStrategy} that reuses the same components for
+ * every field.
+ * @deprecated This implementation class will be hidden in Lucene 5.0.
+ * Use {@link Analyzer#GLOBAL_REUSE_STRATEGY} instead!
+ */
+ @Deprecated
+ public final static class GlobalReuseStrategy extends ReuseStrategy {
+
+ /** Sole constructor. (For invocation by subclass constructors, typically implicit.)
+ * @deprecated Don't create instances of this class, use {@link Analyzer#GLOBAL_REUSE_STRATEGY} */
+ @Deprecated
+ public GlobalReuseStrategy() {}
+
+ @Override
+ public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
+ return (TokenStreamComponents) getStoredValue(analyzer);
+ }
+
+ @Override
+ public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
+ setStoredValue(analyzer, components);
+ }
+ }
+
+ /**
+ * A predefined {@link ReuseStrategy} that reuses components per-field by
+ * maintaining a Map of TokenStreamComponent per field name.
+ */
+ public static final ReuseStrategy PER_FIELD_REUSE_STRATEGY = new PerFieldReuseStrategy();
+
+ /**
+ * Implementation of {@link ReuseStrategy} that reuses components per-field by
+ * maintaining a Map of TokenStreamComponent per field name.
+ * @deprecated This implementation class will be hidden in Lucene 5.0.
+ * Use {@link Analyzer#PER_FIELD_REUSE_STRATEGY} instead!
+ */
+ @Deprecated
+ public static class PerFieldReuseStrategy extends ReuseStrategy {
+
+ /** Sole constructor. (For invocation by subclass constructors, typically implicit.)
+ * @deprecated Don't create instances of this class, use {@link Analyzer#PER_FIELD_REUSE_STRATEGY} */
+ @Deprecated
+ public PerFieldReuseStrategy() {}
+
+ @SuppressWarnings("unchecked")
+ @Override
+ public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
+ Map<String, TokenStreamComponents> componentsPerField = (Map<String, TokenStreamComponents>) getStoredValue(analyzer);
+ return componentsPerField != null ? componentsPerField.get(fieldName) : null;
+ }
+
+ @SuppressWarnings("unchecked")
+ @Override
+ public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
+ Map<String, TokenStreamComponents> componentsPerField = (Map<String, TokenStreamComponents>) getStoredValue(analyzer);
+ if (componentsPerField == null) {
+ componentsPerField = new HashMap<>();
+ setStoredValue(analyzer, componentsPerField);
+ }
+ componentsPerField.put(fieldName, components);
+ }
+ }
+
}
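The reuse machinery in the Analyzer diff above is easier to see outside the diff. The sketch below is a standalone, hypothetical analogue in plain Java with no Lucene dependency (all class names here are illustrative stand-ins, not Lucene APIs): it mimics how `tokenStream()` first asks the ReuseStrategy for cached components, creating and storing them only on the first call, and contrasts the global strategy (one components instance shared by every field) with the per-field strategy (a Map keyed by field name).

```java
import java.util.HashMap;
import java.util.Map;

// Standalone analogue of Analyzer's reuse policies. "Components" stands in
// for Lucene's TokenStreamComponents; a plain ThreadLocal stands in for the
// CloseableThreadLocal-backed storedValue.
class ReuseSketch {
  static class Components {
    final String field;
    Components(String field) { this.field = field; }
  }

  interface ReuseStrategy {
    Components get(String field);
    void set(String field, Components c);
  }

  /** One shared components instance per thread, whatever the field name. */
  static class GlobalReuse implements ReuseStrategy {
    private final ThreadLocal<Components> stored = new ThreadLocal<>();
    public Components get(String field) { return stored.get(); }
    public void set(String field, Components c) { stored.set(c); }
  }

  /** One components instance per field name, per thread. */
  static class PerFieldReuse implements ReuseStrategy {
    private final ThreadLocal<Map<String, Components>> stored =
        ThreadLocal.withInitial(HashMap::new);
    public Components get(String field) { return stored.get().get(field); }
    public void set(String field, Components c) { stored.get().put(field, c); }
  }

  /** Mirrors the shape of Analyzer.tokenStream: create on first use, reuse after. */
  static Components tokenStream(ReuseStrategy strategy, String field) {
    Components c = strategy.get(field);
    if (c == null) {
      c = new Components(field);
      strategy.set(field, c);
    }
    return c;
  }
}
```

Note the design consequence this models: under the global strategy a lookup for one field can hand back components built for another, which is only safe for analyzers that behave identically for every field. This is also why the diff exposes the shared `GLOBAL_REUSE_STRATEGY` and `PER_FIELD_REUSE_STRATEGY` constants and deprecates constructing the strategy classes directly.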
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/AnalyzerWrapper.java'.
Fisheye: No comparison available. Pass `N' to diff?
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/CachingTokenFilter.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/CachingTokenFilter.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/CachingTokenFilter.java 17 Aug 2012 14:55:08 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/CachingTokenFilter.java 16 Dec 2014 11:31:58 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -22,52 +22,77 @@
import java.util.LinkedList;
import java.util.List;
+import org.apache.lucene.util.AttributeSource;
+
/**
- * This class can be used if the Tokens of a TokenStream
+ * This class can be used if the token attributes of a TokenStream
* are intended to be consumed more than once. It caches
- * all Tokens locally in a List.
+ * all token attribute states locally in a List.
*
- * CachingTokenFilter implements the optional method
+ * CachingTokenFilter implements the optional method
* {@link TokenStream#reset()}, which repositions the
* stream to the first Token.
- *
*/
-public class CachingTokenFilter extends TokenFilter {
- private List cache;
- private Iterator iterator;
+public final class CachingTokenFilter extends TokenFilter {
+ private List<AttributeSource.State> cache = null;
+ private Iterator<AttributeSource.State> iterator = null;
+ private AttributeSource.State finalState;
+ /**
+ * Create a new CachingTokenFilter around input,
+ * caching its token attributes, which can be replayed again
+ * after a call to {@link #reset()}.
+ */
public CachingTokenFilter(TokenStream input) {
super(input);
}
- public Token next(final Token reusableToken) throws IOException {
- assert reusableToken != null;
+ @Override
+ public final boolean incrementToken() throws IOException {
if (cache == null) {
// fill cache lazily
- cache = new LinkedList();
- fillCache(reusableToken);
+ cache = new LinkedList<>();
+ fillCache();
iterator = cache.iterator();
}
if (!iterator.hasNext()) {
- // the cache is exhausted, return null
- return null;
+ // the cache is exhausted, return false
+ return false;
}
// Since the TokenFilter can be reset, the tokens need to be preserved as immutable.
- Token nextToken = (Token) iterator.next();
- return (Token) nextToken.clone();
+ restoreState(iterator.next());
+ return true;
}
- public void reset() throws IOException {
+ @Override
+ public final void end() {
+ if (finalState != null) {
+ restoreState(finalState);
+ }
+ }
+
+ /**
+ * Rewinds the iterator to the beginning of the cached list.
+ *
+ * Note that this does not call reset() on the wrapped tokenstream ever, even
+ * the first time. You should reset() the inner tokenstream before wrapping
+ * it with CachingTokenFilter.
+ */
+ @Override
+ public void reset() {
if(cache != null) {
- iterator = cache.iterator();
+ iterator = cache.iterator();
}
}
- private void fillCache(final Token reusableToken) throws IOException {
- for (Token nextToken = input.next(reusableToken); nextToken != null; nextToken = input.next(reusableToken)) {
- cache.add(nextToken.clone());
+ private void fillCache() throws IOException {
+ while(input.incrementToken()) {
+ cache.add(captureState());
}
+ // capture final state
+ input.end();
+ finalState = captureState();
}
}
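The CachingTokenFilter rewrite above replaces per-Token cloning with captured attribute states. The following standalone sketch (plain Java, no Lucene types; strings stand in for captured AttributeSource.State objects, and the class is hypothetical) shows the contract the diff documents: the cache fills lazily by draining the input on the first `incrementToken()` pass, and `reset()` only rewinds the cache iterator, never the wrapped input, which is why the new javadoc tells callers to reset() the inner stream before wrapping it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Standalone analogue of the caching/replay pattern in CachingTokenFilter.
class CachingSketch {
  private final Iterator<String> input; // wrapped token source
  private List<String> cache;           // filled lazily on first pass
  private Iterator<String> iterator;    // replay position in the cache
  private String current;               // last "restored" state

  CachingSketch(Iterator<String> input) { this.input = input; }

  boolean incrementToken() {
    if (cache == null) {                // fill cache lazily
      cache = new ArrayList<>();
      while (input.hasNext()) cache.add(input.next());
      iterator = cache.iterator();
    }
    if (!iterator.hasNext()) return false; // cache exhausted
    current = iterator.next();             // replay a captured state
    return true;
  }

  /** Rewinds the cache iterator only; the wrapped input is never reset. */
  void reset() {
    if (cache != null) iterator = cache.iterator();
  }

  String current() { return current; }
}
```

A consumer can therefore iterate the same tokens any number of times, separated by reset() calls, while the underlying source is consumed exactly once.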
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/CharArraySet.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/CharFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/CharTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/DelegatingAnalyzerWrapper.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ISOLatin1AccentFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/KeywordAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/KeywordTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/LengthFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/LetterTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/LowerCaseFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/LowerCaseTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/NumericTokenStream.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/PorterStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/PorterStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ReusableStringReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/SimpleAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/SinkTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/StopAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/StopFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/TeeTokenFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/Token.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/Token.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/Token.java 17 Aug 2012 14:55:07 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/Token.java 16 Dec 2014 11:31:57 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,165 +17,63 @@
* limitations under the License.
*/
-import org.apache.lucene.index.Payload;
-import org.apache.lucene.index.TermPositions; // for javadoc
-import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
+import org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl;
+import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
+import org.apache.lucene.index.DocsAndPositionsEnum; // for javadoc
+import org.apache.lucene.util.Attribute;
+import org.apache.lucene.util.AttributeFactory;
+import org.apache.lucene.util.AttributeImpl;
+import org.apache.lucene.util.AttributeReflector;
+import org.apache.lucene.util.BytesRef;
-/** A Token is an occurrence of a term from the text of a field. It consists of
+/**
+ A Token is an occurrence of a term from the text of a field. It consists of
a term's text, the start and end offset of the term in the text of the field,
and a type string.
The start and end offsets permit applications to re-associate a token with
its source text, e.g., to display highlighted query terms in a document
- browser, or to show matching text fragments in a KWIC (KeyWord In Context)
+ browser, or to show matching text fragments in a KWIC
display, etc.
The type is a string, assigned by a lexical analyzer
(a.k.a. tokenizer), naming the lexical or syntactic class that the token
belongs to. For example an end of sentence marker token might be implemented
with type "eos". The default token type is "word".
- A Token can optionally have metadata (a.k.a. Payload) in the form of a variable
- length byte array. Use {@link TermPositions#getPayloadLength()} and
- {@link TermPositions#getPayload(byte[], int)} to retrieve the payloads from the index.
+ A Token can optionally have metadata (a.k.a. payload) in the form of a variable
+ length byte array. Use {@link DocsAndPositionsEnum#getPayload()} to retrieve the
+ payloads from the index.
-
- WARNING: The status of the Payloads feature is experimental.
- The APIs introduced here might change in the future and will not be
- supported anymore in such a case.
-
-
-
-
NOTE: As of 2.3, Token stores the term text
- internally as a malleable char[] termBuffer instead of
- String termText. The indexing code and core tokenizers
- have been changed to re-use a single Token instance, changing
- its buffer and other fields in-place as the Token is
- processed. This provides substantially better indexing
- performance as it saves the GC cost of new'ing a Token and
- String for every term. The APIs that accept String
- termText are still available but a warning about the
- associated performance cost has been added (below). The
- {@link #termText()} method has been deprecated.
- Tokenizers and filters should try to re-use a Token
- instance when possible for best performance, by
- implementing the {@link TokenStream#next(Token)} API.
- Failing that, to create a new Token you should first use
- one of the constructors that starts with null text. To load
- the token from a char[] use {@link #setTermBuffer(char[], int, int)}.
- To load from a String use {@link #setTermBuffer(String)} or {@link #setTermBuffer(String, int, int)}.
- Alternatively you can get the Token's termBuffer by calling either {@link #termBuffer()},
- if you know that your text is shorter than the capacity of the termBuffer
- or {@link #resizeTermBuffer(int)}, if there is any possibility
- that you may need to grow the buffer. Fill in the characters of your term into this
- buffer, with {@link String#getChars(int, int, char[], int)} if loading from a string,
- or with {@link System#arraycopy(Object, int, Object, int, int)}, and finally call {@link #setTermLength(int)} to
- set the length of the term text. See LUCENE-969
- for details.
- Typical reuse patterns:
-
- Copying text from a string (type is reset to #DEFAULT_TYPE if not specified):
-
- return reusableToken.reinit(string, startOffset, endOffset[, type]);
-
-
- Copying some text from a string (type is reset to #DEFAULT_TYPE if not specified):
-
- return reusableToken.reinit(string, 0, string.length(), startOffset, endOffset[, type]);
-
-
-
- Copying text from char[] buffer (type is reset to #DEFAULT_TYPE if not specified):
-
- return reusableToken.reinit(buffer, 0, buffer.length, startOffset, endOffset[, type]);
-
-
- Copying some text from a char[] buffer (type is reset to #DEFAULT_TYPE if not specified):
-
- return reusableToken.reinit(buffer, start, end - start, startOffset, endOffset[, type]);
-
-
- Copying from one one Token to another (type is reset to #DEFAULT_TYPE if not specified):
-
- return reusableToken.reinit(source.termBuffer(), 0, source.termLength(), source.startOffset(), source.endOffset()[, source.type()]);
-
-
-
+ NOTE: As of 2.9, Token implements all {@link Attribute} interfaces
+ that are part of core Lucene and can be found in the {@code tokenattributes} subpackage.
+ Even though it is not necessary to use Token anymore, with the new TokenStream API it can
+ be used as convenience class that implements all {@link Attribute}s, which is especially useful
+ to easily switch from the old to the new TokenStream API.
+
A few things to note:
- clear() initializes most of the fields to default values, but not startOffset, endOffset and type.
+ clear() initializes all of the fields to default values. This was changed in contrast to Lucene 2.4, but should affect no one.
Because TokenStreams
can be chained, one cannot assume that the Token's
current type is correct.
- The startOffset and endOffset represent the start and offset in the source text. So be careful in adjusting them.
+ The startOffset and endOffset represent the start and end offset in the source text, so be careful in adjusting them.
When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again.
-
- @see org.apache.lucene.index.Payload
+
+ Please note: With Lucene 3.1, the {@linkplain #toString toString()} method had to be changed to match the
+ {@link CharSequence} interface introduced by the interface {@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute}.
+ This method now only prints the term text, with no additional information.
+
+ @deprecated This class is outdated and no longer used since Lucene 2.9. Nuke it finally!
*/
-public class Token implements Cloneable {
+@Deprecated
+public class Token extends PackedTokenAttributeImpl implements FlagsAttribute, PayloadAttribute {
- public static final String DEFAULT_TYPE = "word";
-
- private static int MIN_BUFFER_SIZE = 10;
-
- /** @deprecated We will remove this when we remove the
- * deprecated APIs */
- private String termText;
-
- /**
- * Characters for the term text.
- * @deprecated This will be made private. Instead, use:
- * {@link termBuffer()},
- * {@link #setTermBuffer(char[], int, int)},
- * {@link #setTermBuffer(String)}, or
- * {@link #setTermBuffer(String, int, int)}
- */
- char[] termBuffer;
-
- /**
- * Length of term text in the buffer.
- * @deprecated This will be made private. Instead, use:
- * {@link termLength()}, or @{link setTermLength(int)}.
- */
- int termLength;
-
- /**
- * Start in source text.
- * @deprecated This will be made private. Instead, use:
- * {@link startOffset()}, or @{link setStartOffset(int)}.
- */
- int startOffset;
-
- /**
- * End in source text.
- * @deprecated This will be made private. Instead, use:
- * {@link endOffset()}, or @{link setEndOffset(int)}.
- */
- int endOffset;
-
- /**
- * The lexical type of the token.
- * @deprecated This will be made private. Instead, use:
- * {@link type()}, or @{link setType(String)}.
- */
- String type = DEFAULT_TYPE;
-
private int flags;
-
- /**
- * @deprecated This will be made private. Instead, use:
- * {@link getPayload()}, or @{link setPayload(Payload)}.
- */
- Payload payload;
-
- /**
- * @deprecated This will be made private. Instead, use:
- * {@link getPositionIncrement()}, or @{link setPositionIncrement(String)}.
- */
- int positionIncrement = 1;
+ private BytesRef payload;
/** Constructs a Token with null text. */
public Token() {
@@ -186,8 +84,7 @@
* @param start start offset in the source text
* @param end end offset in the source text */
public Token(int start, int end) {
- startOffset = start;
- endOffset = end;
+ setOffset(start, end);
}
/** Constructs a Token with null text and start & end
@@ -196,9 +93,8 @@
* @param end end offset in the source text
* @param typ the lexical type of this Token */
public Token(int start, int end, String typ) {
- startOffset = start;
- endOffset = end;
- type = typ;
+ setOffset(start, end);
+ setType(typ);
}
/**
@@ -209,9 +105,8 @@
* @param flags The bits to set for this token
*/
public Token(int start, int end, int flags) {
- startOffset = start;
- endOffset = end;
- this.flags = flags;
+ setOffset(start, end);
+ setFlags(flags);
}
/** Constructs a Token with the given term text, and start
@@ -220,640 +115,273 @@
* instead use the char[] termBuffer methods to set the
* term text.
* @param text term text
- * @param start start offset
- * @param end end offset
- * @deprecated
+ * @param start start offset in the source text
+ * @param end end offset in the source text
*/
- public Token(String text, int start, int end) {
- termText = text;
- startOffset = start;
- endOffset = end;
+ public Token(CharSequence text, int start, int end) {
+ append(text);
+ setOffset(start, end);
}
/** Constructs a Token with the given text, start and end
* offsets, & type. NOTE: for better indexing
* speed you should instead use the char[] termBuffer
* methods to set the term text.
* @param text term text
- * @param start start offset
- * @param end end offset
+ * @param start start offset in the source text
+ * @param end end offset in the source text
* @param typ token type
- * @deprecated
*/
public Token(String text, int start, int end, String typ) {
- termText = text;
- startOffset = start;
- endOffset = end;
- type = typ;
+ append(text);
+ setOffset(start, end);
+ setType(typ);
}
/**
* Constructs a Token with the given text, start and end
* offsets, & type. NOTE: for better indexing
* speed you should instead use the char[] termBuffer
* methods to set the term text.
- * @param text
- * @param start
- * @param end
+ * @param text term text
+ * @param start start offset in the source text
+ * @param end end offset in the source text
* @param flags token type bits
- * @deprecated
*/
public Token(String text, int start, int end, int flags) {
- termText = text;
- startOffset = start;
- endOffset = end;
- this.flags = flags;
+ append(text);
+ setOffset(start, end);
+ setFlags(flags);
}
/**
* Constructs a Token with the given term buffer (offset
* & length), start and end
* offsets
- * @param startTermBuffer
- * @param termBufferOffset
- * @param termBufferLength
- * @param start
- * @param end
+ * @param startTermBuffer buffer containing term text
+ * @param termBufferOffset the index in the buffer of the first character
+ * @param termBufferLength number of valid characters in the buffer
+ * @param start start offset in the source text
+ * @param end end offset in the source text
*/
public Token(char[] startTermBuffer, int termBufferOffset, int termBufferLength, int start, int end) {
- setTermBuffer(startTermBuffer, termBufferOffset, termBufferLength);
- startOffset = start;
- endOffset = end;
+ copyBuffer(startTermBuffer, termBufferOffset, termBufferLength);
+ setOffset(start, end);
}
- /** Set the position increment. This determines the position of this token
- * relative to the previous Token in a {@link TokenStream}, used in phrase
- * searching.
- *
- * The default value is one.
- *
- *
- * Some common uses for this are:
- *
- * Set it to zero to put multiple terms in the same position. This is
- * useful if, e.g., a word has multiple stems. Searches for phrases
- * including either stem will match. In this case, all but the first stem's
- * increment should be set to zero: the increment of the first instance
- * should be one. Repeating a token with an increment of zero can also be
- * used to boost the scores of matches on that token.
- *
- * Set it to values greater than one to inhibit exact phrase matches.
- * If, for example, one does not want phrases to match across removed stop
- * words, then one could build a stop word filter that removes stop words and
- * also sets the increment to the number of stop words removed before each
- * non-stop word. Then exact phrase queries will only match when the terms
- * occur with no intervening stop words.
- *
- *
- * @param positionIncrement the distance from the prior term
- * @see org.apache.lucene.index.TermPositions
- */
- public void setPositionIncrement(int positionIncrement) {
- if (positionIncrement < 0)
- throw new IllegalArgumentException
- ("Increment must be zero or greater: " + positionIncrement);
- this.positionIncrement = positionIncrement;
- }
-
- /** Returns the position increment of this Token.
- * @see #setPositionIncrement
- */
- public int getPositionIncrement() {
- return positionIncrement;
- }
-
- /** Sets the Token's term text. NOTE: for better
- * indexing speed you should instead use the char[]
- * termBuffer methods to set the term text.
- * @deprecated use {@link #setTermBuffer(char[], int, int)} or
- * {@link #setTermBuffer(String)} or
- * {@link #setTermBuffer(String, int, int)}.
- */
- public void setTermText(String text) {
- termText = text;
- termBuffer = null;
- }
-
- /** Returns the Token's term text.
- *
- * @deprecated This method now has a performance penalty
- * because the text is stored internally in a char[]. If
- * possible, use {@link #termBuffer()} and {@link
- * #termLength()} directly instead. If you really need a
- * String, use {@link #term()}
- */
- public final String termText() {
- if (termText == null && termBuffer != null)
- termText = new String(termBuffer, 0, termLength);
- return termText;
- }
-
- /** Returns the Token's term text.
- *
- * This method has a performance penalty
- * because the text is stored internally in a char[]. If
- * possible, use {@link #termBuffer()} and {@link
- * #termLength()} directly instead. If you really need a
- * String, use this method, which is nothing more than
- * a convenience call to new String(token.termBuffer(), 0, token.termLength())
- */
- public final String term() {
- if (termText != null)
- return termText;
- initTermBuffer();
- return new String(termBuffer, 0, termLength);
- }
-
- /** Copies the contents of buffer, starting at offset for
- * length characters, into the termBuffer array.
- * @param buffer the buffer to copy
- * @param offset the index in the buffer of the first character to copy
- * @param length the number of characters to copy
- */
- public final void setTermBuffer(char[] buffer, int offset, int length) {
- termText = null;
- char[] newCharBuffer = growTermBuffer(length);
- if (newCharBuffer != null) {
- termBuffer = newCharBuffer;
- }
- System.arraycopy(buffer, offset, termBuffer, 0, length);
- termLength = length;
- }
-
- /** Copies the contents of buffer into the termBuffer array.
- * @param buffer the buffer to copy
- */
- public final void setTermBuffer(String buffer) {
- termText = null;
- int length = buffer.length();
- char[] newCharBuffer = growTermBuffer(length);
- if (newCharBuffer != null) {
- termBuffer = newCharBuffer;
- }
- buffer.getChars(0, length, termBuffer, 0);
- termLength = length;
- }
-
- /** Copies the contents of buffer, starting at offset and continuing
- * for length characters, into the termBuffer array.
- * @param buffer the buffer to copy
- * @param offset the index in the buffer of the first character to copy
- * @param length the number of characters to copy
- */
- public final void setTermBuffer(String buffer, int offset, int length) {
- assert offset <= buffer.length();
- assert offset + length <= buffer.length();
- termText = null;
- char[] newCharBuffer = growTermBuffer(length);
- if (newCharBuffer != null) {
- termBuffer = newCharBuffer;
- }
- buffer.getChars(offset, offset + length, termBuffer, 0);
- termLength = length;
- }
-
- /** Returns the internal termBuffer character array which
- * you can then directly alter. If the array is too
- * small for your token, use {@link
- * #resizeTermBuffer(int)} to increase it. After
- * altering the buffer be sure to call {@link
- * #setTermLength} to record the number of valid
- * characters that were placed into the termBuffer. */
- public final char[] termBuffer() {
- initTermBuffer();
- return termBuffer;
- }
-
- /** Grows the termBuffer to at least size newSize, preserving the
- * existing content. Note: If the next operation is to change
- * the contents of the term buffer use
- * {@link #setTermBuffer(char[], int, int)},
- * {@link #setTermBuffer(String)}, or
- * {@link #setTermBuffer(String, int, int)}
- * to optimally combine the resize with the setting of the termBuffer.
- * @param newSize minimum size of the new termBuffer
- * @return newly created termBuffer with length >= newSize
- */
- public char[] resizeTermBuffer(int newSize) {
- char[] newCharBuffer = growTermBuffer(newSize);
- if (termBuffer == null) {
- // If there were termText, then preserve it.
- // note that if termBuffer is null then newCharBuffer cannot be null
- assert newCharBuffer != null;
- if (termText != null) {
- termText.getChars(0, termText.length(), newCharBuffer, 0);
- }
- termBuffer = newCharBuffer;
- } else if (newCharBuffer != null) {
- // Note: if newCharBuffer != null then termBuffer needs to grow.
- // If there were a termBuffer, then preserve it
- System.arraycopy(termBuffer, 0, newCharBuffer, 0, termBuffer.length);
- termBuffer = newCharBuffer;
- }
- termText = null;
- return termBuffer;
- }
-
- /** Allocates a buffer char[] of at least newSize
- * @param newSize minimum size of the buffer
- * @return newly created buffer with length >= newSize or null if the current termBuffer is big enough
- */
- private char[] growTermBuffer(int newSize) {
- if (termBuffer != null) {
- if (termBuffer.length >= newSize)
- // Already big enough
- return null;
- else
- // Not big enough; create a new array with slight
- // over allocation:
- return new char[ArrayUtil.getNextSize(newSize)];
- } else {
-
- // determine the best size
- // The buffer is always at least MIN_BUFFER_SIZE
- if (newSize < MIN_BUFFER_SIZE) {
- newSize = MIN_BUFFER_SIZE;
- }
-
- // If there is already a termText, then the size has to be at least that big
- if (termText != null) {
- int ttLength = termText.length();
- if (newSize < ttLength) {
- newSize = ttLength;
- }
- }
-
- return new char[newSize];
- }
- }
-
- // TODO: once we remove the deprecated termText() method
- // and switch entirely to char[] termBuffer we don't need
- // to use this method anymore
- private void initTermBuffer() {
- if (termBuffer == null) {
- if (termText == null) {
- termBuffer = new char[MIN_BUFFER_SIZE];
- termLength = 0;
- } else {
- int length = termText.length();
- if (length < MIN_BUFFER_SIZE) length = MIN_BUFFER_SIZE;
- termBuffer = new char[length];
- termLength = termText.length();
- termText.getChars(0, termText.length(), termBuffer, 0);
- termText = null;
- }
- } else if (termText != null)
- termText = null;
- }
-
- /** Return number of valid characters (length of the term)
- * in the termBuffer array. */
- public final int termLength() {
- initTermBuffer();
- return termLength;
- }
-
- /** Set number of valid characters (length of the term) in
- * the termBuffer array. Use this to truncate the termBuffer
- * or to synchronize with external manipulation of the termBuffer.
- * Note: to grow the size of the array,
- * use {@link #resizeTermBuffer(int)} first.
- * @param length the truncated length
- */
- public final void setTermLength(int length) {
- initTermBuffer();
- if (length > termBuffer.length)
- throw new IllegalArgumentException("length " + length + " exceeds the size of the termBuffer (" + termBuffer.length + ")");
- termLength = length;
- }
-
- /** Returns this Token's starting offset, the position of the first character
- corresponding to this token in the source text.
-
- Note that the difference between endOffset() and startOffset() may not be
- equal to termText.length(), as the term text may have been altered by a
- stemmer or some other filter. */
- public final int startOffset() {
- return startOffset;
- }
-
- /** Set the starting offset.
- @see #startOffset() */
- public void setStartOffset(int offset) {
- this.startOffset = offset;
- }
-
- /** Returns this Token's ending offset, one greater than the position of the
- last character corresponding to this token in the source text. The length
- of the token in the source text is (endOffset - startOffset). */
- public final int endOffset() {
- return endOffset;
- }
-
- /** Set the ending offset.
- @see #endOffset() */
- public void setEndOffset(int offset) {
- this.endOffset = offset;
- }
-
- /** Returns this Token's lexical type. Defaults to "word". */
- public final String type() {
- return type;
- }
-
- /** Set the lexical type.
- @see #type() */
- public final void setType(String type) {
- this.type = type;
- }
-
/**
- * EXPERIMENTAL: While we think this is here to stay, we may want to change it to be a long.
- *
- *
- * Get the bitset for any bits that have been set. This is completely distinct from {@link #type()}, although they do share similar purposes.
- * The flags can be used to encode information about the token for use by other {@link org.apache.lucene.analysis.TokenFilter}s.
- *
- *
- * @return The bits
+ * {@inheritDoc}
+ * @see FlagsAttribute
*/
+ @Override
public int getFlags() {
return flags;
}
/**
- * @see #getFlags()
+ * {@inheritDoc}
+ * @see FlagsAttribute
*/
+ @Override
public void setFlags(int flags) {
this.flags = flags;
}
/**
- * Returns this Token's payload.
- */
- public Payload getPayload() {
+ * {@inheritDoc}
+ * @see PayloadAttribute
+ */
+ @Override
+ public BytesRef getPayload() {
return this.payload;
}
- /**
- * Sets this Token's payload.
+ /**
+ * {@inheritDoc}
+ * @see PayloadAttribute
*/
- public void setPayload(Payload payload) {
+ @Override
+ public void setPayload(BytesRef payload) {
this.payload = payload;
}
- public String toString() {
- StringBuffer sb = new StringBuffer();
- sb.append('(');
- initTermBuffer();
- if (termBuffer == null)
- sb.append("null");
- else
- sb.append(termBuffer, 0, termLength);
- sb.append(',').append(startOffset).append(',').append(endOffset);
- if (!type.equals("word"))
- sb.append(",type=").append(type);
- if (positionIncrement != 1)
- sb.append(",posIncr=").append(positionIncrement);
- sb.append(')');
- return sb.toString();
- }
-
- /** Resets the term text, payload, flags, and positionIncrement to default.
- * Other fields such as startOffset, endOffset and the token type are
- * not reset since they are normally overwritten by the tokenizer. */
+ /** Resets the term text, payload, flags, positionIncrement, positionLength,
+ * startOffset, endOffset and token type to default.
+ */
+ @Override
public void clear() {
- payload = null;
- // Leave termBuffer to allow re-use
- termLength = 0;
- termText = null;
- positionIncrement = 1;
+ super.clear();
flags = 0;
- // startOffset = endOffset = 0;
- // type = DEFAULT_TYPE;
+ payload = null;
}
- public Object clone() {
- try {
- Token t = (Token)super.clone();
- // Do a deep clone
- if (termBuffer != null) {
- t.termBuffer = (char[]) termBuffer.clone();
- }
- if (payload != null) {
- t.setPayload((Payload) payload.clone());
- }
- return t;
- } catch (CloneNotSupportedException e) {
- throw new RuntimeException(e); // shouldn't happen
+ @Override
+ public Token clone() {
+ Token t = (Token)super.clone();
+ // Do a deep clone
+ if (payload != null) {
+ t.payload = payload.clone();
}
- }
-
- /** Makes a clone, but replaces the term buffer &
- * start/end offset in the process. This is more
- * efficient than doing a full clone (and then calling
- * setTermBuffer) because it saves a wasted copy of the old
- * termBuffer. */
- public Token clone(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset) {
- final Token t = new Token(newTermBuffer, newTermOffset, newTermLength, newStartOffset, newEndOffset);
- t.positionIncrement = positionIncrement;
- t.flags = flags;
- t.type = type;
- if (payload != null)
- t.payload = (Payload) payload.clone();
return t;
}
+ @Override
public boolean equals(Object obj) {
if (obj == this)
return true;
if (obj instanceof Token) {
- Token other = (Token) obj;
-
- initTermBuffer();
- other.initTermBuffer();
-
- if (termLength == other.termLength &&
- startOffset == other.startOffset &&
- endOffset == other.endOffset &&
- flags == other.flags &&
- positionIncrement == other.positionIncrement &&
- subEqual(type, other.type) &&
- subEqual(payload, other.payload)) {
- for(int i=0;iToken as implementation for the basic
+ * attributes and return the default impl (with "Impl" appended) for all other
+ * attributes.
+ * @since 3.0
+ */
+ public static final AttributeFactory TOKEN_ATTRIBUTE_FACTORY =
+ AttributeFactory.getStaticImplementation(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, Token.class);
}
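The Token javadoc above warns: "When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again." The sketch below illustrates why, using a hypothetical `MiniToken` stand-in (not Lucene's actual `Token` class): the producer mutates one reused instance, so a consumer that stores references without cloning would end up with N copies of the last token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative analog of Lucene's reuse-and-clone contract; MiniToken is
// hypothetical, not the real org.apache.lucene.analysis.Token.
public class MiniTokenDemo {

  /** A mutable token that a producer reuses across calls. */
  static final class MiniToken implements Cloneable {
    char[] buffer = new char[16];
    int length, startOffset, endOffset;

    void reinit(String text, int start, int end) {
      if (buffer.length < text.length()) buffer = new char[text.length()];
      text.getChars(0, text.length(), buffer, 0);
      length = text.length();
      startOffset = start;
      endOffset = end;
    }

    String term() { return new String(buffer, 0, length); }

    @Override
    protected MiniToken clone() {
      try {
        MiniToken t = (MiniToken) super.clone();
        t.buffer = buffer.clone();   // deep-clone the term buffer
        return t;
      } catch (CloneNotSupportedException e) {
        throw new RuntimeException(e); // cannot happen: we implement Cloneable
      }
    }
  }

  /** Caching consumer: each stored token must be a clone, because the
   *  producer keeps overwriting the same reusable instance. */
  static List<String> cacheTerms(String text) {
    List<MiniToken> cache = new ArrayList<>();
    MiniToken reusable = new MiniToken();      // one instance, reused per token
    int pos = 0;
    for (String word : text.split(" ")) {
      reusable.reinit(word, pos, pos + word.length());
      cache.add(reusable.clone());             // clone before storing away
      pos += word.length() + 1;
    }
    List<String> terms = new ArrayList<>();
    for (MiniToken t : cache) terms.add(t.term());
    return terms;
  }

  public static void main(String[] args) {
    System.out.println(cacheTerms("quick brown fox"));
  }
}
```

Dropping the `clone()` call would make every cached reference point at the same mutated object, mirroring the bug the javadoc's cloning rule prevents.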
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/TokenFilter.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/TokenFilter.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/TokenFilter.java 17 Aug 2012 14:55:08 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/TokenFilter.java 16 Dec 2014 11:31:57 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -19,30 +19,54 @@
import java.io.IOException;
-/** A TokenFilter is a TokenStream whose input is another token stream.
+/** A TokenFilter is a TokenStream whose input is another TokenStream.
- This is an abstract class.
- NOTE: subclasses must override {@link #next(Token)}. It's
- also OK to instead override {@link #next()} but that
- method is now deprecated in favor of {@link #next(Token)}.
+ This is an abstract class; subclasses must override {@link #incrementToken()}.
+ @see TokenStream
*/
public abstract class TokenFilter extends TokenStream {
/** The source of tokens for this filter. */
- protected TokenStream input;
+ protected final TokenStream input;
/** Construct a token stream filtering the given input. */
protected TokenFilter(TokenStream input) {
+ super(input);
this.input = input;
}
-
- /** Close the input TokenStream. */
+
+ /**
+ * {@inheritDoc}
+ *
+ * NOTE:
+ * The default implementation chains the call to the input TokenStream, so
+ * be sure to call super.end() first when overriding this method.
+ */
+ @Override
+ public void end() throws IOException {
+ input.end();
+ }
+
+ /**
+ * {@inheritDoc}
+ *
+ * NOTE:
+ * The default implementation chains the call to the input TokenStream, so
+ * be sure to call super.close() when overriding this method.
+ */
+ @Override
public void close() throws IOException {
input.close();
}
- /** Reset the filter as well as the input TokenStream. */
+ /**
+ * {@inheritDoc}
+ *
+ * NOTE:
+ * The default implementation chains the call to the input TokenStream, so
+ * be sure to call super.reset() when overriding this method.
+ */
+ @Override
public void reset() throws IOException {
- super.reset();
input.reset();
}
}
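The chaining contract documented in this TokenFilter diff (end(), close(), and reset() each forward to the input stream) can be sketched with self-contained stand-ins. The `Mini*` classes below are hypothetical illustrations of the decorator pattern, not the real Lucene `TokenStream`/`TokenFilter` API.

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative decorator chain: a filter is a stream whose input is another
// stream, and lifecycle calls (reset/close) propagate down the chain.
public class MiniFilterDemo {

  static abstract class MiniStream {
    abstract String next();              // null at end of stream
    void reset() {}
    void close() {}
  }

  static class MiniTokenizer extends MiniStream {
    private final String[] words;
    private Iterator<String> it;
    MiniTokenizer(String text) { words = text.split(" "); reset(); }
    @Override void reset() { it = Arrays.asList(words).iterator(); }
    @Override String next() { return it.hasNext() ? it.next() : null; }
  }

  /** Analog of TokenFilter: holds a final input and chains lifecycle calls. */
  static abstract class MiniFilter extends MiniStream {
    protected final MiniStream input;
    MiniFilter(MiniStream input) { this.input = input; }
    @Override void reset() { input.reset(); }   // chain, like TokenFilter.reset()
    @Override void close() { input.close(); }   // chain, like TokenFilter.close()
  }

  static class UpperCaseFilter extends MiniFilter {
    UpperCaseFilter(MiniStream in) { super(in); }
    @Override String next() {
      String t = input.next();
      return t == null ? null : t.toUpperCase();
    }
  }

  /** Consumer: resetting the outermost filter resets the whole chain. */
  static String consumeAll(MiniStream s) {
    s.reset();
    StringBuilder sb = new StringBuilder();
    for (String t = s.next(); t != null; t = s.next()) sb.append(t).append(' ');
    s.close();
    return sb.toString().trim();
  }

  static String demo(String text) {
    return consumeAll(new UpperCaseFilter(new MiniTokenizer(text)));
  }

  public static void main(String[] args) {
    System.out.println(demo("a b c"));
  }
}
```

A subclass that overrides `reset()` without calling the superclass version would leave the wrapped tokenizer un-reset, which is exactly why the patched javadoc insists on `super.reset()`.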
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/TokenStream.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/TokenStream.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/TokenStream.java 17 Aug 2012 14:55:07 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/TokenStream.java 16 Dec 2014 11:31:57 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,94 +17,192 @@
* limitations under the License.
*/
-import org.apache.lucene.index.Payload;
-
import java.io.IOException;
+import java.io.Closeable;
+import java.lang.reflect.Modifier;
-/** A TokenStream enumerates the sequence of tokens, either from
- fields of a document or from query text.
-
- This is an abstract class. Concrete subclasses are:
-
- {@link Tokenizer}, a TokenStream
- whose input is a Reader; and
- {@link TokenFilter}, a TokenStream
- whose input is another TokenStream.
-
- NOTE: subclasses must override {@link #next(Token)}. It's
- also OK to instead override {@link #next()} but that
- method is now deprecated in favor of {@link #next(Token)}.
- */
+import org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl;
+import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.util.Attribute;
+import org.apache.lucene.util.AttributeFactory;
+import org.apache.lucene.util.AttributeImpl;
+import org.apache.lucene.util.AttributeSource;
-public abstract class TokenStream {
+/**
+ * A TokenStream enumerates the sequence of tokens, either from
+ * {@link Field}s of a {@link Document} or from query text.
+ *
+ * This is an abstract class; concrete subclasses are:
+ *
+ * {@link Tokenizer}, a TokenStream whose input is a Reader; and
+ * {@link TokenFilter}, a TokenStream whose input is another TokenStream.
+ *
+ * A new TokenStream API has been introduced with Lucene 2.9. This API
+ * has moved from being {@link Token}-based to {@link Attribute}-based. While
+ * {@link Token} still exists in 2.9 as a convenience class, the preferred way
+ * to store the information of a {@link Token} is to use {@link AttributeImpl}s.
+ *
+ * TokenStream now extends {@link AttributeSource}, which provides
+ * access to all of the token {@link Attribute}s for the TokenStream.
+ * Note that only one instance per {@link AttributeImpl} is created and reused
+ * for every token. This approach reduces object creation and allows local
+ * caching of references to the {@link AttributeImpl}s. See
+ * {@link #incrementToken()} for further details.
+ *
+ * The workflow of the new TokenStream API is as follows:
+ *
+ * Instantiation of TokenStream/{@link TokenFilter}s which add/get
+ * attributes to/from the {@link AttributeSource}.
+ * The consumer calls {@link TokenStream#reset()}.
+ * The consumer retrieves attributes from the stream and stores local
+ * references to all attributes it wants to access.
+ * The consumer calls {@link #incrementToken()} until it returns false,
+ * consuming the attributes after each call.
+ * The consumer calls {@link #end()} so that any end-of-stream operations
+ * can be performed.
+ * The consumer calls {@link #close()} to release any resource when finished
+ * using the TokenStream.
+ *
+ * To make sure that filters and consumers know which attributes are available,
+ * the attributes must be added during instantiation. Filters and consumers are
+ * not required to check for availability of attributes in
+ * {@link #incrementToken()}.
+ *
+ * You can find some example code for the new API in the analysis package level
+ * Javadoc.
+ *
+ * Sometimes it is desirable to capture a current state of a TokenStream,
+ * e.g., for buffering purposes (see {@link CachingTokenFilter},
+ * TeeSinkTokenFilter). For this use case
+ * {@link AttributeSource#captureState} and {@link AttributeSource#restoreState}
+ * can be used.
+ *
+ * The {@code TokenStream}-API in Lucene is based on the decorator pattern.
+ * Therefore all non-abstract subclasses must be final or have at least a final
+ * implementation of {@link #incrementToken}! This is checked when Java
+ * assertions are enabled.
+ */
+public abstract class TokenStream extends AttributeSource implements Closeable {
+
+ /** Default {@link AttributeFactory} instance that should be used for TokenStreams. */
+ public static final AttributeFactory DEFAULT_TOKEN_ATTRIBUTE_FACTORY =
+ AttributeFactory.getStaticImplementation(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, PackedTokenAttributeImpl.class);
- /** Returns the next token in the stream, or null at EOS.
- * @deprecated The returned Token is a "full private copy" (not
- * re-used across calls to next()) but will be slower
- * than calling {@link #next(Token)} instead.. */
- public Token next() throws IOException {
- final Token reusableToken = new Token();
- Token nextToken = next(reusableToken);
-
- if (nextToken != null) {
- Payload p = nextToken.getPayload();
- if (p != null) {
- nextToken.setPayload((Payload) p.clone());
- }
+ /**
+ * A TokenStream using the default attribute factory.
+ */
+ protected TokenStream() {
+ super(DEFAULT_TOKEN_ATTRIBUTE_FACTORY);
+ assert assertFinal();
+ }
+
+ /**
+ * A TokenStream that uses the same attributes as the supplied one.
+ */
+ protected TokenStream(AttributeSource input) {
+ super(input);
+ assert assertFinal();
+ }
+
+ /**
+ * A TokenStream using the supplied AttributeFactory for creating new {@link Attribute} instances.
+ */
+ protected TokenStream(AttributeFactory factory) {
+ super(factory);
+ assert assertFinal();
+ }
+
+ private boolean assertFinal() {
+ try {
+ final Class<?> clazz = getClass();
+ if (!clazz.desiredAssertionStatus())
+ return true;
+ assert clazz.isAnonymousClass() ||
+ (clazz.getModifiers() & (Modifier.FINAL | Modifier.PRIVATE)) != 0 ||
+ Modifier.isFinal(clazz.getMethod("incrementToken").getModifiers()) :
+ "TokenStream implementation classes or at least their incrementToken() implementation must be final";
+ return true;
+ } catch (NoSuchMethodException nsme) {
+ return false;
}
-
- return nextToken;
}
-
- /** Returns the next token in the stream, or null at EOS.
- * When possible, the input Token should be used as the
- * returned Token (this gives fastest tokenization
- * performance), but this is not required and a new Token
- * may be returned. Callers may re-use a single Token
- * instance for successive calls to this method.
- *
- * This implicitly defines a "contract" between
- * consumers (callers of this method) and
- * producers (implementations of this method
- * that are the source for tokens):
- *
- * A consumer must fully consume the previously
- * returned Token before calling this method again.
- * A producer must call {@link Token#clear()}
- * before setting the fields in it & returning it
- *
- * Also, the producer must make no assumptions about a
- * Token after it has been returned: the caller may
- * arbitrarily change it. If the producer needs to hold
- * onto the token for subsequent calls, it must clone()
- * it before storing it.
- * Note that a {@link TokenFilter} is considered a consumer.
- * @param reusableToken a Token that may or may not be used to
- * return; this parameter should never be null (the callee
- * is not required to check for null before using it, but it is a
- * good idea to assert that it is not null.)
- * @return next token in the stream or null if end-of-stream was hit
+
+ /**
+ * Consumers (i.e., {@link IndexWriter}) use this method to advance the stream to
+ * the next token. Implementing classes must implement this method and update
+ * the appropriate {@link AttributeImpl}s with the attributes of the next
+ * token.
+ *
+ * The producer must make no assumptions about the attributes after the method
+ * has returned: the caller may arbitrarily change them. If the producer
+ * needs to preserve the state for subsequent calls, it can use
+ * {@link #captureState} to create a copy of the current attribute state.
+ *
+ * This method is called for every token of a document, so an efficient
+ * implementation is crucial for good performance. To avoid calls to
+ * {@link #addAttribute(Class)} and {@link #getAttribute(Class)},
+ * references to all {@link AttributeImpl}s that this stream uses should be
+ * retrieved during instantiation.
+ *
+ * To ensure that filters and consumers know which attributes are available,
+ * the attributes must be added during instantiation. Filters and consumers
+ * are not required to check for availability of attributes in
+ * {@link #incrementToken()}.
+ *
+ * @return false for end of stream; true otherwise
*/
- public Token next(final Token reusableToken) throws IOException {
- // We don't actually use inputToken, but still add this assert
- assert reusableToken != null;
- return next();
+ public abstract boolean incrementToken() throws IOException;
+
+ /**
+ * This method is called by the consumer after the last token has been
+ * consumed, after {@link #incrementToken()} returned false
+ * (using the new TokenStream
API). Streams implementing the old API
+ * should upgrade to use this feature.
+ *
+ * This method can be used to perform any end-of-stream operations, such as
+ * setting the final offset of a stream. The final offset of a stream might
+ * differ from the offset of the last token eg in case one or more whitespaces
+ * followed after the last token, but a WhitespaceTokenizer was used.
+ *
+ * Additionally any skipped positions (such as those removed by a stopfilter)
+ * can be applied to the position increment, or any adjustment of other
+ * attributes where the end-of-stream value may be important.
+ *
+ * If you override this method, always call {@code super.end()}.
+ *
+ * @throws IOException If an I/O error occurs
+ */
+ public void end() throws IOException {
+ clearAttributes(); // LUCENE-3849: don't consume dirty atts
+ PositionIncrementAttribute posIncAtt = getAttribute(PositionIncrementAttribute.class);
+ if (posIncAtt != null) {
+ posIncAtt.setPositionIncrement(0);
+ }
}
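As a toy illustration of the incrementToken()/end() contract described above (plain Java, no Lucene dependency; the class and field names here are invented for the sketch, not Lucene API):

```java
// Toy analogue of the TokenStream contract: incrementToken() advances,
// end() reports the final offset past the last token (trailing whitespace
// included). Names are invented for this sketch; this is not Lucene API.
public class ToyWhitespaceStream {
    private final String text;
    private int pos = 0;
    public int startOffset, endOffset, finalOffset;

    public ToyWhitespaceStream(String text) { this.text = text; }

    /** Returns false at end of stream, true otherwise. */
    public boolean incrementToken() {
        while (pos < text.length() && text.charAt(pos) == ' ') pos++;
        if (pos >= text.length()) return false;
        startOffset = pos;
        while (pos < text.length() && text.charAt(pos) != ' ') pos++;
        endOffset = pos;
        return true;
    }

    /** Called once by the consumer after incrementToken() returns false. */
    public void end() {
        finalOffset = text.length(); // may exceed the last token's endOffset
    }
}
```

With input "sky  " the last token ends at offset 3, but end() reports a final offset of 5, mirroring the WhitespaceTokenizer case mentioned above.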
- /** Resets this stream to the beginning. This is an
- * optional operation, so subclasses may or may not
- * implement this method. Reset() is not needed for
- * the standard indexing process. However, if the Tokens
- * of a TokenStream are intended to be consumed more than
- * once, it is necessary to implement reset(). Note that
- * if your TokenStream caches tokens and feeds them back
- * again after a reset, it is imperative that you
- * clone the tokens when you store them away (on the
- * first pass) as well as when you return them (on future
- * passes after reset()).
+ /**
+ * This method is called by a consumer before it begins consumption using
+ * {@link #incrementToken()}.
+ *
+ * Resets this stream to a clean state. Stateful implementations must implement
+ * this method so that they can be reused, just as if they had been created fresh.
+ *
+ * If you override this method, always call {@code super.reset()}, otherwise
+ * some internal state will not be correctly reset (e.g., {@link Tokenizer} will
+ * throw {@link IllegalStateException} on further usage).
*/
public void reset() throws IOException {}
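The reset() contract can be pictured with a toy stateful stream (plain Java sketch with invented names, not Lucene API): after reset(), the stream must behave as if freshly constructed so it can be consumed again.

```java
// Toy sketch of the reset() contract: a stateful stream must return to a
// clean state so it can be reused. Invented names, not Lucene API.
public class ToyCountingStream {
    private final int tokenCount;
    private int emitted = 0;

    public ToyCountingStream(int tokenCount) { this.tokenCount = tokenCount; }

    public boolean incrementToken() { return emitted++ < tokenCount; }

    /** Rewind all internal state, as if freshly constructed. */
    public void reset() { emitted = 0; }
}
```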
- /** Releases resources associated with this stream. */
+ /** Releases resources associated with this stream.
+ *
+ * If you override this method, always call {@code super.close()}, otherwise
+ * some internal state will not be correctly reset (e.g., {@link Tokenizer} will
+ * throw {@link IllegalStateException} on reuse).
+ */
+ @Override
public void close() throws IOException {}
+
}
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/TokenStreamToAutomaton.java'.
Fisheye: No comparison available. Pass `N' to diff?
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/Tokenizer.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/Tokenizer.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/Tokenizer.java 17 Aug 2012 14:55:08 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/Tokenizer.java 16 Dec 2014 11:31:57 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,43 +17,104 @@
* limitations under the License.
*/
+import org.apache.lucene.util.AttributeFactory;
+import org.apache.lucene.util.AttributeSource;
+
import java.io.Reader;
import java.io.IOException;
/** A Tokenizer is a TokenStream whose input is a Reader.
- This is an abstract class.
+ This is an abstract class; subclasses must override {@link #incrementToken()}
- NOTE: subclasses must override {@link #next(Token)}. It's
- also OK to instead override {@link #next()} but that
- method is now deprecated in favor of {@link #next(Token)}.
-
- NOTE: subclasses overriding {@link #next(Token)} must
- call {@link Token#clear()}.
+ NOTE: Subclasses overriding {@link #incrementToken()} must
+ call {@link AttributeSource#clearAttributes()} before
+ setting attributes.
*/
-
-public abstract class Tokenizer extends TokenStream {
+public abstract class Tokenizer extends TokenStream {
/** The text source for this Tokenizer. */
- protected Reader input;
+ protected Reader input = ILLEGAL_STATE_READER;
+
+ /** Pending reader: not actually assigned to input until reset() */
+ private Reader inputPending = ILLEGAL_STATE_READER;
- /** Construct a tokenizer with null input. */
- protected Tokenizer() {}
-
/** Construct a token stream processing the given input. */
protected Tokenizer(Reader input) {
- this.input = input;
+ if (input == null) {
+ throw new NullPointerException("input must not be null");
+ }
+ this.inputPending = input;
}
+
+ /** Construct a token stream processing the given input using the given AttributeFactory. */
+ protected Tokenizer(AttributeFactory factory, Reader input) {
+ super(factory);
+ if (input == null) {
+ throw new NullPointerException("input must not be null");
+ }
+ this.inputPending = input;
+ }
- /** By default, closes the input Reader. */
+ /**
+ * {@inheritDoc}
+ *
+ * NOTE:
+ * The default implementation closes the input Reader, so
+ * be sure to call super.close() when overriding this method.
+ */
+ @Override
public void close() throws IOException {
input.close();
+ // LUCENE-2387: don't hold onto Reader after close, so
+ // GC can reclaim
+ inputPending = input = ILLEGAL_STATE_READER;
}
+
+ /** Return the corrected offset. If {@link #input} is a {@link CharFilter} subclass
+ * this method calls {@link CharFilter#correctOffset}, else returns currentOff.
+ * @param currentOff offset as seen in the output
+ * @return corrected offset based on the input
+ * @see CharFilter#correctOffset
+ */
+ protected final int correctOffset(int currentOff) {
+ return (input instanceof CharFilter) ? ((CharFilter) input).correctOffset(currentOff) : currentOff;
+ }
- /** Expert: Reset the tokenizer to a new reader. Typically, an
- * analyzer (in its reusableTokenStream method) will use
+ /** Expert: Set a new reader on the Tokenizer. Typically, an
+ * analyzer (in its tokenStream method) will use
* this to re-use a previously created tokenizer. */
- public void reset(Reader input) throws IOException {
- this.input = input;
+ public final void setReader(Reader input) throws IOException {
+ if (input == null) {
+ throw new NullPointerException("input must not be null");
+ } else if (this.input != ILLEGAL_STATE_READER) {
+ throw new IllegalStateException("TokenStream contract violation: close() call missing");
+ }
+ this.inputPending = input;
+ assert setReaderTestPoint();
}
+
+ @Override
+ public void reset() throws IOException {
+ super.reset();
+ input = inputPending;
+ inputPending = ILLEGAL_STATE_READER;
+ }
+
+ // only used by assert, for testing
+ boolean setReaderTestPoint() {
+ return true;
+ }
+
+ private static final Reader ILLEGAL_STATE_READER = new Reader() {
+ @Override
+ public int read(char[] cbuf, int off, int len) {
+ throw new IllegalStateException("TokenStream contract violation: reset()/close() call missing, " +
+ "reset() called multiple times, or subclass does not call super.reset(). " +
+ "Please see Javadocs of TokenStream class for more information about the correct consuming workflow.");
+ }
+
+ @Override
+ public void close() {}
+ };
}
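The pending-reader state machine in the Tokenizer code above can be sketched in a few lines of plain Java (invented names, a String standing in for the Reader; not Lucene API): setReader() only stages the input, reset() makes it active, and reading without reset() (or reusing without close()) fails fast.

```java
// Toy sketch of Tokenizer's setReader()/reset()/close() state machine.
// Invented names; a String stands in for java.io.Reader.
public class ToyReaderHolder {
    private String active;   // null = no active input (illegal to read)
    private String pending;  // staged by setReader(), activated by reset()

    public void setReader(String input) {
        if (input == null) throw new NullPointerException("input must not be null");
        if (active != null) throw new IllegalStateException("close() call missing");
        pending = input;
    }

    public void reset() {
        active = pending;
        pending = null;
    }

    public String read() {
        if (active == null)
            throw new IllegalStateException("reset()/close() call missing");
        return active;
    }

    public void close() {
        active = null;
        pending = null;
    }
}
```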
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/WhitespaceAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/WhitespaceTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/WordlistLoader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/package.html
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/package.html,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/package.html 17 Aug 2012 14:55:08 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/package.html 16 Dec 2014 11:31:58 -0000 1.1.2.1
@@ -18,13 +18,12 @@
-
API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.
Parsing? Tokenization? Analysis!
-Lucene, indexing and search library, accepts only plain text input.
+Lucene, an indexing and search library, accepts only plain text input.
Parsing
@@ -34,20 +33,29 @@
Tokenization
-Plain text passed to Lucene for indexing goes through a process generally called tokenization – namely breaking of the
-input text into small indexing elements –
-{@link org.apache.lucene.analysis.Token Tokens}.
-The way input text is broken into tokens very
-much dictates further capabilities of search upon that text.
+Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process
+of breaking input text into small indexing elements – tokens.
+The way input text is broken into tokens heavily influences how people will then be able to search for that text.
For instance, sentence beginnings and endings can be identified to provide for more accurate phrase
and proximity searches (though sentence identification is not provided by Lucene).
-In some cases simply breaking the input text into tokens is not enough – a deeper Analysis is needed,
-providing for several functions, including (but not limited to):
+ In some cases simply breaking the input text into tokens is not enough
+ – a deeper Analysis may be needed. Lucene includes both
+ pre- and post-tokenization analysis facilities.
+
+
+ Pre-tokenization analysis can include (but is not limited to) stripping
+ HTML markup, and transforming or removing text matching arbitrary patterns
+ or sets of fixed strings.
+
+
+ There are many post-tokenization steps that can be done, including
+ (but not limited to):
+
+
+
+ {@link org.apache.lucene.analysis.Analyzer} – An Analyzer is
+ responsible for building a
+ {@link org.apache.lucene.analysis.TokenStream} which can be consumed
+ by the indexing and searching processes. See below for more information
+ on implementing your own Analyzer.
+
+
+ CharFilter – CharFilter extends
+ {@link java.io.Reader} to perform pre-tokenization substitutions,
+ deletions, and/or insertions on an input Reader's text, while providing
+ corrected character offsets to account for these modifications. This
+ capability allows highlighting to function over the original text when
+ indexed tokens are created from CharFilter-modified text with offsets
+ that are not the same as those in the original text. Tokenizers'
+ constructors and reset() methods accept a CharFilter. CharFilters may
+ be chained to perform multiple pre-tokenization modifications.
+
+
+ {@link org.apache.lucene.analysis.Tokenizer} – A Tokenizer is a
+ {@link org.apache.lucene.analysis.TokenStream} and is responsible for
+ breaking up incoming text into tokens. In most cases, an Analyzer will
+ use a Tokenizer as the first step in the analysis process. However,
+ to modify text prior to tokenization, use a CharFilter subclass (see
+ above).
+
+
+ {@link org.apache.lucene.analysis.TokenFilter} – A TokenFilter is
+ also a {@link org.apache.lucene.analysis.TokenStream} and is responsible
+ for modifying tokens that have been created by the Tokenizer. Common
+ modifications performed by a TokenFilter are: deletion, stemming, synonym
+ injection, and down casing. Not all Analyzers require TokenFilters.
+
+
Hints, Tips and Traps
- The synergy between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer}
- is sometimes confusing. To ease on this confusion, some clarifications:
-
- The {@link org.apache.lucene.analysis.Analyzer} is responsible for the entire task of
- creating tokens out of the input text, while the {@link org.apache.lucene.analysis.Tokenizer}
- is only responsible for breaking the input text into tokens. Very likely, tokens created
- by the {@link org.apache.lucene.analysis.Tokenizer} would be modified or even omitted
- by the {@link org.apache.lucene.analysis.Analyzer} (via one or more
- {@link org.apache.lucene.analysis.TokenFilter}s) before being returned.
-
- {@link org.apache.lucene.analysis.Tokenizer} is a {@link org.apache.lucene.analysis.TokenStream},
- but {@link org.apache.lucene.analysis.Analyzer} is not.
-
- {@link org.apache.lucene.analysis.Analyzer} is "field aware", but
- {@link org.apache.lucene.analysis.Tokenizer} is not.
-
-
+ The synergy between {@link org.apache.lucene.analysis.Analyzer} and
+ {@link org.apache.lucene.analysis.Tokenizer} is sometimes confusing. To ease
+ this confusion, some clarifications:
+
+
+ The {@link org.apache.lucene.analysis.Analyzer} is responsible for the entire task of
+ creating tokens out of the input text, while the {@link org.apache.lucene.analysis.Tokenizer}
+ is only responsible for breaking the input text into tokens. Very likely, tokens created
+ by the {@link org.apache.lucene.analysis.Tokenizer} would be modified or even omitted
+ by the {@link org.apache.lucene.analysis.Analyzer} (via one or more
+ {@link org.apache.lucene.analysis.TokenFilter}s) before being returned.
+
+
+ {@link org.apache.lucene.analysis.Tokenizer} is a {@link org.apache.lucene.analysis.TokenStream},
+ but {@link org.apache.lucene.analysis.Analyzer} is not.
+
+
+ {@link org.apache.lucene.analysis.Analyzer} is "field aware", but
+ {@link org.apache.lucene.analysis.Tokenizer} is not.
+
+
- Lucene Java provides a number of analysis capabilities, the most commonly used one being the {@link
- org.apache.lucene.analysis.standard.StandardAnalyzer}. Many applications will have a long and industrious life with nothing more
+ Lucene Java provides a number of analysis capabilities, the most commonly used one being the StandardAnalyzer.
+ Many applications will have a long and industrious life with nothing more
than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning:
-
- {@link org.apache.lucene.analysis.PerFieldAnalyzerWrapper} – Most Analyzers perform the same operation on all
- {@link org.apache.lucene.document.Field}s. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different
- {@link org.apache.lucene.document.Field}s.
- The contrib/analyzers library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety
- of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.
- The contrib/snowball library
- located at the root of the Lucene distribution has Analyzer and TokenFilter
- implementations for a variety of Snowball stemmers.
- See http://snowball.tartarus.org
- for more information on Snowball stemmers.
- There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around, chances are someone has implemented what you need.
-
+
+
+ PerFieldAnalyzerWrapper – Most Analyzers perform the same operation on all
+ {@link org.apache.lucene.document.Field}s. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different
+ {@link org.apache.lucene.document.Field}s.
+
+
+ The analysis library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety
+ of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.
+
+
+ There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around, chances are someone has implemented what you need.
+
+
Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze the slower the indexing (in most cases).
- Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer} combined with a
- {@link org.apache.lucene.analysis.StopFilter}. The contrib/benchmark library can be useful for testing out the speed of the analysis process.
+ Perhaps your application would be just fine using the simple WhitespaceTokenizer combined with a StopFilter. The benchmark/ library can be useful
+ for testing out the speed of the analysis process.
Invoking the Analyzer
Applications usually do not invoke analysis – Lucene does it for them:
-
- At indexing, as a consequence of
- {@link org.apache.lucene.index.IndexWriter#addDocument(org.apache.lucene.document.Document) addDocument(doc)},
- the Analyzer in effect for indexing is invoked for each indexed field of the added document.
-
- At search, as a consequence of
- {@link org.apache.lucene.queryParser.QueryParser#parse(java.lang.String) QueryParser.parse(queryText)},
- the QueryParser may invoke the Analyzer in effect.
- Note that for some queries analysis does not take place, e.g. wildcard queries.
-
-
+
+
+
+ At indexing, as a consequence of
+ {@link org.apache.lucene.index.IndexWriter#addDocument(Iterable) addDocument(doc)},
+ the Analyzer in effect for indexing is invoked for each indexed field of the added document.
+
+
+ At search, a QueryParser may invoke the Analyzer during parsing. Note that for some queries, analysis does not
+ take place, e.g. wildcard queries.
+
+
+
However an application might invoke Analysis of any text for testing or for any other purpose, something like:
-
- Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
- TokenStream ts = analyzer.tokenStream("myfield",new StringReader("some text goes here"));
- Token t = ts.next();
- while (t!=null) {
- System.out.println("token: "+t));
- t = ts.next();
- }
-
+
+ Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
+ Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
+ TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
+ OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
+
+ try {
+ ts.reset(); // Resets this stream to the beginning. (Required)
+ while (ts.incrementToken()) {
+ // Use {@link org.apache.lucene.util.AttributeSource#reflectAsString(boolean)}
+ // for token stream debugging.
+ System.out.println("token: " + ts.reflectAsString(true));
+
+ System.out.println("token start offset: " + offsetAtt.startOffset());
+ System.out.println(" token end offset: " + offsetAtt.endOffset());
+ }
+ ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
+ } finally {
+ ts.close(); // Release resources associated with this stream.
+ }
+
Indexing Analysis vs. Search Analysis
Selecting the "correct" analyzer is crucial
@@ -170,18 +223,25 @@
Implementing your own Analyzer
-Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer
-or creating both the Analyzer and a Tokenizer or TokenFilter. Before pursuing this approach, you may find it worthwhile
-to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists.
-If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter) have a look at
-the source code of any one of the many samples located in this package.
+
+ Creating your own Analyzer is straightforward. Your Analyzer can wrap
+ existing analysis components — CharFilter(s) (optional), a
+ Tokenizer, and TokenFilter(s) (optional) — or components you
+ create, or a combination of existing and newly created components. Before
+ pursuing this approach, you may find it worthwhile to explore the
+ analyzers-common library and/or ask on the
+ java-user@lucene.apache.org mailing list first to see if what you
+ need already exists. If you are still committed to creating your own
+ Analyzer, have a look at the source code of any one of the many samples
+ located in this package.
The following sections discuss some aspects of implementing your own analyzer.
-Field Section Boundaries
+Field Section Boundaries
- When {@link org.apache.lucene.document.Document#add(org.apache.lucene.document.Fieldable) document.add(field)}
+ When {@link org.apache.lucene.document.Document#add(org.apache.lucene.index.IndexableField) document.add(field)}
is called multiple times for the same field name, we could say that each such call creates a new
section for that field in that document.
In fact, a separate call to
@@ -191,82 +251,722 @@
This allows phrase search and proximity search to seamlessly cross
boundaries between these "sections".
In other words, if a certain field "f" is added like this:
-
- document.add(new Field("f","first ends",...);
- document.add(new Field("f","starts two",...);
- indexWriter.addDocument(document);
-
+
+
+ document.add(new Field("f","first ends",...);
+ document.add(new Field("f","starts two",...);
+ indexWriter.addDocument(document);
+
+
Then, a phrase search for "ends starts" would find that document.
Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections",
simply by overriding
{@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap(java.lang.String) Analyzer.getPositionIncrementGap(fieldName)}:
-
- Analyzer myAnalyzer = new StandardAnalyzer() {
- public int getPositionIncrementGap(String fieldName) {
- return 10;
- }
- };
-
-Token Position Increments
+
+ Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
+ Analyzer myAnalyzer = new StandardAnalyzer(matchVersion) {
+ public int getPositionIncrementGap(String fieldName) {
+ return 10;
+ }
+ };
+
+Token Position Increments
By default, all tokens created by Analyzers and Tokenizers have a
- {@link org.apache.lucene.analysis.Token#getPositionIncrement() position increment} of one.
+ {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#getPositionIncrement() position increment} of one.
This means that the position stored for that token in the index would be one more than
that of the previous token.
Recall that phrase and proximity searches rely on position info.
If the selected analyzer filters the stop words "is" and "the", then for a document
containing the string "blue is the sky", only the tokens "blue", "sky" are indexed,
- with position("sky") = 1 + position("blue"). Now, a phrase query "blue is the sky"
+ with position("sky") = 3 + position("blue"). Now, a phrase query "blue is the sky"
would find that document, because the same analyzer filters the same stop words from
- that query. But also the phrase query "blue sky" would find that document.
+ that query. But the phrase query "blue sky" would not find that document, because in
+ that query the position increment between "blue" and "sky" is only 1, while in the
+ indexed document it is 3.
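The position arithmetic for "blue is the sky" can be checked with a few lines of plain Java (a sketch, not Lucene code): each removed stop word carries its increment over to the next surviving token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of stop-word position accounting: removed tokens contribute their
// increment to the next kept token, so "sky" lands at position(blue) + 3.
// Plain Java, not Lucene API.
public class StopPositionDemo {
    /** Returns the position of each kept token after stop-word removal. */
    public static int[] positions(String text, Set<String> stopWords) {
        List<Integer> kept = new ArrayList<>();
        int position = -1, increment = 1;
        for (String term : text.split(" ")) {
            if (stopWords.contains(term)) {
                increment++;          // carry the removed token's increment
                continue;
            }
            position += increment;
            increment = 1;
            kept.add(position);
        }
        int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++) result[i] = kept.get(i);
        return result;
    }
}
```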
- If this behavior does not fit the application needs,
- a modified analyzer can be used, that would increment further the positions of
- tokens following a removed stop word, using
- {@link org.apache.lucene.analysis.Token#setPositionIncrement(int)}.
- This can be done with something like:
-
- public TokenStream tokenStream(final String fieldName, Reader reader) {
- final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
- TokenStream res = new TokenStream() {
- public Token next() throws IOException {
- int extraIncrement = 0;
- while (true) {
- Token t = ts.next();
- if (t!=null) {
- if (stopWords.contains(t.termText())) {
- extraIncrement++; // filter this word
- continue;
- }
- if (extraIncrement>0) {
- t.setPositionIncrement(t.getPositionIncrement()+extraIncrement);
- }
- }
- return t;
+ If this behavior does not fit the application needs, the query parser needs to be
+ configured to not take position increments into account when generating phrase queries.
+
+
+ Note that a StopFilter MUST increment the position increment in order not to generate corrupt
+ tokenstream graphs. Here is the logic used by StopFilter to increment positions when filtering out tokens:
+
+
+ public TokenStream tokenStream(final String fieldName, Reader reader) {
+ final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
+ TokenStream res = new TokenStream() {
+ CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+ PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
+
+ public boolean incrementToken() throws IOException {
+ int extraIncrement = 0;
+ while (true) {
+ boolean hasNext = ts.incrementToken();
+ if (hasNext) {
+ if (stopWords.contains(termAtt.toString())) {
+ extraIncrement += posIncrAtt.getPositionIncrement(); // filter this word
+ continue;
+ }
+ if (extraIncrement>0) {
+ posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement()+extraIncrement);
}
}
- };
- return res;
+ return hasNext;
+ }
}
-
- Now, with this modified analyzer, the phrase query "blue sky" would find that document.
- But note that this is yet not a perfect solution, because any phrase query "blue w1 w2 sky"
- where both w1 and w2 are stop words would match that document.
+ };
+ return res;
+ }
+
+
+ A few more use cases for modifying position increments are:
+
+ Inhibiting phrase and proximity matches in sentence boundaries – for this, a tokenizer that
+ identifies a new sentence can add 1 to the position increment of the first token of the new sentence.
+ Injecting synonyms – here, synonyms of a token should be added after that token,
+ and their position increment should be set to 0.
+ As a result, all synonyms of a token would be considered to appear in exactly the
+ same position as that token, and so would be seen by phrase and proximity searches.
+
+
+Token Position Length
- Few more use cases for modifying position increments are:
-
- Inhibiting phrase and proximity matches in sentence boundaries – for this, a tokenizer that
- identifies a new sentence can add 1 to the position increment of the first token of the new sentence.
- Injecting synonyms – here, synonyms of a token should be added after that token,
- and their position increment should be set to 0.
- As result, all synonyms of a token would be considered to appear in exactly the
- same position as that token, and so would they be seen by phrase and proximity searches.
-
+ By default, all tokens created by Analyzers and Tokenizers have a
+ {@link org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#getPositionLength() position length} of one.
+ This means that the token occupies a single position. This attribute is not indexed
+ and thus not taken into account for positional queries, but is used by e.g. suggesters.
+
+ The main use case for position lengths is multi-word synonyms. With single-word
+ synonyms, setting the position increment to 0 is enough to denote the fact that two
+ words are synonyms, for example:
+
+
+Term red magenta
+Position increment 1 0
+
+
+ Given that position(magenta) = 0 + position(red), they are at the same position, so anything
+ working with analyzers will return the exact same result if you replace "magenta" with "red"
+ in the input. However, multi-word synonyms are more tricky. Let's say that you want to build
+ a TokenStream where "IBM" is a synonym of "International Business Machines". Position increments
+ are not enough anymore:
+
+
+Term IBM International Business Machines
+Position increment 1 0 1 1
+
+
+ The problem with this token stream is that "IBM" is at the same position as "International"
+ although it is a synonym with "International Business Machines" as a whole. Setting
+ the position increment of "Business" and "Machines" to 0 wouldn't help as it would mean
+ that "International" is a synonym of "Business". The only way to solve this issue is to
+ make "IBM" span across 3 positions; this is where position lengths come to the rescue.
+
+
+Term IBM International Business Machines
+Position increment 1 0 1 1
+Position length 3 1 1 1
+
+
+ This new attribute makes clear that "IBM" and "International Business Machines" start and end
+ at the same positions.
+
+
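The two tables above can be encoded directly (a plain-Java sketch, not Lucene API): given (positionIncrement, positionLength) pairs, "IBM" and the phrase "International Business Machines" start and end at the same positions.

```java
// Sketch: compute start/end positions from (positionIncrement, positionLength)
// pairs for the IBM example above. Plain Java, not Lucene API.
public class PositionLengthDemo {
    /** Returns {start, end} for each token, given {increment, length} pairs. */
    public static int[][] span(int[][] tokens) {
        int[][] spans = new int[tokens.length][2];
        int position = -1;
        for (int i = 0; i < tokens.length; i++) {
            position += tokens[i][0];              // advance by the increment
            spans[i][0] = position;                // start position
            spans[i][1] = position + tokens[i][1]; // end = start + length
        }
        return spans;
    }
}
```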
+How to not write corrupt token streams
+
+ There are a few rules to observe when writing custom Tokenizers and TokenFilters:
+
+
+ The first position increment must be > 0.
+ Positions must not go backward.
+ Tokens that have the same start position must have the same start offset.
+ Tokens that have the same end position (taking into account the
+ position length) must have the same end offset.
+ Tokenizers must call {@link
+ org.apache.lucene.util.AttributeSource#clearAttributes()} in
+ incrementToken().
+ Tokenizers must override {@link
+ org.apache.lucene.analysis.TokenStream#end()}, and pass the final
+ offset (the total number of input characters processed) to both
+ parameters of {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute#setOffset(int, int)}.
+
+
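Two of the rules above lend themselves to a small checker (an illustrative plain-Java sketch, not Lucene's own validation):

```java
// Illustrative checker for two of the rules above: the first position
// increment must be > 0, and positions must never go backward. Each array
// entry is a token's position increment. Plain Java sketch, not Lucene API.
public class TokenGraphChecker {
    public static boolean isValid(int[] increments) {
        if (increments.length == 0) return true;
        if (increments[0] <= 0) return false;    // first increment must be > 0
        for (int i = 1; i < increments.length; i++) {
            if (increments[i] < 0) return false; // positions must not go backward
        }
        return true;
    }
}
```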
+ Although these rules might seem easy to follow, problems can quickly happen when chaining
+ badly implemented filters that play with positions and offsets, such as synonym or n-grams
+ filters. Here are good practices for writing correct filters:
+
+
+ Token filters should not modify offsets. If you feel that your filter would need to modify offsets, then it should probably be implemented as a tokenizer.
+ Token filters should not insert positions. If a filter needs to add tokens, then they should all have a position increment of 0.
+ When they add tokens, token filters should call {@link org.apache.lucene.util.AttributeSource#clearAttributes()} first.
+ When they remove tokens, token filters should increment the position increment of the following token.
+ Token filters should preserve position lengths.
+
+TokenStream API
+
+ "Flexible Indexing" summarizes the effort of making the Lucene indexer
+ pluggable and extensible for custom index formats. A fully customizable
+ indexer means that users will be able to store custom data structures on
+ disk. Therefore an API is necessary that can transport custom types of
+ data from the documents to the indexer.
+
+Attribute and AttributeSource
+
+ Classes {@link org.apache.lucene.util.Attribute} and
+ {@link org.apache.lucene.util.AttributeSource} serve as the basis upon which
+ the analysis elements of "Flexible Indexing" are implemented. An Attribute
+ holds a particular piece of information about a text token. For example,
+ {@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute}
+ contains the term text of a token, and
+ {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} contains
+ the start and end character offsets of a token. An AttributeSource is a
+ collection of Attributes with a restriction: there may be only one instance
+ of each attribute type. TokenStream now extends AttributeSource, which means
+ that one can add Attributes to a TokenStream. Since TokenFilter extends
+ TokenStream, all filters are also AttributeSources.
+
+
+ Lucene provides seven Attributes out of the box:
+
+
+
+ {@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute}
+
+ The term text of a token. Implements {@link java.lang.CharSequence}
+ (providing methods length() and charAt(), and allowing e.g. for direct
+ use with regular expression {@link java.util.regex.Matcher}s) and
+ {@link java.lang.Appendable} (allowing the term text to be appended to).
+
+
+
+ {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute}
+ The start and end offset of a token in characters.
+
+
+ {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}
+ See above for detailed information about position increment.
+
+
+ {@link org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute}
+ The number of positions occupied by a token.
+
+
+ {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute}
+ The payload that a Token can optionally have.
+
+
+ {@link org.apache.lucene.analysis.tokenattributes.TypeAttribute}
+ The type of the token. Default is 'word'.
+
+
+ {@link org.apache.lucene.analysis.tokenattributes.FlagsAttribute}
+ Optional flags a token can have.
+
+
+ {@link org.apache.lucene.analysis.tokenattributes.KeywordAttribute}
+
+ Keyword-aware TokenStreams/-Filters skip modification of tokens that
+ return true from this attribute's isKeyword() method.
+
+
+
+More Requirements for Analysis Component Classes
+Due to the historical development of the API, there are some perhaps
+less-than-obvious requirements for implementing analysis component
+classes.
+Token Stream Lifetime
+The code fragment of the analysis workflow
+protocol above shows a token stream being obtained, used, and then
+left for garbage. However, that does not mean that the components of
+that token stream will, in fact, be discarded. The default is just the
+opposite. {@link org.apache.lucene.analysis.Analyzer} applies a reuse
+strategy to the tokenizer and the token filters. It will reuse
+them. For each new input, it calls {@link org.apache.lucene.analysis.Tokenizer#setReader(java.io.Reader)}
+to set the input. Your components must be prepared for this scenario,
+as described below.
+Tokenizer
+
+
+ You should create your tokenizer class by extending {@link org.apache.lucene.analysis.Tokenizer}.
+
+
+ Your tokenizer must never make direct use of the
+ {@link java.io.Reader} supplied to its constructor(s). (A future
+ release of Apache Lucene may remove the reader parameters from the
+ Tokenizer constructors.)
+ {@link org.apache.lucene.analysis.Tokenizer} wraps the reader in an
+ object that helps enforce that applications comply with the analysis workflow. Thus, your class
+ should only reference the input via the protected 'input' field
+ of Tokenizer.
+
+
+ Your tokenizer must override {@link org.apache.lucene.analysis.TokenStream#end()}.
+ Your implementation must call super.end(). It must set a correct final offset into
+ the offset attribute, and finish up any other attributes to reflect
+ the end of the stream.
+
+
+ If your tokenizer overrides {@link org.apache.lucene.analysis.TokenStream#reset()}
+ or {@link org.apache.lucene.analysis.TokenStream#close()}, it
+ must call the corresponding superclass method.
+
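Taken together, the requirements above can be sketched as a deliberately trivial, hypothetical tokenizer that emits one token per character. SingleCharTokenizer is not part of Lucene (a real tokenizer would usually extend CharTokenizer instead); it is shown only to make the contract concrete: the reader is touched only through the protected 'input' field, and end() and reset() call their superclass methods.

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class SingleCharTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int pos = 0;

  public SingleCharTokenizer(Reader reader) {
    super(reader);
  }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    int c = input.read();            // only ever touch the protected 'input'
    if (c == -1) {
      return false;
    }
    termAtt.setEmpty().append((char) c);
    offsetAtt.setOffset(correctOffset(pos), correctOffset(pos + 1));
    pos++;
    return true;
  }

  @Override
  public void end() throws IOException {
    super.end();                     // mandatory
    int finalOffset = correctOffset(pos);
    offsetAtt.setOffset(finalOffset, finalOffset);  // correct final offset
  }

  @Override
  public void reset() throws IOException {
    super.reset();                   // mandatory: prepares 'input' for reuse
    pos = 0;
  }
}
```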
+
+Token Filter
+ You should create your token filter class by extending {@link org.apache.lucene.analysis.TokenFilter}.
+ If your token filter overrides {@link org.apache.lucene.analysis.TokenStream#reset()},
+ {@link org.apache.lucene.analysis.TokenStream#end()}
+ or {@link org.apache.lucene.analysis.TokenStream#close()}, it
+ must call the corresponding superclass method.
+Creating delegates
+ Forwarding classes (those which extend {@link org.apache.lucene.analysis.Tokenizer} but delegate
+ selected logic to another tokenizer) must also set the reader to the delegate in the overridden
+ {@link org.apache.lucene.analysis.Tokenizer#reset()} method, e.g.:
+
+ public class ForwardingTokenizer extends Tokenizer {
+ private Tokenizer delegate;
+ ...
+ {@literal @Override}
+ public void reset() {
+ super.reset();
+ delegate.setReader(this.input);
+ delegate.reset();
+ }
+ }
+
+Testing Your Analysis Component
+
+ The lucene-test-framework component defines
+ BaseTokenStreamTestCase. By extending
+ this class, you can create JUnit tests that validate that your
+ Analyzer and/or analysis components correctly implement the
+ protocol. The checkRandomData methods of that class are particularly effective at flushing out errors.
+
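A minimal sketch of such a test, assuming JUnit and lucene-test-framework are on the classpath and reusing the MyAnalyzer class developed in the example later in this document:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;

public class TestMyAnalyzer extends BaseTokenStreamTestCase {

  public void testSimple() throws Exception {
    Analyzer a = new MyAnalyzer(TEST_VERSION_CURRENT);
    // Checks the produced tokens, and that the stream obeys the workflow protocol.
    assertAnalyzesTo(a, "This is a demo of the TokenStream API",
        new String[] { "This", "is", "a", "demo", "of", "the", "TokenStream", "API" });
  }

  public void testRandomStrings() throws Exception {
    Analyzer a = new MyAnalyzer(TEST_VERSION_CURRENT);
    // Hammers the analyzer with random text to flush out contract violations.
    checkRandomData(random(), a, 1000);
  }
}
```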
+Using the TokenStream API
+There are a few important things to know in order to use the new API efficiently which are summarized here. You may want
+to walk through the example below first and come back to this section afterwards.
+
+Please keep in mind that an AttributeSource can only have one instance of a particular Attribute. Furthermore, if
+a chain of a TokenStream and multiple TokenFilters is used, then all TokenFilters in that chain share the Attributes
+with the TokenStream.
+
+
+
+Attribute instances are reused for all tokens of a document. Thus, a TokenStream/-Filter needs to update
+the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the
+Attributes and then calls incrementToken() again until it returns false, which indicates that the end of the stream
+was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in
+the Attribute instances.
+
+
+
+For performance reasons a TokenStream/-Filter should add/get Attributes during instantiation; i.e., create an attribute in the
+constructor and store references to it in an instance variable. Using an instance variable instead of calling addAttribute()/getAttribute()
+in incrementToken() will avoid attribute lookups for every token in the document.
+
+
+
+All methods in AttributeSource are idempotent, which means calling them multiple times always yields the same
+result. This is especially important to know for addAttribute(). The method takes the type (Class)
+of an Attribute as an argument and returns an instance. If an Attribute of the same type was previously added, then
+the already existing instance is returned; otherwise a new instance is created and returned. Therefore TokenStreams/-Filters
+can safely call addAttribute() with the same Attribute type multiple times. Even consumers of TokenStreams should
+normally call addAttribute() instead of getAttribute(), because it will not fail if the TokenStream is missing this
+Attribute (getAttribute() would throw an IllegalArgumentException if the Attribute is missing). More advanced code
+could simply check with hasAttribute() whether a TokenStream has the desired Attribute, and may conditionally leave out processing for
+extra performance.
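The idempotency of addAttribute() can be observed directly. This small sketch again assumes Lucene 4.x, with Version.LUCENE_XY as the usual placeholder:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class IdempotencyDemo {
  public static void main(String[] args) throws Exception {
    WhitespaceTokenizer ts = new WhitespaceTokenizer(
        Version.LUCENE_XY, // substitute the desired Lucene version for XY
        new StringReader("demo"));
    // Two calls with the same type return the very same instance:
    CharTermAttribute first = ts.addAttribute(CharTermAttribute.class);
    CharTermAttribute second = ts.addAttribute(CharTermAttribute.class);
    System.out.println(first == second);
    // hasAttribute() is the cheap existence check for advanced consumers:
    System.out.println(ts.hasAttribute(CharTermAttribute.class));
    ts.close();
  }
}
```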
+
+Example
+
+ In this example we will create a WhiteSpaceTokenizer and use a LengthFilter to suppress all words that have
+ only two or fewer characters. The LengthFilter is part of the Lucene core and its implementation will be explained
+ here to illustrate the usage of the TokenStream API.
+
+
+ Then we will develop a custom Attribute, a PartOfSpeechAttribute, and add another filter to the chain which
+ utilizes the new custom attribute, and call it PartOfSpeechTaggingFilter.
+
+Whitespace tokenization
+
+public class MyAnalyzer extends Analyzer {
+
+ private Version matchVersion;
+
+ public MyAnalyzer(Version matchVersion) {
+ this.matchVersion = matchVersion;
+ }
+
+ {@literal @Override}
+ protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+ return new TokenStreamComponents(new WhitespaceTokenizer(matchVersion, reader));
+ }
+
+ public static void main(String[] args) throws IOException {
+ // text to tokenize
+ final String text = "This is a demo of the TokenStream API";
+
+ Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
+ MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
+ TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
+
+ // get the CharTermAttribute from the TokenStream
+ CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
+
+ try {
+ stream.reset();
+
+ // print all tokens until stream is exhausted
+ while (stream.incrementToken()) {
+ System.out.println(termAtt.toString());
+ }
+
+ stream.end();
+ } finally {
+ stream.close();
+ }
+ }
+}
+
+In this simple example, plain whitespace tokenization is performed. In main() a loop consumes the stream and
+prints the term text of the tokens by accessing the CharTermAttribute that the WhitespaceTokenizer provides.
+Here is the output:
+
+This
+is
+a
+demo
+of
+the
+TokenStream
+API
+
+Adding a LengthFilter
+We want to suppress all tokens that have two or fewer characters. We can do that
+easily by adding a LengthFilter to the chain. Only the
+createComponents() method in our analyzer needs to be changed:
+
+ {@literal @Override}
+ protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+ final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
+ TokenStream result = new LengthFilter(matchVersion, source, 3, Integer.MAX_VALUE);
+ return new TokenStreamComponents(source, result);
+ }
+
+Note how now only words with 3 or more characters are contained in the output:
+
+This
+demo
+the
+TokenStream
+API
+
+Now let's take a look at how LengthFilter is implemented:
+
+public final class LengthFilter extends FilteringTokenFilter {
+
+ private final int min;
+ private final int max;
+
+ private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+
+ /**
+ * Create a new LengthFilter. This will filter out tokens whose
+ * CharTermAttribute is either too short
+ * (< min) or too long (> max).
+ * @param version the Lucene match version
+ * @param in the TokenStream to consume
+ * @param min the minimum length
+ * @param max the maximum length
+ */
+ public LengthFilter(Version version, TokenStream in, int min, int max) {
+ super(version, in);
+ this.min = min;
+ this.max = max;
+ }
+
+ {@literal @Override}
+ public boolean accept() {
+ final int len = termAtt.length();
+ return (len >= min && len <= max);
+ }
+
+}
+
+
+ In LengthFilter, the CharTermAttribute is added and stored in the instance
+ variable termAtt. Remember that there can only be a single
+ instance of CharTermAttribute in the chain, so in our example the
+ addAttribute() call in LengthFilter returns the
+ CharTermAttribute that the WhitespaceTokenizer already added.
+
+
+ The tokens are retrieved from the input stream in FilteringTokenFilter's
+ incrementToken() method (see below), which calls LengthFilter's
+ accept() method. By looking at the term text in the
+ CharTermAttribute, the length of the term can be determined and tokens that
+ are either too short or too long are skipped. Note how
+ accept() can efficiently access the instance variable; no
+ attribute lookup is necessary. The same is true for the consumer, which can
+ simply use local references to the Attributes.
+
+
+ LengthFilter extends FilteringTokenFilter:
+
+
+
+public abstract class FilteringTokenFilter extends TokenFilter {
+
+ private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
+
+ /**
+ * Create a new FilteringTokenFilter.
+ * @param in the TokenStream to consume
+ */
+ public FilteringTokenFilter(Version version, TokenStream in) {
+ super(in);
+ }
+
+ /** Override this method and return whether the current input token should be returned by incrementToken. */
+ protected abstract boolean accept() throws IOException;
+
+ {@literal @Override}
+ public final boolean incrementToken() throws IOException {
+ int skippedPositions = 0;
+ while (input.incrementToken()) {
+ if (accept()) {
+ if (skippedPositions != 0) {
+ posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
+ }
+ return true;
+ }
+ skippedPositions += posIncrAtt.getPositionIncrement();
+ }
+ // reached EOS -- return false
+ return false;
+ }
+
+ {@literal @Override}
+ public void reset() throws IOException {
+ super.reset();
+ }
+
+}
+
+
+Adding a custom Attribute
+Now we're going to implement our own custom Attribute for part-of-speech tagging and accordingly call it
+PartOfSpeechAttribute. First we need to define the interface of the new Attribute:
+
+ public interface PartOfSpeechAttribute extends Attribute {
+ public static enum PartOfSpeech {
+ Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
+ }
+
+ public void setPartOfSpeech(PartOfSpeech pos);
+
+ public PartOfSpeech getPartOfSpeech();
+ }
+
+
+ Now we also need to write the implementing class. The name of that class is important here: By default, Lucene
+ checks if there is a class with the name of the Attribute with the suffix 'Impl'. In this example, we would
+ consequently call the implementing class PartOfSpeechAttributeImpl.
+
+
+ This should be the usual behavior. However, there is also an expert-API that allows changing these naming conventions:
+ {@link org.apache.lucene.util.AttributeFactory}. The factory accepts an Attribute interface as argument
+ and returns an actual instance. You can implement your own factory if you need to change the default behavior.
+
+
+ Now here is the actual class that implements our new Attribute. Notice that the class has to extend
+ {@link org.apache.lucene.util.AttributeImpl}:
+
+
+public final class PartOfSpeechAttributeImpl extends AttributeImpl
+ implements PartOfSpeechAttribute {
+
+ private PartOfSpeech pos = PartOfSpeech.Unknown;
+
+ public void setPartOfSpeech(PartOfSpeech pos) {
+ this.pos = pos;
+ }
+
+ public PartOfSpeech getPartOfSpeech() {
+ return pos;
+ }
+
+ {@literal @Override}
+ public void clear() {
+ pos = PartOfSpeech.Unknown;
+ }
+
+ {@literal @Override}
+ public void copyTo(AttributeImpl target) {
+ ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
+ }
+}
+
+
+ This simple Attribute implementation has only a single variable that
+ stores the part-of-speech of a token. It extends the
+ AttributeImpl class and therefore implements its abstract methods
+ clear() and copyTo(). Now we need a TokenFilter that
+ can set this new PartOfSpeechAttribute for each token. In this example we
+ show a very naive filter that tags every word with a leading upper-case letter
+ as a 'Noun' and all other words as 'Unknown'.
+
+
+ public static class PartOfSpeechTaggingFilter extends TokenFilter {
+ PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
+ CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+
+ protected PartOfSpeechTaggingFilter(TokenStream input) {
+ super(input);
+ }
+
+ public boolean incrementToken() throws IOException {
+ if (!input.incrementToken()) {return false;}
+ posAtt.setPartOfSpeech(determinePOS(termAtt.buffer(), 0, termAtt.length()));
+ return true;
+ }
+
+ // determine the part of speech for the given term
+ protected PartOfSpeech determinePOS(char[] term, int offset, int length) {
+ // naive implementation that tags every uppercased word as noun
+ if (length > 0 && Character.isUpperCase(term[0])) {
+ return PartOfSpeech.Noun;
+ }
+ return PartOfSpeech.Unknown;
+ }
+ }
+
+
+ Just like the LengthFilter, this new filter stores references to the
+ attributes it needs in instance variables. Notice how you only need to pass
+ in the interface of the new Attribute and instantiating the correct class
+ is automatically taken care of.
+
+Now we need to add the filter to the chain in MyAnalyzer:
+
+ {@literal @Override}
+ protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+ final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
+ TokenStream result = new LengthFilter(matchVersion, source, 3, Integer.MAX_VALUE);
+ result = new PartOfSpeechTaggingFilter(result);
+ return new TokenStreamComponents(source, result);
+ }
+
+Now let's look at the output:
+
+This
+demo
+the
+TokenStream
+API
+
+Apparently it hasn't changed, which shows that adding a custom attribute to a TokenStream/Filter chain does not
+affect any existing consumers, simply because they don't know the new Attribute. Now let's change the consumer
+to make use of the new PartOfSpeechAttribute and print it out:
+
+ public static void main(String[] args) throws IOException {
+ // text to tokenize
+ final String text = "This is a demo of the TokenStream API";
+
+ Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
+ MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
+ TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
+
+ // get the CharTermAttribute from the TokenStream
+ CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
+
+ // get the PartOfSpeechAttribute from the TokenStream
+ PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);
+
+ try {
+ stream.reset();
+
+ // print all tokens until stream is exhausted
+ while (stream.incrementToken()) {
+ System.out.println(termAtt.toString() + ": " + posAtt.getPartOfSpeech());
+ }
+
+ stream.end();
+ } finally {
+ stream.close();
+ }
+ }
+
+The change that was made is to get the PartOfSpeechAttribute from the TokenStream and print out its contents in
+the while loop that consumes the stream. Here is the new output:
+
+This: Noun
+demo: Unknown
+the: Unknown
+TokenStream: Noun
+API: Noun
+
+Each word is now followed by its assigned PartOfSpeech tag. Of course this is naive
+part-of-speech tagging. The word 'This' should not even be tagged as a noun; it is only capitalized because it
+is the first word of a sentence. Actually this is a good opportunity for an exercise. To practice the usage of the new
+API, the reader could now write an Attribute and TokenFilter that can specify for each word whether it was the first token
+of a sentence or not. Then the PartOfSpeechTaggingFilter can make use of this knowledge and only tag capitalized words
+as nouns if they are not the first word of a sentence (we know, this is still not correct behavior, but hey, it's a good exercise).
+As a small hint, this is how the new Attribute class could begin:
+
+ public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
+ implements FirstTokenOfSentenceAttribute {
+
+ private boolean firstToken;
+
+ public void setFirstToken(boolean firstToken) {
+ this.firstToken = firstToken;
+ }
+
+ public boolean getFirstToken() {
+ return firstToken;
+ }
+
+ {@literal @Override}
+ public void clear() {
+ firstToken = false;
+ }
+
+ ...
+
+Adding a CharFilter chain
+Analyzers take Java {@link java.io.Reader}s as input. Of course you can wrap your Readers with {@link java.io.FilterReader}s
+to manipulate content, but this would have the big disadvantage that character offsets might be inconsistent with your original
+text.
+
+{@link org.apache.lucene.analysis.CharFilter} is designed to allow you to pre-process input like a FilterReader would, but also
+preserve the original offsets associated with those characters. This way mechanisms like highlighting still work correctly.
+CharFilters can be chained.
+
+Example:
+
+public class MyAnalyzer extends Analyzer {
+
+ {@literal @Override}
+ protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+ return new TokenStreamComponents(new MyTokenizer(reader));
+ }
+
+ {@literal @Override}
+ protected Reader initReader(String fieldName, Reader reader) {
+ // wrap the Reader in a CharFilter chain.
+ return new SecondCharFilter(new FirstCharFilter(reader));
+ }
+}
+
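The FirstCharFilter and SecondCharFilter above are hypothetical placeholders. As a concrete illustration, Lucene's own MappingCharFilter together with NormalizeCharMap can serve as such a pre-processing step. This sketch assumes Lucene 4.x, where NormalizeCharMap.Builder is available; it rewrites "ph" to "f" before tokenization while the filter keeps track of the original offsets internally:

```java
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class CharFilterDemo {
  public static void main(String[] args) throws Exception {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("ph", "f");        // map every occurrence of "ph" to "f"
    Reader filtered = new MappingCharFilter(builder.build(), new StringReader("phone"));
    int c;
    while ((c = filtered.read()) != -1) {
      System.out.print((char) c);  // the filtered stream reads as "fone"
    }
    filtered.close();
  }
}
```

A tokenizer consuming this reader would see "fone", yet highlighting would still point at the original "phone", because the CharFilter corrects offsets back to the unfiltered input.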
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/LetterTokenizerFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/LowerCaseFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/LowerCaseFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/LowerCaseTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/LowerCaseTokenizerFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/SimpleAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/StopAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/StopFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/StopFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/TypeTokenFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/TypeTokenFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/UpperCaseFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/UpperCaseFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/WhitespaceAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/WhitespaceTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/core/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/cz/CzechAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/cz/CzechStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/cz/CzechStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/cz/CzechStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/cz/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/da/DanishAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/da/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanLightStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanLightStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanLightStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanMinimalStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanMinimalStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanMinimalStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanNormalizationFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanNormalizationFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/GermanStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/de/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/el/GreekAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/el/GreekLowerCaseFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/el/GreekStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/el/GreekStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/el/GreekStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/el/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/EnglishAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/EnglishMinimalStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/EnglishMinimalStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/EnglishMinimalStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/EnglishPossessiveFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/EnglishPossessiveFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemData1.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemData2.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemData3.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemData4.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemData5.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemData6.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemData7.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemData8.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/KStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/PorterStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/PorterStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/PorterStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/en/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/es/SpanishAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/es/SpanishLightStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/es/SpanishLightStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/es/SpanishLightStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/es/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/eu/BasqueAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/eu/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fa/PersianAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fa/PersianCharFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fa/PersianCharFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fa/PersianNormalizationFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fa/PersianNormalizationFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fa/PersianNormalizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fa/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fi/FinnishAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fi/FinnishLightStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fi/FinnishLightStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fi/FinnishLightStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fi/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/FrenchAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/FrenchLightStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/FrenchLightStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/FrenchLightStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/FrenchMinimalStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/FrenchMinimalStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/FrenchStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/FrenchStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/fr/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ga/IrishAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ga/IrishLowerCaseFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ga/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/gl/GalicianAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/gl/GalicianMinimalStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/gl/GalicianMinimalStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/gl/GalicianMinimalStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/gl/GalicianStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/gl/GalicianStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/gl/GalicianStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/gl/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hi/HindiAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hi/HindiNormalizationFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hi/HindiNormalizationFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hi/HindiNormalizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hi/HindiStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hi/HindiStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hi/HindiStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hi/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hu/HungarianAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hu/HungarianLightStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hu/HungarianLightStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hu/HungarianLightStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hu/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hunspell/Dictionary.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hunspell/HunspellStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hunspell/HunspellStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hunspell/ISO8859_14Decoder.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hunspell/Stemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hunspell/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hy/ArmenianAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/hy/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/id/IndonesianAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/id/IndonesianStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/id/IndonesianStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/id/IndonesianStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/id/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/in/IndicNormalizationFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/in/IndicNormalizationFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/in/IndicNormalizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/in/IndicTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/in/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/it/ItalianAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/it/ItalianLightStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/it/ItalianLightStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/it/ItalianLightStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/it/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/lv/LatvianAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/lv/LatvianStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/lv/LatvianStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/lv/LatvianStemmer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/lv/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/CapitalizationFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/CapitalizationFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/CodepointCountFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/CodepointCountFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/EmptyTokenStream.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/KeepWordFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/LengthFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/LengthFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/LimitTokenCountAnalyzer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/LimitTokenPositionFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/LimitTokenPositionFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/Lucene47WordDelimiterFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/PatternAnalyzer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/PatternKeywordMarkerFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/PrefixAndSuffixAwareTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/PrefixAwareTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/SetKeywordMarkerFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/SingleTokenTokenStream.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/TrimFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/TrimFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/WordDelimiterIterator.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/miscellaneous/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/EdgeNGramFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/EdgeNGramTokenizerFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/Lucene43EdgeNGramTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/Lucene43EdgeNGramTokenizer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/Lucene43NGramTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/Lucene43NGramTokenizer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/NGramFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/NGramTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/NGramTokenizer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/NGramTokenizerFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ngram/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/nl/DutchAnalyzer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/nl/DutchStemFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/nl/DutchStemmer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/nl/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/no/NorwegianAnalyzer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/no/NorwegianLightStemFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/no/NorwegianLightStemFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/no/NorwegianLightStemmer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/no/NorwegianMinimalStemFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/no/NorwegianMinimalStemFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/no/NorwegianMinimalStemmer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/no/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/path/PathHierarchyTokenizer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/path/PathHierarchyTokenizerFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/path/ReversePathHierarchyTokenizer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/path/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pattern/PatternReplaceCharFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pattern/PatternReplaceFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pattern/PatternReplaceFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pattern/PatternTokenizer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pattern/PatternTokenizerFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pattern/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/AbstractEncoder.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/FloatEncoder.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/IdentityEncoder.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/IntegerEncoder.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/NumericPayloadTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/NumericPayloadTokenFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/PayloadEncoder.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/PayloadHelper.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/TokenOffsetPayloadTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/TokenOffsetPayloadTokenFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/TypeAsPayloadTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/TypeAsPayloadTokenFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/payloads/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/position/PositionFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/position/PositionFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/position/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseAnalyzer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseLightStemFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseLightStemFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseLightStemmer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseMinimalStemFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseMinimalStemFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseMinimalStemmer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseStemFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseStemFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/PortugueseStemmer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/RSLPStemmerBase.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/pt/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/query/QueryAutoStopWordAnalyzer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/query/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/reverse/ReverseStringFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/reverse/ReverseStringFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/reverse/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ro/RomanianAnalyzer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ro/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ru/RussianAnalyzer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ru/RussianLetterTokenizer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ru/RussianLetterTokenizerFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ru/RussianLightStemFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ru/RussianLightStemFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ru/RussianLightStemmer.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/ru/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/shingle/ShingleAnalyzerWrapper.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/shingle/ShingleFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/shingle/ShingleFilterFactory.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/shingle/package.html'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sinks/DateRecognizerSinkFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sinks/TokenRangeSinkFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sinks/TokenTypeSinkFilter.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sinks/package.html'.
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/SnowballAnalyzer.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/SnowballAnalyzer.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/SnowballAnalyzer.java 17 Aug 2012 14:55:15 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/SnowballAnalyzer.java 16 Dec 2014 11:32:17 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis.snowball;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -18,42 +18,71 @@
*/
import org.apache.lucene.analysis.*;
+import org.apache.lucene.analysis.core.LowerCaseFilter;
+import org.apache.lucene.analysis.core.StopFilter;
+import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.standard.*;
+import org.apache.lucene.analysis.tr.TurkishLowerCaseFilter;
+import org.apache.lucene.analysis.util.CharArraySet;
+import org.apache.lucene.util.Version;
import java.io.Reader;
-import java.util.Set;
/** Filters {@link StandardTokenizer} with {@link StandardFilter}, {@link
* LowerCaseFilter}, {@link StopFilter} and {@link SnowballFilter}.
*
- * Available stemmers are listed in {@link net.sf.snowball.ext}. The name of a
+ * Available stemmers are listed in org.tartarus.snowball.ext. The name of a
* stemmer is the part of the class name before "Stemmer", e.g., the stemmer in
* {@link org.tartarus.snowball.ext.EnglishStemmer} is named "English".
+ *
+ * NOTE: This class uses the same {@link Version}
+ * dependent settings as {@link StandardAnalyzer}, with the following addition:
+ *
+ * As of 3.1, uses {@link TurkishLowerCaseFilter} for Turkish language.
+ *
+ *
+ * @deprecated (3.1) Use the language-specific analyzer in modules/analysis instead.
+ * This analyzer will be removed in Lucene 5.0
*/
-public class SnowballAnalyzer extends Analyzer {
+@Deprecated
+public final class SnowballAnalyzer extends Analyzer {
private String name;
- private Set stopSet;
+ private CharArraySet stopSet;
+ private final Version matchVersion;
/** Builds the named analyzer with no stop words. */
- public SnowballAnalyzer(String name) {
+ public SnowballAnalyzer(Version matchVersion, String name) {
this.name = name;
+ this.matchVersion = matchVersion;
}
/** Builds the named analyzer with the given stop words. */
- public SnowballAnalyzer(String name, String[] stopWords) {
- this(name);
- stopSet = StopFilter.makeStopSet(stopWords);
+ public SnowballAnalyzer(Version matchVersion, String name, CharArraySet stopWords) {
+ this(matchVersion, name);
+ stopSet = CharArraySet.unmodifiableSet(CharArraySet.copy(matchVersion,
+ stopWords));
}
/** Constructs a {@link StandardTokenizer} filtered by a {@link
- StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
- public TokenStream tokenStream(String fieldName, Reader reader) {
- TokenStream result = new StandardTokenizer(reader);
- result = new StandardFilter(result);
- result = new LowerCaseFilter(result);
+ StandardFilter}, a {@link LowerCaseFilter}, a {@link StopFilter},
+ and a {@link SnowballFilter} */
+ @Override
+ public TokenStreamComponents createComponents(String fieldName, Reader reader) {
+ Tokenizer tokenizer = new StandardTokenizer(matchVersion, reader);
+ TokenStream result = new StandardFilter(matchVersion, tokenizer);
+ // remove the possessive 's for english stemmers
+ if (matchVersion.onOrAfter(Version.LUCENE_3_1) &&
+ (name.equals("English") || name.equals("Porter") || name.equals("Lovins")))
+ result = new EnglishPossessiveFilter(result);
+ // Use a special lowercase filter for turkish, the stemmer expects it.
+ if (matchVersion.onOrAfter(Version.LUCENE_3_1) && name.equals("Turkish"))
+ result = new TurkishLowerCaseFilter(result);
+ else
+ result = new LowerCaseFilter(matchVersion, result);
if (stopSet != null)
- result = new StopFilter(result, stopSet);
+ result = new StopFilter(matchVersion,
+ result, stopSet);
result = new SnowballFilter(result, name);
- return result;
+ return new TokenStreamComponents(tokenizer, result);
}
}
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/SnowballFilter.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/SnowballFilter.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/SnowballFilter.java 17 Aug 2012 14:55:14 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/SnowballFilter.java 16 Dec 2014 11:32:17 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis.snowball;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -19,20 +19,44 @@
import java.io.IOException;
-import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.core.LowerCaseFilter;
+import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.analysis.tr.TurkishLowerCaseFilter; // javadoc @link
import org.tartarus.snowball.SnowballProgram;
/**
* A filter that stems words using a Snowball-generated stemmer.
*
* Available stemmers are listed in {@link org.tartarus.snowball.ext}.
+ * NOTE: SnowballFilter expects lowercased text.
+ *
+ * For the Turkish language, see {@link TurkishLowerCaseFilter}.
+ * For other languages, see {@link LowerCaseFilter}.
+ *
+ * Note: This filter is aware of the {@link KeywordAttribute}. To prevent
+ * certain terms from being passed to the stemmer
+ * {@link KeywordAttribute#isKeyword()} should be set to true
+ * in a previous {@link TokenStream}.
+ *
+ * Note: For including the original term as well as the stemmed version, see
+ * {@link org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilterFactory}
+ *
*/
-public class SnowballFilter extends TokenFilter {
+public final class SnowballFilter extends TokenFilter {
- private SnowballProgram stemmer;
+ private final SnowballProgram stemmer;
+ private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+ private final KeywordAttribute keywordAttr = addAttribute(KeywordAttribute.class);
+
public SnowballFilter(TokenStream input, SnowballProgram stemmer) {
super(input);
this.stemmer = stemmer;
@@ -50,27 +74,36 @@
*/
public SnowballFilter(TokenStream in, String name) {
super(in);
- try {
- Class stemClass = Class.forName("org.tartarus.snowball.ext." + name + "Stemmer");
- stemmer = (SnowballProgram) stemClass.newInstance();
+ // Class.forName is frowned upon in place of the ResourceLoader, but in this case
+ // the factory will use the other constructor so that the program is already loaded.
+ try {
+ Class<? extends SnowballProgram> stemClass =
+ Class.forName("org.tartarus.snowball.ext." + name + "Stemmer").asSubclass(SnowballProgram.class);
+ stemmer = stemClass.newInstance();
} catch (Exception e) {
- throw new RuntimeException(e.toString());
+ throw new IllegalArgumentException("Invalid stemmer class specified: " + name, e);
}
}
/** Returns the next input Token, after being stemmed */
- public final Token next(final Token reusableToken) throws IOException {
- assert reusableToken != null;
- Token nextToken = input.next(reusableToken);
- if (nextToken == null)
- return null;
- String originalTerm = nextToken.term();
- stemmer.setCurrent(originalTerm);
- stemmer.stem();
- String finalTerm = stemmer.getCurrent();
- // Don't bother updating, if it is unchanged.
- if (!originalTerm.equals(finalTerm))
- nextToken.setTermBuffer(finalTerm);
- return nextToken;
+ @Override
+ public final boolean incrementToken() throws IOException {
+ if (input.incrementToken()) {
+ if (!keywordAttr.isKeyword()) {
+ char termBuffer[] = termAtt.buffer();
+ final int length = termAtt.length();
+ stemmer.setCurrent(termBuffer, length);
+ stemmer.stem();
+ final char finalTerm[] = stemmer.getCurrentBuffer();
+ final int newLength = stemmer.getCurrentBufferLength();
+ if (finalTerm != termBuffer)
+ termAtt.copyBuffer(finalTerm, 0, newLength);
+ else
+ termAtt.setLength(newLength);
+ }
+ return true;
+ } else {
+ return false;
+ }
}
}
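The String-name constructor above resolves the stemmer class reflectively and wraps any failure in an IllegalArgumentException. A minimal standalone sketch of the same Class.forName(...).asSubclass(...) pattern, with toy stand-ins for the org.tartarus.snowball classes (all names and the "strip -ing" rule here are illustrative, not the real Snowball code):

```java
public class StemmerLoader {
    // Stand-in for org.tartarus.snowball.SnowballProgram.
    public static abstract class SnowballProgram {
        public abstract String stemWord(String word);
    }

    // Stand-in for org.tartarus.snowball.ext.EnglishStemmer, with a toy rule.
    public static class EnglishStemmer extends SnowballProgram {
        @Override public String stemWord(String word) {
            return word.endsWith("ing") ? word.substring(0, word.length() - 3) : word;
        }
    }

    // Mirrors SnowballFilter(TokenStream, String): resolve "<name>Stemmer"
    // by reflection, narrowing with asSubclass to keep the cast checked.
    static SnowballProgram load(String name) {
        try {
            Class<? extends SnowballProgram> stemClass =
                Class.forName("StemmerLoader$" + name + "Stemmer").asSubclass(SnowballProgram.class);
            return stemClass.getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid stemmer class specified: " + name, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(load("English").stemWord("running")); // runn
    }
}
```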
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/SnowballPorterFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/package.html
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/package.html,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/package.html 17 Aug 2012 14:55:14 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/snowball/package.html 16 Dec 2014 11:32:17 -0000 1.1.2.1
@@ -1,7 +1,53 @@
+
+
{@link org.apache.lucene.analysis.TokenFilter} and {@link
org.apache.lucene.analysis.Analyzer} implementations that use Snowball
stemmers.
+
+This project provides pre-compiled versions of the Snowball stemmers
+based on revision 500 of the Tartarus Snowball repository,
+together with classes integrating them with the Lucene search engine.
+
+
+A few changes have been made to the static Snowball code and compiled stemmers:
+
+<ul>
+ <li>Class SnowballProgram is made abstract and contains a new abstract method stem() to avoid reflection in the Lucene filter class SnowballFilter.</li>
+ <li>All use of StringBuffer has been refactored to StringBuilder for speed.</li>
+ <li>The Snowball BSD license header has been added to the Java classes to avoid having RAT add ASL headers.</li>
+</ul>
+
+See the Snowball home page for more information about the algorithms.
+
+
+
+IMPORTANT NOTICE ON BACKWARDS COMPATIBILITY!
+
+
+An index created using the Snowball module in Lucene 2.3.2 and below
+might not be compatible with the Snowball module in Lucene 2.4 or greater.
+
+
+For more information about this issue see:
+https://issues.apache.org/jira/browse/LUCENE-1142
+
+
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/ASCIITLD.jflex-macro'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/ClassicAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/ClassicFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/ClassicFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/ClassicTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/ClassicTokenizerFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/ClassicTokenizerImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/ClassicTokenizerImpl.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/READ_BEFORE_REGENERATING.txt'.
Fisheye: No comparison available. Pass `N' to diff?
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardAnalyzer.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardAnalyzer.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardAnalyzer.java 17 Aug 2012 14:55:14 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardAnalyzer.java 16 Dec 2014 11:32:10 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis.standard;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,198 +17,96 @@
* limitations under the License.
*/
-import org.apache.lucene.analysis.*;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.core.LowerCaseFilter;
+import org.apache.lucene.analysis.core.StopAnalyzer;
+import org.apache.lucene.analysis.core.StopFilter;
+import org.apache.lucene.analysis.util.CharArraySet;
+import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
+import org.apache.lucene.analysis.util.WordlistLoader;
+import org.apache.lucene.util.Version;
-import java.io.File;
import java.io.IOException;
import java.io.Reader;
-import java.util.Set;
/**
* Filters {@link StandardTokenizer} with {@link StandardFilter}, {@link
- * LowerCaseFilter} and {@link StopFilter}, using a list of English stop words.
+ * LowerCaseFilter} and {@link StopFilter}, using a list of
+ * English stop words.
*
- * @version $Id$
+ *
+ * You may specify the {@link Version}
+ * compatibility when creating StandardAnalyzer:
+ *
+ * As of 3.4, Hiragana and Han characters are no longer wrongly split
+ * from their combining characters. If you use a previous version number,
+ * you get the exact broken behavior for backwards compatibility.
+ * As of 3.1, StandardTokenizer implements Unicode text segmentation,
+ * and StopFilter correctly handles Unicode 4.0 supplementary characters
+ * in stopwords. {@link ClassicTokenizer} and {@link ClassicAnalyzer}
+ * are the pre-3.1 implementations of StandardTokenizer and
+ * StandardAnalyzer.
+ *
*/
-public class StandardAnalyzer extends Analyzer {
- private Set stopSet;
+public final class StandardAnalyzer extends StopwordAnalyzerBase {
+
+ /** Default maximum allowed token length */
+ public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
- /**
- * Specifies whether deprecated acronyms should be replaced with HOST type.
- * This is false by default to support backward compatibility.
- *
- * @deprecated this should be removed in the next release (3.0).
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- */
- private boolean replaceInvalidAcronym = defaultReplaceInvalidAcronym;
+ private int maxTokenLength = DEFAULT_MAX_TOKEN_LENGTH;
- private static boolean defaultReplaceInvalidAcronym;
+ /** An unmodifiable set containing some common English words that are usually not
+ useful for searching. */
+ public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
- // Default to true (fixed the bug), unless the system prop is set
- static {
- final String v = System.getProperty("org.apache.lucene.analysis.standard.StandardAnalyzer.replaceInvalidAcronym");
- if (v == null || v.equals("true"))
- defaultReplaceInvalidAcronym = true;
- else
- defaultReplaceInvalidAcronym = false;
+ /** Builds an analyzer with the given stop words.
+ * @param stopWords stop words */
+ public StandardAnalyzer(CharArraySet stopWords) {
+ super(stopWords);
}
/**
- *
- * @return true if new instances of StandardTokenizer will
- * replace mischaracterized acronyms
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- * @deprecated This will be removed (hardwired to true) in 3.0
+ * @deprecated Use {@link #StandardAnalyzer(CharArraySet)}
*/
- public static boolean getDefaultReplaceInvalidAcronym() {
- return defaultReplaceInvalidAcronym;
+ @Deprecated
+ public StandardAnalyzer(Version matchVersion, CharArraySet stopWords) {
+ super(matchVersion, stopWords);
}
- /**
- *
- * @param replaceInvalidAcronym Set to true to have new
- * instances of StandardTokenizer replace mischaracterized
- * acronyms by default. Set to false to preseve the
- * previous (before 2.4) buggy behavior. Alternatively,
- * set the system property
- * org.apache.lucene.analysis.standard.StandardAnalyzer.replaceInvalidAcronym
- * to false.
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- * @deprecated This will be removed (hardwired to true) in 3.0
+ /** Builds an analyzer with the default stop words ({@link #STOP_WORDS_SET}).
*/
- public static void setDefaultReplaceInvalidAcronym(boolean replaceInvalidAcronym) {
- defaultReplaceInvalidAcronym = replaceInvalidAcronym;
- }
-
-
- /** An array containing some common English words that are usually not
- useful for searching. */
- public static final String[] STOP_WORDS = StopAnalyzer.ENGLISH_STOP_WORDS;
-
- /** Builds an analyzer with the default stop words ({@link #STOP_WORDS}). */
public StandardAnalyzer() {
- this(STOP_WORDS);
+ this(STOP_WORDS_SET);
}
- /** Builds an analyzer with the given stop words. */
- public StandardAnalyzer(Set stopWords) {
- stopSet = stopWords;
- }
-
- /** Builds an analyzer with the given stop words. */
- public StandardAnalyzer(String[] stopWords) {
- stopSet = StopFilter.makeStopSet(stopWords);
- }
-
- /** Builds an analyzer with the stop words from the given file.
- * @see WordlistLoader#getWordSet(File)
+ /**
+ * @deprecated Use {@link #StandardAnalyzer()}
*/
- public StandardAnalyzer(File stopwords) throws IOException {
- stopSet = WordlistLoader.getWordSet(stopwords);
+ @Deprecated
+ public StandardAnalyzer(Version matchVersion) {
+ this(matchVersion, STOP_WORDS_SET);
}
/** Builds an analyzer with the stop words from the given reader.
* @see WordlistLoader#getWordSet(Reader)
- */
+ * @param stopwords Reader to read stop words from */
public StandardAnalyzer(Reader stopwords) throws IOException {
- stopSet = WordlistLoader.getWordSet(stopwords);
+ this(loadStopwordSet(stopwords));
}
/**
- *
- * @param replaceInvalidAcronym Set to true if this analyzer should replace mischaracterized acronyms in the StandardTokenizer
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- *
- * @deprecated Remove in 3.X and make true the only valid value
+ * @deprecated Use {@link #StandardAnalyzer()}
*/
- public StandardAnalyzer(boolean replaceInvalidAcronym) {
- this(STOP_WORDS);
- this.replaceInvalidAcronym = replaceInvalidAcronym;
+ @Deprecated
+ public StandardAnalyzer(Version matchVersion, Reader stopwords) throws IOException {
+ this(matchVersion, loadStopwordSet(stopwords, matchVersion));
}
/**
- * @param stopwords The stopwords to use
- * @param replaceInvalidAcronym Set to true if this analyzer should replace mischaracterized acronyms in the StandardTokenizer
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- *
- * @deprecated Remove in 3.X and make true the only valid value
- */
- public StandardAnalyzer(Reader stopwords, boolean replaceInvalidAcronym) throws IOException{
- this(stopwords);
- this.replaceInvalidAcronym = replaceInvalidAcronym;
- }
-
- /**
- * @param stopwords The stopwords to use
- * @param replaceInvalidAcronym Set to true if this analyzer should replace mischaracterized acronyms in the StandardTokenizer
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- *
- * @deprecated Remove in 3.X and make true the only valid value
- */
- public StandardAnalyzer(File stopwords, boolean replaceInvalidAcronym) throws IOException{
- this(stopwords);
- this.replaceInvalidAcronym = replaceInvalidAcronym;
- }
-
- /**
- *
- * @param stopwords The stopwords to use
- * @param replaceInvalidAcronym Set to true if this analyzer should replace mischaracterized acronyms in the StandardTokenizer
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- *
- * @deprecated Remove in 3.X and make true the only valid value
- */
- public StandardAnalyzer(String [] stopwords, boolean replaceInvalidAcronym) throws IOException{
- this(stopwords);
- this.replaceInvalidAcronym = replaceInvalidAcronym;
- }
-
- /**
- * @param stopwords The stopwords to use
- * @param replaceInvalidAcronym Set to true if this analyzer should replace mischaracterized acronyms in the StandardTokenizer
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- *
- * @deprecated Remove in 3.X and make true the only valid value
- */
- public StandardAnalyzer(Set stopwords, boolean replaceInvalidAcronym) throws IOException{
- this(stopwords);
- this.replaceInvalidAcronym = replaceInvalidAcronym;
- }
-
- /** Constructs a {@link StandardTokenizer} filtered by a {@link
- StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
- public TokenStream tokenStream(String fieldName, Reader reader) {
- StandardTokenizer tokenStream = new StandardTokenizer(reader, replaceInvalidAcronym);
- tokenStream.setMaxTokenLength(maxTokenLength);
- TokenStream result = new StandardFilter(tokenStream);
- result = new LowerCaseFilter(result);
- result = new StopFilter(result, stopSet);
- return result;
- }
-
- private static final class SavedStreams {
- StandardTokenizer tokenStream;
- TokenStream filteredTokenStream;
- }
-
- /** Default maximum allowed token length */
- public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
-
- private int maxTokenLength = DEFAULT_MAX_TOKEN_LENGTH;
-
- /**
* Set maximum allowed token length. If a token is seen
* that exceeds this length then it is discarded. This
* setting only takes effect the next time tokenStream or
- * reusableTokenStream is called.
+ * tokenStream is called.
*/
public void setMaxTokenLength(int length) {
maxTokenLength = length;
@@ -220,45 +118,20 @@
public int getMaxTokenLength() {
return maxTokenLength;
}
-
- public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
- SavedStreams streams = (SavedStreams) getPreviousTokenStream();
- if (streams == null) {
- streams = new SavedStreams();
- setPreviousTokenStream(streams);
- streams.tokenStream = new StandardTokenizer(reader);
- streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
- streams.filteredTokenStream = new LowerCaseFilter(streams.filteredTokenStream);
- streams.filteredTokenStream = new StopFilter(streams.filteredTokenStream, stopSet);
- } else {
- streams.tokenStream.reset(reader);
- }
- streams.tokenStream.setMaxTokenLength(maxTokenLength);
-
- streams.tokenStream.setReplaceInvalidAcronym(replaceInvalidAcronym);
- return streams.filteredTokenStream;
+ @Override
+ protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
+ final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
+ src.setMaxTokenLength(maxTokenLength);
+ TokenStream tok = new StandardFilter(getVersion(), src);
+ tok = new LowerCaseFilter(getVersion(), tok);
+ tok = new StopFilter(getVersion(), tok, stopwords);
+ return new TokenStreamComponents(src, tok) {
+ @Override
+ protected void setReader(final Reader reader) throws IOException {
+ src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
+ super.setReader(reader);
+ }
+ };
}
-
- /**
- *
- * @return true if this Analyzer is replacing mischaracterized acronyms in the StandardTokenizer
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- * @deprecated This will be removed (hardwired to true) in 3.0
- */
- public boolean isReplaceInvalidAcronym() {
- return replaceInvalidAcronym;
- }
-
- /**
- *
- * @param replaceInvalidAcronym Set to true if this Analyzer is replacing mischaracterized acronyms in the StandardTokenizer
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- * @deprecated This will be removed (hardwired to true) in 3.0
- */
- public void setReplaceInvalidAcronym(boolean replaceInvalidAcronym) {
- this.replaceInvalidAcronym = replaceInvalidAcronym;
- }
}
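createComponents above wires StandardTokenizer through StandardFilter, LowerCaseFilter, and StopFilter. A rough, dependency-free sketch of that pipeline shape (MiniAnalyzer and its crude regex word-break tokenizer are illustrative stand-ins, not Lucene classes):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class MiniAnalyzer {
    // Tiny stand-in for an English stop-word set like STOP_WORDS_SET.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\W+")) {            // crude word-break tokenizer
            if (tok.isEmpty()) continue;
            String lowered = tok.toLowerCase(Locale.ROOT); // LowerCaseFilter step
            if (!STOP_WORDS.contains(lowered))             // StopFilter step
                out.add(lowered);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Fox of Lucene")); // [quick, fox, lucene]
    }
}
```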
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardFilter.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardFilter.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardFilter.java 17 Aug 2012 14:55:14 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardFilter.java 16 Dec 2014 11:32:10 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis.standard;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,54 +17,73 @@
* limitations under the License.
*/
+import java.io.IOException;
+
import org.apache.lucene.analysis.TokenFilter;
-import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
+import org.apache.lucene.util.Version;
-/** Normalizes tokens extracted with {@link StandardTokenizer}. */
-
-public final class StandardFilter extends TokenFilter {
-
-
- /** Construct filtering in. */
+/**
+ * Normalizes tokens extracted with {@link StandardTokenizer}.
+ */
+public class StandardFilter extends TokenFilter {
+ private final Version matchVersion;
+
public StandardFilter(TokenStream in) {
- super(in);
+ this(Version.LATEST, in);
}
- private static final String APOSTROPHE_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.APOSTROPHE];
- private static final String ACRONYM_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.ACRONYM];
-
- /** Returns the next token in the stream, or null at EOS.
- * Removes 's from the end of words.
- *
- * Removes dots from acronyms.
+ /**
+ * @deprecated Use {@link #StandardFilter(TokenStream)}
*/
- public final Token next(final Token reusableToken) throws java.io.IOException {
- assert reusableToken != null;
- Token nextToken = input.next(reusableToken);
+ @Deprecated
+ public StandardFilter(Version matchVersion, TokenStream in) {
+ super(in);
+ this.matchVersion = matchVersion;
+ }
+
+ private static final String APOSTROPHE_TYPE = ClassicTokenizer.TOKEN_TYPES[ClassicTokenizer.APOSTROPHE];
+ private static final String ACRONYM_TYPE = ClassicTokenizer.TOKEN_TYPES[ClassicTokenizer.ACRONYM];
- if (nextToken == null)
- return null;
+ // this filter uses the type and term attributes
+ private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
+ private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+
+ @Override
+ public final boolean incrementToken() throws IOException {
+ if (matchVersion.onOrAfter(Version.LUCENE_3_1))
+ return input.incrementToken(); // TODO: add some niceties for the new grammar
+ else
+ return incrementTokenClassic();
+ }
+
+ public final boolean incrementTokenClassic() throws IOException {
+ if (!input.incrementToken()) {
+ return false;
+ }
- char[] buffer = nextToken.termBuffer();
- final int bufferLength = nextToken.termLength();
- final String type = nextToken.type();
+ final char[] buffer = termAtt.buffer();
+ final int bufferLength = termAtt.length();
+ final String type = typeAtt.type();
- if (type == APOSTROPHE_TYPE && // remove 's
- bufferLength >= 2 &&
+ if (type == APOSTROPHE_TYPE && // remove 's
+ bufferLength >= 2 &&
buffer[bufferLength-2] == '\'' &&
(buffer[bufferLength-1] == 's' || buffer[bufferLength-1] == 'S')) {
// Strip last 2 characters off
- nextToken.setTermLength(bufferLength - 2);
- } else if (type == ACRONYM_TYPE) { // remove dots
+ termAtt.setLength(bufferLength - 2);
+ } else if (type == ACRONYM_TYPE) { // remove dots
int upto = 0;
      for(int i=0;i<bufferLength;i++) {
- * This should be a good tokenizer for most European-language documents:
- *
- *
- * Splits words at punctuation characters, removing punctuation. However, a
- * dot that's not followed by whitespace is considered part of a token.
- * Splits words at hyphens, unless there's a number in the token, in which case
- * the whole token is interpreted as a product number and is not split.
- * Recognizes email addresses and internet hostnames as one token.
- *
- *
+/** A grammar-based tokenizer constructed with JFlex.
+ *
+ * As of Lucene version 3.1, this class implements the Word Break rules from the
+ * Unicode Text Segmentation algorithm, as specified in
+ * Unicode Standard Annex #29.
+ *
* Many applications have specific tokenizer needs. If this tokenizer does
* not suit your application, please consider copying this source code
* directory to your project and maintaining your own grammar-based tokenizer.
+ *
+ *
+ *
+ * You must specify the required {@link Version}
+ * compatibility when creating StandardTokenizer:
+ *
+ * As of 3.4, Hiragana and Han characters are no longer wrongly split
+ * from their combining characters. If you use a previous version number,
+ * you get the exact broken behavior for backwards compatibility.
+ * As of 3.1, StandardTokenizer implements Unicode text segmentation.
+ * If you use a previous version number, you get the exact behavior of
+ * {@link ClassicTokenizer} for backwards compatibility.
+ *
*/
-public class StandardTokenizer extends Tokenizer {
+public final class StandardTokenizer extends Tokenizer {
/** A private instance of the JFlex-constructed scanner */
- private final StandardTokenizerImpl scanner;
+ private StandardTokenizerInterface scanner;
public static final int ALPHANUM = 0;
+ /** @deprecated (3.1) */
+ @Deprecated
public static final int APOSTROPHE = 1;
+ /** @deprecated (3.1) */
+ @Deprecated
public static final int ACRONYM = 2;
+ /** @deprecated (3.1) */
+ @Deprecated
public static final int COMPANY = 3;
public static final int EMAIL = 4;
+ /** @deprecated (3.1) */
+ @Deprecated
public static final int HOST = 5;
public static final int NUM = 6;
+ /** @deprecated (3.1) */
+ @Deprecated
public static final int CJ = 7;
- /**
- * @deprecated this solves a bug where HOSTs that end with '.' are identified
- * as ACRONYMs. It is deprecated and will be removed in the next
- * release.
- */
+ /** @deprecated (3.1) */
+ @Deprecated
public static final int ACRONYM_DEP = 8;
+ public static final int SOUTHEAST_ASIAN = 9;
+ public static final int IDEOGRAPHIC = 10;
+ public static final int HIRAGANA = 11;
+ public static final int KATAKANA = 12;
+ public static final int HANGUL = 13;
+
/** String token types that correspond to token type int constants */
public static final String [] TOKEN_TYPES = new String [] {
"<ALPHANUM>",
@@ -70,141 +97,152 @@
"<HOST>",
"<NUM>",
"<CJ>",
- "<ACRONYM_DEP>"
+ "<ACRONYM_DEP>",
+ "<SOUTHEAST_ASIAN>",
+ "<IDEOGRAPHIC>",
+ "<HIRAGANA>",
+ "<KATAKANA>",
+ "<HANGUL>"
};
+
+ private int skippedPositions;
- /** @deprecated Please use {@link #TOKEN_TYPES} instead */
- public static final String [] tokenImage = TOKEN_TYPES;
-
- /**
- * Specifies whether deprecated acronyms should be replaced with HOST type.
- * This is false by default to support backward compatibility.
- *
- * See http://issues.apache.org/jira/browse/LUCENE-1068
- *
- * @deprecated this should be removed in the next release (3.0).
- */
- private boolean replaceInvalidAcronym = false;
-
- void setInput(Reader reader) {
- this.input = reader;
- }
-
private int maxTokenLength = StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH;
/** Set the max allowed token length. Any token longer
* than this is skipped. */
public void setMaxTokenLength(int length) {
+ if (length < 1) {
+ throw new IllegalArgumentException("maxTokenLength must be greater than zero");
+ }
this.maxTokenLength = length;
+ if (scanner instanceof StandardTokenizerImpl) {
+ scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
+ }
}
/** @see #setMaxTokenLength */
public int getMaxTokenLength() {
return maxTokenLength;
}
- /**
- * Creates a new instance of the {@link StandardTokenizer}. Attaches the
- * input
- * to a newly created JFlex scanner.
- */
- public StandardTokenizer(Reader input) {
- this.input = input;
- this.scanner = new StandardTokenizerImpl(input);
- }
-
/**
* Creates a new instance of the {@link org.apache.lucene.analysis.standard.StandardTokenizer}. Attaches
* the input
* to the newly created JFlex scanner.
*
* @param input The input reader
- * @param replaceInvalidAcronym Set to true to replace mischaracterized acronyms with HOST.
*
* See http://issues.apache.org/jira/browse/LUCENE-1068
*/
- public StandardTokenizer(Reader input, boolean replaceInvalidAcronym) {
- this.replaceInvalidAcronym = replaceInvalidAcronym;
- this.input = input;
- this.scanner = new StandardTokenizerImpl(input);
+ public StandardTokenizer(Reader input) {
+ this(Version.LATEST, input);
}
+ /**
+ * @deprecated Use {@link #StandardTokenizer(Reader)}
+ */
+ @Deprecated
+ public StandardTokenizer(Version matchVersion, Reader input) {
+ super(input);
+ init(matchVersion);
+ }
+
+ /**
+ * Creates a new StandardTokenizer with a given {@link org.apache.lucene.util.AttributeFactory}
+ */
+ public StandardTokenizer(AttributeFactory factory, Reader input) {
+ this(Version.LATEST, factory, input);
+ }
+
+ /**
+ * @deprecated Use {@link #StandardTokenizer(AttributeFactory, Reader)}
+ */
+ @Deprecated
+ public StandardTokenizer(Version matchVersion, AttributeFactory factory, Reader input) {
+ super(factory, input);
+ init(matchVersion);
+ }
+
+ private final void init(Version matchVersion) {
+ if (matchVersion.onOrAfter(Version.LUCENE_4_7)) {
+ this.scanner = new StandardTokenizerImpl(input);
+ } else if (matchVersion.onOrAfter(Version.LUCENE_4_0)) {
+ this.scanner = new StandardTokenizerImpl40(input);
+ } else if (matchVersion.onOrAfter(Version.LUCENE_3_4)) {
+ this.scanner = new StandardTokenizerImpl34(input);
+ } else if (matchVersion.onOrAfter(Version.LUCENE_3_1)) {
+ this.scanner = new StandardTokenizerImpl31(input);
+ } else {
+ this.scanner = new ClassicTokenizerImpl(input);
+ }
+ }
+
+ // this tokenizer generates three attributes:
+ // term offset, positionIncrement and type
+ private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+ private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
+ private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
+ private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
+
/*
* (non-Javadoc)
*
* @see org.apache.lucene.analysis.TokenStream#next()
*/
- public Token next(final Token reusableToken) throws IOException {
- assert reusableToken != null;
- int posIncr = 1;
+ @Override
+ public final boolean incrementToken() throws IOException {
+ clearAttributes();
+ skippedPositions = 0;
- while(true) {
- int tokenType = scanner.getNextToken();
+ while(true) {
+ int tokenType = scanner.getNextToken();
- if (tokenType == StandardTokenizerImpl.YYEOF) {
- return null;
- }
-
- if (scanner.yylength() <= maxTokenLength) {
- reusableToken.clear();
- reusableToken.setPositionIncrement(posIncr);
- scanner.getText(reusableToken);
- final int start = scanner.yychar();
- reusableToken.setStartOffset(start);
- reusableToken.setEndOffset(start+reusableToken.termLength());
- // This 'if' should be removed in the next release. For now, it converts
- // invalid acronyms to HOST. When removed, only the 'else' part should
- // remain.
- if (tokenType == StandardTokenizerImpl.ACRONYM_DEP) {
- if (replaceInvalidAcronym) {
- reusableToken.setType(StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.HOST]);
- reusableToken.setTermLength(reusableToken.termLength() - 1); // remove extra '.'
- } else {
- reusableToken.setType(StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.ACRONYM]);
- }
- } else {
- reusableToken.setType(StandardTokenizerImpl.TOKEN_TYPES[tokenType]);
- }
- return reusableToken;
- } else
- // When we skip a too-long term, we still increment the
- // position increment
- posIncr++;
+ if (tokenType == StandardTokenizerInterface.YYEOF) {
+ return false;
}
- }
- /*
- * (non-Javadoc)
- *
- * @see org.apache.lucene.analysis.TokenStream#reset()
- */
- public void reset() throws IOException {
- super.reset();
- scanner.yyreset(input);
+ if (scanner.yylength() <= maxTokenLength) {
+ posIncrAtt.setPositionIncrement(skippedPositions+1);
+ scanner.getText(termAtt);
+ final int start = scanner.yychar();
+ offsetAtt.setOffset(correctOffset(start), correctOffset(start+termAtt.length()));
+ // This 'if' should be removed in the next release. For now, it converts
+ // invalid acronyms to HOST. When removed, only the 'else' part should
+ // remain.
+ if (tokenType == StandardTokenizer.ACRONYM_DEP) {
+ typeAtt.setType(StandardTokenizer.TOKEN_TYPES[StandardTokenizer.HOST]);
+ termAtt.setLength(termAtt.length() - 1); // remove extra '.'
+ } else {
+ typeAtt.setType(StandardTokenizer.TOKEN_TYPES[tokenType]);
+ }
+ return true;
+ } else
+ // When we skip a too-long term, we still increment the
+ // position increment
+ skippedPositions++;
}
+ }
+
+ @Override
+ public final void end() throws IOException {
+ super.end();
+ // set final offset
+ int finalOffset = correctOffset(scanner.yychar() + scanner.yylength());
+ offsetAtt.setOffset(finalOffset, finalOffset);
+ // adjust any skipped tokens
+ posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement()+skippedPositions);
+ }
- public void reset(Reader reader) throws IOException {
- input = reader;
- reset();
- }
-
- /**
- * Prior to https://issues.apache.org/jira/browse/LUCENE-1068, StandardTokenizer mischaracterized as acronyms tokens like www.abc.com
- * when they should have been labeled as hosts instead.
- * @return true if StandardTokenizer now returns these tokens as Hosts, otherwise false
- *
- * @deprecated Remove in 3.X and make true the only valid value
- */
- public boolean isReplaceInvalidAcronym() {
- return replaceInvalidAcronym;
+ @Override
+ public void close() throws IOException {
+ super.close();
+ scanner.yyreset(input);
}
- /**
- *
- * @param replaceInvalidAcronym Set to true to replace mischaracterized acronyms as HOST.
- * @deprecated Remove in 3.X and make true the only valid value
- *
- * See https://issues.apache.org/jira/browse/LUCENE-1068
- */
- public void setReplaceInvalidAcronym(boolean replaceInvalidAcronym) {
- this.replaceInvalidAcronym = replaceInvalidAcronym;
+ @Override
+ public void reset() throws IOException {
+ super.reset();
+ scanner.yyreset(input);
+ skippedPositions = 0;
}
}
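The incrementToken changes in the hunk above fold every skipped over-long term into the position increment of the next emitted token (`posIncrAtt.setPositionIncrement(skippedPositions+1)`). A hypothetical standalone sketch of that accounting, under the assumption that tokens are characterized only by their lengths (`SkipIncrementSketch` is my name, not a Lucene class):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the skipped-position accounting in the diff above: a term longer
// than maxTokenLength is dropped, but its position still counts, so the next
// kept token's increment is 1 plus the number of terms skipped before it.
class SkipIncrementSketch {
    static List<Integer> increments(int[] termLengths, int maxTokenLength) {
        List<Integer> result = new ArrayList<>();
        int skippedPositions = 0;
        for (int len : termLengths) {
            if (len <= maxTokenLength) {
                result.add(skippedPositions + 1);  // emit, absorbing skips
                skippedPositions = 0;
            } else {
                skippedPositions++;  // too long: drop term, keep its position
            }
        }
        return result;
    }
}
```

This is why phrase queries still line up across a dropped huge token: the gap survives as a larger increment rather than disappearing.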
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java'.
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java 17 Aug 2012 14:55:14 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java 16 Dec 2014 11:32:10 -0000 1.1.2.1
@@ -1,8 +1,8 @@
-/* The following code was generated by JFlex 1.4.1 on 9/4/08 6:49 PM */
+/* The following code was generated by JFlex 1.6.0 */
package org.apache.lucene.analysis.standard;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -19,98 +19,192 @@
* limitations under the License.
*/
-/*
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
-NOTE: if you change this file and need to regenerate the tokenizer,
- remember to use JRE 1.4 when running jflex (before Lucene 3.0).
- This grammar now uses constructs (eg :digit:) whose meaning can
- vary according to the JRE used to run jflex. See
- https://issues.apache.org/jira/browse/LUCENE-1126 for details
-
-*/
-
-import org.apache.lucene.analysis.Token;
-
-
/**
- * This class is a scanner generated by
- * JFlex 1.4.1
- * on 9/4/08 6:49 PM from the specification file
- * /tango/mike/src/lucene.standarddigit/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
+ * This class implements Word Break rules from the Unicode Text Segmentation
+ * algorithm, as specified in
+ * Unicode Standard Annex #29.
+ *
+ * Tokens produced are of the following types:
+ *
+ * <ALPHANUM>: A sequence of alphabetic and numeric characters
+ * <NUM>: A number
+ * <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast
+ * Asian languages, including Thai, Lao, Myanmar, and Khmer
+ * <IDEOGRAPHIC>: A single CJKV ideographic character
+ * <HIRAGANA>: A single hiragana character
+ * <KATAKANA>: A sequence of katakana characters
+ * <HANGUL>: A sequence of Hangul characters
+ *
*/
-class StandardTokenizerImpl {
+public final class StandardTokenizerImpl implements StandardTokenizerInterface {
+
/** This character denotes the end of file */
public static final int YYEOF = -1;
/** initial size of the lookahead buffer */
- private static final int ZZ_BUFFERSIZE = 16384;
+ private int ZZ_BUFFERSIZE = 255;
/** lexical states */
public static final int YYINITIAL = 0;
+ /**
+ * ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
+ * ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
+ * at the beginning of a line
+ * l is of the form l = 2*k, k a non negative integer
+ */
+ private static final int ZZ_LEXSTATE[] = {
+ 0, 0
+ };
+
/**
* Translates characters to character classes
*/
private static final String ZZ_CMAP_PACKED =
- "\11\0\1\0\1\15\1\0\1\0\1\14\22\0\1\0\5\0\1\5"+
- "\1\3\4\0\1\11\1\7\1\4\1\11\12\2\6\0\1\6\32\12"+
- "\4\0\1\10\1\0\32\12\57\0\1\12\12\0\1\12\4\0\1\12"+
- "\5\0\27\12\1\0\37\12\1\0\u0128\12\2\0\22\12\34\0\136\12"+
- "\2\0\11\12\2\0\7\12\16\0\2\12\16\0\5\12\11\0\1\12"+
- "\213\0\1\12\13\0\1\12\1\0\3\12\1\0\1\12\1\0\24\12"+
- "\1\0\54\12\1\0\10\12\2\0\32\12\14\0\202\12\12\0\71\12"+
- "\2\0\2\12\2\0\2\12\3\0\46\12\2\0\2\12\67\0\46\12"+
- "\2\0\1\12\7\0\47\12\110\0\33\12\5\0\3\12\56\0\32\12"+
- "\5\0\13\12\25\0\12\2\7\0\143\12\1\0\1\12\17\0\2\12"+
- "\11\0\12\2\3\12\23\0\1\12\1\0\33\12\123\0\46\12\u015f\0"+
- "\65\12\3\0\1\12\22\0\1\12\7\0\12\12\4\0\12\2\25\0"+
- "\10\12\2\0\2\12\2\0\26\12\1\0\7\12\1\0\1\12\3\0"+
- "\4\12\42\0\2\12\1\0\3\12\4\0\12\2\2\12\23\0\6\12"+
- "\4\0\2\12\2\0\26\12\1\0\7\12\1\0\2\12\1\0\2\12"+
- "\1\0\2\12\37\0\4\12\1\0\1\12\7\0\12\2\2\0\3\12"+
- "\20\0\7\12\1\0\1\12\1\0\3\12\1\0\26\12\1\0\7\12"+
- "\1\0\2\12\1\0\5\12\3\0\1\12\22\0\1\12\17\0\1\12"+
- "\5\0\12\2\25\0\10\12\2\0\2\12\2\0\26\12\1\0\7\12"+
- "\1\0\2\12\2\0\4\12\3\0\1\12\36\0\2\12\1\0\3\12"+
- "\4\0\12\2\25\0\6\12\3\0\3\12\1\0\4\12\3\0\2\12"+
- "\1\0\1\12\1\0\2\12\3\0\2\12\3\0\3\12\3\0\10\12"+
- "\1\0\3\12\55\0\11\2\25\0\10\12\1\0\3\12\1\0\27\12"+
- "\1\0\12\12\1\0\5\12\46\0\2\12\4\0\12\2\25\0\10\12"+
- "\1\0\3\12\1\0\27\12\1\0\12\12\1\0\5\12\44\0\1\12"+
- "\1\0\2\12\4\0\12\2\25\0\10\12\1\0\3\12\1\0\27\12"+
- "\1\0\20\12\46\0\2\12\4\0\12\2\25\0\22\12\3\0\30\12"+
- "\1\0\11\12\1\0\1\12\2\0\7\12\71\0\1\1\60\12\1\1"+
- "\2\12\14\1\7\12\11\1\12\2\47\0\2\12\1\0\1\12\2\0"+
- "\2\12\1\0\1\12\2\0\1\12\6\0\4\12\1\0\7\12\1\0"+
- "\3\12\1\0\1\12\1\0\1\12\2\0\2\12\1\0\4\12\1\0"+
- "\2\12\11\0\1\12\2\0\5\12\1\0\1\12\11\0\12\2\2\0"+
- "\2\12\42\0\1\12\37\0\12\2\26\0\10\12\1\0\42\12\35\0"+
- "\4\12\164\0\42\12\1\0\5\12\1\0\2\12\25\0\12\2\6\0"+
- "\6\12\112\0\46\12\12\0\47\12\11\0\132\12\5\0\104\12\5\0"+
- "\122\12\6\0\7\12\1\0\77\12\1\0\1\12\1\0\4\12\2\0"+
- "\7\12\1\0\1\12\1\0\4\12\2\0\47\12\1\0\1\12\1\0"+
- "\4\12\2\0\37\12\1\0\1\12\1\0\4\12\2\0\7\12\1\0"+
- "\1\12\1\0\4\12\2\0\7\12\1\0\7\12\1\0\27\12\1\0"+
- "\37\12\1\0\1\12\1\0\4\12\2\0\7\12\1\0\47\12\1\0"+
- "\23\12\16\0\11\2\56\0\125\12\14\0\u026c\12\2\0\10\12\12\0"+
- "\32\12\5\0\113\12\225\0\64\12\54\0\12\2\46\0\12\2\6\0"+
- "\130\12\10\0\51\12\u0557\0\234\12\4\0\132\12\6\0\26\12\2\0"+
- "\6\12\2\0\46\12\2\0\6\12\2\0\10\12\1\0\1\12\1\0"+
- "\1\12\1\0\1\12\1\0\37\12\2\0\65\12\1\0\7\12\1\0"+
- "\1\12\3\0\3\12\1\0\7\12\3\0\4\12\2\0\6\12\4\0"+
- "\15\12\5\0\3\12\1\0\7\12\202\0\1\12\202\0\1\12\4\0"+
- "\1\12\2\0\12\12\1\0\1\12\3\0\5\12\6\0\1\12\1\0"+
- "\1\12\1\0\1\12\1\0\4\12\1\0\3\12\1\0\7\12\u0ecb\0"+
- "\2\12\52\0\5\12\12\0\1\13\124\13\10\13\2\13\2\13\132\13"+
- "\1\13\3\13\6\13\50\13\3\13\1\0\136\12\21\0\30\12\70\0"+
- "\20\13\u0100\0\200\13\200\0\u19b6\13\12\13\100\0\u51a6\13\132\13\u048d\12"+
- "\u0773\0\u2ba4\12\u215c\0\u012e\13\322\13\7\12\14\0\5\12\5\0\1\12"+
- "\1\0\12\12\1\0\15\12\1\0\5\12\1\0\1\12\1\0\2\12"+
- "\1\0\2\12\1\0\154\12\41\0\u016b\12\22\0\100\12\2\0\66\12"+
- "\50\0\14\12\164\0\3\12\1\0\1\12\1\0\207\12\23\0\12\2"+
- "\7\0\32\12\6\0\32\12\12\0\1\13\72\13\37\12\3\0\6\12"+
- "\2\0\6\12\2\0\6\12\2\0\3\12\43\0";
+ "\42\0\1\15\4\0\1\14\4\0\1\7\1\0\1\10\1\0\12\4"+
+ "\1\6\1\7\5\0\32\1\4\0\1\11\1\0\32\1\57\0\1\1"+
+ "\2\0\1\3\7\0\1\1\1\0\1\6\2\0\1\1\5\0\27\1"+
+ "\1\0\37\1\1\0\u01ca\1\4\0\14\1\5\0\1\6\10\0\5\1"+
+ "\7\0\1\1\1\0\1\1\21\0\160\3\5\1\1\0\2\1\2\0"+
+ "\4\1\1\7\7\0\1\1\1\6\3\1\1\0\1\1\1\0\24\1"+
+ "\1\0\123\1\1\0\213\1\1\0\7\3\236\1\11\0\46\1\2\0"+
+ "\1\1\7\0\47\1\1\0\1\7\7\0\55\3\1\0\1\3\1\0"+
+ "\2\3\1\0\2\3\1\0\1\3\10\0\33\16\5\0\3\16\1\1"+
+ "\1\6\13\0\5\3\7\0\2\7\2\0\13\3\1\0\1\3\3\0"+
+ "\53\1\25\3\12\4\1\0\1\4\1\7\1\0\2\1\1\3\143\1"+
+ "\1\0\1\1\10\3\1\0\6\3\2\1\2\3\1\0\4\3\2\1"+
+ "\12\4\3\1\2\0\1\1\17\0\1\3\1\1\1\3\36\1\33\3"+
+ "\2\0\131\1\13\3\1\1\16\0\12\4\41\1\11\3\2\1\2\0"+
+ "\1\7\1\0\1\1\5\0\26\1\4\3\1\1\11\3\1\1\3\3"+
+ "\1\1\5\3\22\0\31\1\3\3\104\0\1\1\1\0\13\1\67\0"+
+ "\33\3\1\0\4\3\66\1\3\3\1\1\22\3\1\1\7\3\12\1"+
+ "\2\3\2\0\12\4\1\0\7\1\1\0\7\1\1\0\3\3\1\0"+
+ "\10\1\2\0\2\1\2\0\26\1\1\0\7\1\1\0\1\1\3\0"+
+ "\4\1\2\0\1\3\1\1\7\3\2\0\2\3\2\0\3\3\1\1"+
+ "\10\0\1\3\4\0\2\1\1\0\3\1\2\3\2\0\12\4\2\1"+
+ "\17\0\3\3\1\0\6\1\4\0\2\1\2\0\26\1\1\0\7\1"+
+ "\1\0\2\1\1\0\2\1\1\0\2\1\2\0\1\3\1\0\5\3"+
+ "\4\0\2\3\2\0\3\3\3\0\1\3\7\0\4\1\1\0\1\1"+
+ "\7\0\12\4\2\3\3\1\1\3\13\0\3\3\1\0\11\1\1\0"+
+ "\3\1\1\0\26\1\1\0\7\1\1\0\2\1\1\0\5\1\2\0"+
+ "\1\3\1\1\10\3\1\0\3\3\1\0\3\3\2\0\1\1\17\0"+
+ "\2\1\2\3\2\0\12\4\21\0\3\3\1\0\10\1\2\0\2\1"+
+ "\2\0\26\1\1\0\7\1\1\0\2\1\1\0\5\1\2\0\1\3"+
+ "\1\1\7\3\2\0\2\3\2\0\3\3\10\0\2\3\4\0\2\1"+
+ "\1\0\3\1\2\3\2\0\12\4\1\0\1\1\20\0\1\3\1\1"+
+ "\1\0\6\1\3\0\3\1\1\0\4\1\3\0\2\1\1\0\1\1"+
+ "\1\0\2\1\3\0\2\1\3\0\3\1\3\0\14\1\4\0\5\3"+
+ "\3\0\3\3\1\0\4\3\2\0\1\1\6\0\1\3\16\0\12\4"+
+ "\21\0\3\3\1\0\10\1\1\0\3\1\1\0\27\1\1\0\12\1"+
+ "\1\0\5\1\3\0\1\1\7\3\1\0\3\3\1\0\4\3\7\0"+
+ "\2\3\1\0\2\1\6\0\2\1\2\3\2\0\12\4\22\0\2\3"+
+ "\1\0\10\1\1\0\3\1\1\0\27\1\1\0\12\1\1\0\5\1"+
+ "\2\0\1\3\1\1\7\3\1\0\3\3\1\0\4\3\7\0\2\3"+
+ "\7\0\1\1\1\0\2\1\2\3\2\0\12\4\1\0\2\1\17\0"+
+ "\2\3\1\0\10\1\1\0\3\1\1\0\51\1\2\0\1\1\7\3"+
+ "\1\0\3\3\1\0\4\3\1\1\10\0\1\3\10\0\2\1\2\3"+
+ "\2\0\12\4\12\0\6\1\2\0\2\3\1\0\22\1\3\0\30\1"+
+ "\1\0\11\1\1\0\1\1\2\0\7\1\3\0\1\3\4\0\6\3"+
+ "\1\0\1\3\1\0\10\3\22\0\2\3\15\0\60\20\1\21\2\20"+
+ "\7\21\5\0\7\20\10\21\1\0\12\4\47\0\2\20\1\0\1\20"+
+ "\2\0\2\20\1\0\1\20\2\0\1\20\6\0\4\20\1\0\7\20"+
+ "\1\0\3\20\1\0\1\20\1\0\1\20\2\0\2\20\1\0\4\20"+
+ "\1\21\2\20\6\21\1\0\2\21\1\20\2\0\5\20\1\0\1\20"+
+ "\1\0\6\21\2\0\12\4\2\0\4\20\40\0\1\1\27\0\2\3"+
+ "\6\0\12\4\13\0\1\3\1\0\1\3\1\0\1\3\4\0\2\3"+
+ "\10\1\1\0\44\1\4\0\24\3\1\0\2\3\5\1\13\3\1\0"+
+ "\44\3\11\0\1\3\71\0\53\20\24\21\1\20\12\4\6\0\6\20"+
+ "\4\21\4\20\3\21\1\20\3\21\2\20\7\21\3\20\4\21\15\20"+
+ "\14\21\1\20\1\21\12\4\4\21\2\20\46\1\1\0\1\1\5\0"+
+ "\1\1\2\0\53\1\1\0\4\1\u0100\2\111\1\1\0\4\1\2\0"+
+ "\7\1\1\0\1\1\1\0\4\1\2\0\51\1\1\0\4\1\2\0"+
+ "\41\1\1\0\4\1\2\0\7\1\1\0\1\1\1\0\4\1\2\0"+
+ "\17\1\1\0\71\1\1\0\4\1\2\0\103\1\2\0\3\3\40\0"+
+ "\20\1\20\0\125\1\14\0\u026c\1\2\0\21\1\1\0\32\1\5\0"+
+ "\113\1\3\0\3\1\17\0\15\1\1\0\4\1\3\3\13\0\22\1"+
+ "\3\3\13\0\22\1\2\3\14\0\15\1\1\0\3\1\1\0\2\3"+
+ "\14\0\64\20\40\21\3\0\1\20\4\0\1\20\1\21\2\0\12\4"+
+ "\41\0\4\3\1\0\12\4\6\0\130\1\10\0\51\1\1\3\1\1"+
+ "\5\0\106\1\12\0\35\1\3\0\14\3\4\0\14\3\12\0\12\4"+
+ "\36\20\2\0\5\20\13\0\54\20\4\0\21\21\7\20\2\21\6\0"+
+ "\12\4\1\20\3\0\2\20\40\0\27\1\5\3\4\0\65\20\12\21"+
+ "\1\0\35\21\2\0\1\3\12\4\6\0\12\4\6\0\16\20\122\0"+
+ "\5\3\57\1\21\3\7\1\4\0\12\4\21\0\11\3\14\0\3\3"+
+ "\36\1\15\3\2\1\12\4\54\1\16\3\14\0\44\1\24\3\10\0"+
+ "\12\4\3\0\3\1\12\4\44\1\122\0\3\3\1\0\25\3\4\1"+
+ "\1\3\4\1\3\3\2\1\11\0\300\1\47\3\25\0\4\3\u0116\1"+
+ "\2\0\6\1\2\0\46\1\2\0\6\1\2\0\10\1\1\0\1\1"+
+ "\1\0\1\1\1\0\1\1\1\0\37\1\2\0\65\1\1\0\7\1"+
+ "\1\0\1\1\3\0\3\1\1\0\7\1\3\0\4\1\2\0\6\1"+
+ "\4\0\15\1\5\0\3\1\1\0\7\1\17\0\4\3\10\0\2\10"+
+ "\12\0\1\10\2\0\1\6\2\0\5\3\20\0\2\11\3\0\1\7"+
+ "\17\0\1\11\13\0\5\3\1\0\12\3\1\0\1\1\15\0\1\1"+
+ "\20\0\15\1\63\0\41\3\21\0\1\1\4\0\1\1\2\0\12\1"+
+ "\1\0\1\1\3\0\5\1\6\0\1\1\1\0\1\1\1\0\1\1"+
+ "\1\0\4\1\1\0\13\1\2\0\4\1\5\0\5\1\4\0\1\1"+
+ "\21\0\51\1\u032d\0\64\1\u0716\0\57\1\1\0\57\1\1\0\205\1"+
+ "\6\0\4\1\3\3\2\1\14\0\46\1\1\0\1\1\5\0\1\1"+
+ "\2\0\70\1\7\0\1\1\17\0\1\3\27\1\11\0\7\1\1\0"+
+ "\7\1\1\0\7\1\1\0\7\1\1\0\7\1\1\0\7\1\1\0"+
+ "\7\1\1\0\7\1\1\0\40\3\57\0\1\1\120\0\32\12\1\0"+
+ "\131\12\14\0\326\12\57\0\1\1\1\0\1\12\31\0\11\12\6\3"+
+ "\1\0\5\5\2\0\3\12\1\1\1\1\4\0\126\13\2\0\2\3"+
+ "\2\5\3\13\133\5\1\0\4\5\5\0\51\1\3\0\136\2\21\0"+
+ "\33\1\65\0\20\5\320\0\57\5\1\0\130\5\250\0\u19b6\12\112\0"+
+ "\u51cd\12\63\0\u048d\1\103\0\56\1\2\0\u010d\1\3\0\20\1\12\4"+
+ "\2\1\24\0\57\1\4\3\1\0\12\3\1\0\31\1\7\0\1\3"+
+ "\120\1\2\3\45\0\11\1\2\0\147\1\2\0\4\1\1\0\4\1"+
+ "\14\0\13\1\115\0\12\1\1\3\3\1\1\3\4\1\1\3\27\1"+
+ "\5\3\30\0\64\1\14\0\2\3\62\1\21\3\13\0\12\4\6\0"+
+ "\22\3\6\1\3\0\1\1\4\0\12\4\34\1\10\3\2\0\27\1"+
+ "\15\3\14\0\35\2\3\0\4\3\57\1\16\3\16\0\1\1\12\4"+
+ "\46\0\51\1\16\3\11\0\3\1\1\3\10\1\2\3\2\0\12\4"+
+ "\6\0\33\20\1\21\4\0\60\20\1\21\1\20\3\21\2\20\2\21"+
+ "\5\20\2\21\1\20\1\21\1\20\30\0\5\20\13\1\5\3\2\0"+
+ "\3\1\2\3\12\0\6\1\2\0\6\1\2\0\6\1\11\0\7\1"+
+ "\1\0\7\1\221\0\43\1\10\3\1\0\2\3\2\0\12\4\6\0"+
+ "\u2ba4\2\14\0\27\2\4\0\61\2\u2104\0\u016e\12\2\0\152\12\46\0"+
+ "\7\1\14\0\5\1\5\0\1\16\1\3\12\16\1\0\15\16\1\0"+
+ "\5\16\1\0\1\16\1\0\2\16\1\0\2\16\1\0\12\16\142\1"+
+ "\41\0\u016b\1\22\0\100\1\2\0\66\1\50\0\14\1\4\0\20\3"+
+ "\1\7\2\0\1\6\1\7\13\0\7\3\14\0\2\11\30\0\3\11"+
+ "\1\7\1\0\1\10\1\0\1\7\1\6\32\0\5\1\1\0\207\1"+
+ "\2\0\1\3\7\0\1\10\4\0\1\7\1\0\1\10\1\0\12\4"+
+ "\1\6\1\7\5\0\32\1\4\0\1\11\1\0\32\1\13\0\70\5"+
+ "\2\3\37\2\3\0\6\2\2\0\6\2\2\0\6\2\2\0\3\2"+
+ "\34\0\3\3\4\0\14\1\1\0\32\1\1\0\23\1\1\0\2\1"+
+ "\1\0\17\1\2\0\16\1\42\0\173\1\105\0\65\1\210\0\1\3"+
+ "\202\0\35\1\3\0\61\1\57\0\37\1\21\0\33\1\65\0\36\1"+
+ "\2\0\44\1\4\0\10\1\1\0\5\1\52\0\236\1\2\0\12\4"+
+ "\u0356\0\6\1\2\0\1\1\1\0\54\1\1\0\2\1\3\0\1\1"+
+ "\2\0\27\1\252\0\26\1\12\0\32\1\106\0\70\1\6\0\2\1"+
+ "\100\0\1\1\3\3\1\0\2\3\5\0\4\3\4\1\1\0\3\1"+
+ "\1\0\33\1\4\0\3\3\4\0\1\3\40\0\35\1\203\0\66\1"+
+ "\12\0\26\1\12\0\23\1\215\0\111\1\u03b7\0\3\3\65\1\17\3"+
+ "\37\0\12\4\20\0\3\3\55\1\13\3\2\0\1\3\22\0\31\1"+
+ "\7\0\12\4\6\0\3\3\44\1\16\3\1\0\12\4\100\0\3\3"+
+ "\60\1\16\3\4\1\13\0\12\4\u04a6\0\53\1\15\3\10\0\12\4"+
+ "\u0936\0\u036f\1\221\0\143\1\u0b9d\0\u042f\1\u33d1\0\u0239\1\u04c7\0\105\1"+
+ "\13\0\1\1\56\3\20\0\4\3\15\1\u4060\0\1\5\1\13\u2163\0"+
+ "\5\3\3\0\26\3\2\0\7\3\36\0\4\3\224\0\3\3\u01bb\0"+
+ "\125\1\1\0\107\1\1\0\2\1\2\0\1\1\2\0\2\1\2\0"+
+ "\4\1\1\0\14\1\1\0\1\1\1\0\7\1\1\0\101\1\1\0"+
+ "\4\1\2\0\10\1\1\0\7\1\1\0\34\1\1\0\4\1\1\0"+
+ "\5\1\1\0\1\1\3\0\7\1\1\0\u0154\1\2\0\31\1\1\0"+
+ "\31\1\1\0\37\1\1\0\31\1\1\0\37\1\1\0\31\1\1\0"+
+ "\37\1\1\0\31\1\1\0\37\1\1\0\31\1\1\0\10\1\2\0"+
+ "\62\4\u1600\0\4\1\1\0\33\1\1\0\2\1\1\0\1\1\2\0"+
+ "\1\1\1\0\12\1\1\0\4\1\1\0\1\1\1\0\1\1\6\0"+
+ "\1\1\4\0\1\1\1\0\1\1\1\0\1\1\1\0\3\1\1\0"+
+ "\2\1\1\0\1\1\2\0\1\1\1\0\1\1\1\0\1\1\1\0"+
+ "\1\1\1\0\1\1\1\0\2\1\1\0\1\1\2\0\4\1\1\0"+
+ "\7\1\1\0\4\1\1\0\4\1\1\0\1\1\1\0\12\1\1\0"+
+ "\21\1\5\0\3\1\1\0\5\1\1\0\21\1\u032a\0\32\17\1\13"+
+ "\u0dff\0\ua6d7\12\51\0\u1035\12\13\0\336\12\u3fe2\0\u021e\12\uffff\0\uffff\0\uffff\0\uffff\0\uffff\0\uffff\0\uffff\0\uffff\0\uffff\0\uffff\0\uffff\0\u05ee\0"+
+ "\1\3\36\0\140\3\200\0\360\3\uffff\0\uffff\0\ufe12\0";
/**
* Translates characters to character classes
@@ -123,13 +217,12 @@
private static final int [] ZZ_ACTION = zzUnpackAction();
private static final String ZZ_ACTION_PACKED_0 =
- "\1\0\1\1\3\2\1\3\1\1\13\0\1\2\3\4"+
- "\2\0\1\5\1\0\1\5\3\4\6\5\1\6\1\4"+
- "\2\7\1\10\1\0\1\10\3\0\2\10\1\11\1\12"+
- "\1\4";
+ "\1\0\1\1\1\2\1\3\1\4\1\5\1\1\1\6"+
+ "\1\7\1\2\1\1\1\10\1\2\1\0\1\2\1\0"+
+ "\1\4\1\0\2\2\2\0\1\1\1\0";
private static int [] zzUnpackAction() {
- int [] result = new int[51];
+ int [] result = new int[24];
int offset = 0;
offset = zzUnpackAction(ZZ_ACTION_PACKED_0, offset, result);
return result;
@@ -154,16 +247,12 @@
private static final int [] ZZ_ROWMAP = zzUnpackRowMap();
private static final String ZZ_ROWMAP_PACKED_0 =
- "\0\0\0\16\0\34\0\52\0\70\0\16\0\106\0\124"+
- "\0\142\0\160\0\176\0\214\0\232\0\250\0\266\0\304"+
- "\0\322\0\340\0\356\0\374\0\u010a\0\u0118\0\u0126\0\u0134"+
- "\0\u0142\0\u0150\0\u015e\0\u016c\0\u017a\0\u0188\0\u0196\0\u01a4"+
- "\0\u01b2\0\u01c0\0\u01ce\0\u01dc\0\u01ea\0\u01f8\0\322\0\u0206"+
- "\0\u0214\0\u0222\0\u0230\0\u023e\0\u024c\0\u025a\0\124\0\214"+
- "\0\u0268\0\u0276\0\u0284";
+ "\0\0\0\22\0\44\0\66\0\110\0\132\0\154\0\176"+
+ "\0\220\0\242\0\264\0\306\0\330\0\352\0\374\0\u010e"+
+ "\0\u0120\0\154\0\u0132\0\u0144\0\u0156\0\264\0\u0168\0\u017a";
private static int [] zzUnpackRowMap() {
- int [] result = new int[51];
+ int [] result = new int[24];
int offset = 0;
offset = zzUnpackRowMap(ZZ_ROWMAP_PACKED_0, offset, result);
return result;
@@ -186,49 +275,33 @@
private static final int [] ZZ_TRANS = zzUnpackTrans();
private static final String ZZ_TRANS_PACKED_0 =
- "\1\2\1\3\1\4\7\2\1\5\1\6\1\7\1\2"+
- "\17\0\2\3\1\0\1\10\1\0\1\11\2\12\1\13"+
- "\1\3\4\0\1\3\1\4\1\0\1\14\1\0\1\11"+
- "\2\15\1\16\1\4\4\0\1\3\1\4\1\17\1\20"+
- "\1\21\1\22\2\12\1\13\1\23\20\0\1\2\1\0"+
- "\1\24\1\25\7\0\1\26\4\0\2\27\7\0\1\27"+
- "\4\0\1\30\1\31\7\0\1\32\5\0\1\33\7\0"+
- "\1\13\4\0\1\34\1\35\7\0\1\36\4\0\1\37"+
- "\1\40\7\0\1\41\4\0\1\42\1\43\7\0\1\44"+
- "\15\0\1\45\4\0\1\24\1\25\7\0\1\46\15\0"+
- "\1\47\4\0\2\27\7\0\1\50\4\0\1\3\1\4"+
- "\1\17\1\10\1\21\1\22\2\12\1\13\1\23\4\0"+
- "\2\24\1\0\1\51\1\0\1\11\2\52\1\0\1\24"+
- "\4\0\1\24\1\25\1\0\1\53\1\0\1\11\2\54"+
- "\1\55\1\25\4\0\1\24\1\25\1\0\1\51\1\0"+
- "\1\11\2\52\1\0\1\26\4\0\2\27\1\0\1\56"+
- "\2\0\1\56\2\0\1\27\4\0\2\30\1\0\1\52"+
- "\1\0\1\11\2\52\1\0\1\30\4\0\1\30\1\31"+
- "\1\0\1\54\1\0\1\11\2\54\1\55\1\31\4\0"+
- "\1\30\1\31\1\0\1\52\1\0\1\11\2\52\1\0"+
- "\1\32\5\0\1\33\1\0\1\55\2\0\3\55\1\33"+
- "\4\0\2\34\1\0\1\57\1\0\1\11\2\12\1\13"+
- "\1\34\4\0\1\34\1\35\1\0\1\60\1\0\1\11"+
- "\2\15\1\16\1\35\4\0\1\34\1\35\1\0\1\57"+
- "\1\0\1\11\2\12\1\13\1\36\4\0\2\37\1\0"+
- "\1\12\1\0\1\11\2\12\1\13\1\37\4\0\1\37"+
- "\1\40\1\0\1\15\1\0\1\11\2\15\1\16\1\40"+
- "\4\0\1\37\1\40\1\0\1\12\1\0\1\11\2\12"+
- "\1\13\1\41\4\0\2\42\1\0\1\13\2\0\3\13"+
- "\1\42\4\0\1\42\1\43\1\0\1\16\2\0\3\16"+
- "\1\43\4\0\1\42\1\43\1\0\1\13\2\0\3\13"+
- "\1\44\6\0\1\17\6\0\1\45\4\0\1\24\1\25"+
- "\1\0\1\61\1\0\1\11\2\52\1\0\1\26\4\0"+
- "\2\27\1\0\1\56\2\0\1\56\2\0\1\50\4\0"+
- "\2\24\7\0\1\24\4\0\2\30\7\0\1\30\4\0"+
- "\2\34\7\0\1\34\4\0\2\37\7\0\1\37\4\0"+
- "\2\42\7\0\1\42\4\0\2\62\7\0\1\62\4\0"+
- "\2\24\7\0\1\63\4\0\2\62\1\0\1\56\2\0"+
- "\1\56\2\0\1\62\4\0\2\24\1\0\1\61\1\0"+
- "\1\11\2\52\1\0\1\24\3\0";
+ "\1\2\1\3\1\4\1\2\1\5\1\6\3\2\1\7"+
+ "\1\10\1\11\2\2\1\12\1\13\2\14\23\0\3\3"+
+ "\1\15\1\0\1\16\1\0\1\16\1\17\2\0\1\16"+
+ "\1\0\1\12\2\0\1\3\1\0\1\3\2\4\1\15"+
+ "\1\0\1\16\1\0\1\16\1\17\2\0\1\16\1\0"+
+ "\1\12\2\0\1\4\1\0\2\3\2\5\2\0\2\20"+
+ "\1\21\2\0\1\20\1\0\1\12\2\0\1\5\3\0"+
+ "\1\6\1\0\1\6\3\0\1\17\7\0\1\6\1\0"+
+ "\2\3\1\22\1\5\1\23\3\0\1\22\4\0\1\12"+
+ "\2\0\1\22\3\0\1\10\15\0\1\10\3\0\1\11"+
+ "\15\0\1\11\1\0\2\3\1\12\1\15\1\0\1\16"+
+ "\1\0\1\16\1\17\2\0\1\24\1\25\1\12\2\0"+
+ "\1\12\3\0\1\26\13\0\1\27\1\0\1\26\3\0"+
+ "\1\14\14\0\2\14\1\0\2\3\2\15\2\0\2\30"+
+ "\1\17\2\0\1\30\1\0\1\12\2\0\1\15\1\0"+
+ "\2\3\1\16\12\0\1\3\2\0\1\16\1\0\2\3"+
+ "\1\17\1\15\1\23\3\0\1\17\4\0\1\12\2\0"+
+ "\1\17\3\0\1\20\1\5\14\0\1\20\1\0\2\3"+
+ "\1\21\1\5\1\23\3\0\1\21\4\0\1\12\2\0"+
+ "\1\21\3\0\1\23\1\0\1\23\3\0\1\17\7\0"+
+ "\1\23\1\0\2\3\1\24\1\15\4\0\1\17\4\0"+
+ "\1\12\2\0\1\24\3\0\1\25\12\0\1\24\2\0"+
+ "\1\25\3\0\1\27\13\0\1\27\1\0\1\27\3\0"+
+ "\1\30\1\15\14\0\1\30";
private static int [] zzUnpackTrans() {
- int [] result = new int[658];
+ int [] result = new int[396];
int offset = 0;
offset = zzUnpackTrans(ZZ_TRANS_PACKED_0, offset, result);
return result;
@@ -266,11 +339,11 @@
private static final int [] ZZ_ATTRIBUTE = zzUnpackAttribute();
private static final String ZZ_ATTRIBUTE_PACKED_0 =
- "\1\0\1\11\3\1\1\11\1\1\13\0\4\1\2\0"+
- "\1\1\1\0\17\1\1\0\1\1\3\0\5\1";
+ "\1\0\1\11\13\1\1\0\1\1\1\0\1\1\1\0"+
+ "\2\1\2\0\1\1\1\0";
private static int [] zzUnpackAttribute() {
- int [] result = new int[51];
+ int [] result = new int[24];
int offset = 0;
offset = zzUnpackAttribute(ZZ_ATTRIBUTE_PACKED_0, offset, result);
return result;
@@ -304,9 +377,6 @@
/** the textposition at the last accepting state */
private int zzMarkedPos;
- /** the textposition at the last state to be included in yytext */
- private int zzPushbackPos;
-
/** the current text position in the buffer */
private int zzCurrentPos;
@@ -337,57 +407,74 @@
/** zzAtEOF == true <=> the scanner is at the EOF */
private boolean zzAtEOF;
+ /** denotes if the user-EOF-code has already been executed */
+ private boolean zzEOFDone;
+
+ /**
+ * The number of occupied positions in zzBuffer beyond zzEndRead.
+ * When a lead/high surrogate has been read from the input stream
+ * into the final zzBuffer position, this will have a value of 1;
+ * otherwise, it will have a value of 0.
+ */
+ private int zzFinalHighSurrogate = 0;
+
/* user code: */
+ /** Alphanumeric sequences */
+ public static final int WORD_TYPE = StandardTokenizer.ALPHANUM;
+
+ /** Numbers */
+ public static final int NUMERIC_TYPE = StandardTokenizer.NUM;
+
+ /**
+ * Chars in class \p{Line_Break = Complex_Context} are from South East Asian
+ * scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
+ * together as a single token rather than broken up, because the logic
+ * required to break them at word boundaries is too complex for UAX#29.
+ *
+ * See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
+ */
+ public static final int SOUTH_EAST_ASIAN_TYPE = StandardTokenizer.SOUTHEAST_ASIAN;
+
+ public static final int IDEOGRAPHIC_TYPE = StandardTokenizer.IDEOGRAPHIC;
+
+ public static final int HIRAGANA_TYPE = StandardTokenizer.HIRAGANA;
+
+ public static final int KATAKANA_TYPE = StandardTokenizer.KATAKANA;
+
+ public static final int HANGUL_TYPE = StandardTokenizer.HANGUL;
-public static final int ALPHANUM = StandardTokenizer.ALPHANUM;
-public static final int APOSTROPHE = StandardTokenizer.APOSTROPHE;
-public static final int ACRONYM = StandardTokenizer.ACRONYM;
-public static final int COMPANY = StandardTokenizer.COMPANY;
-public static final int EMAIL = StandardTokenizer.EMAIL;
-public static final int HOST = StandardTokenizer.HOST;
-public static final int NUM = StandardTokenizer.NUM;
-public static final int CJ = StandardTokenizer.CJ;
-/**
- * @deprecated this solves a bug where HOSTs that end with '.' are identified
- * as ACRONYMs. It is deprecated and will be removed in the next
- * release.
- */
-public static final int ACRONYM_DEP = StandardTokenizer.ACRONYM_DEP;
-
-public static final String [] TOKEN_TYPES = StandardTokenizer.TOKEN_TYPES;
-
-public final int yychar()
-{
+ public final int yychar()
+ {
return yychar;
-}
+ }
-/**
- * Fills Lucene token with the current token text.
- */
-final void getText(Token t) {
- t.setTermBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
-}
+ /**
+ * Fills CharTermAttribute with the current token text.
+ */
+ public final void getText(CharTermAttribute t) {
+ t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
+ }
+
+ /**
+ * Sets the scanner buffer size in chars
+ */
+ public final void setBufferSize(int numChars) {
+ ZZ_BUFFERSIZE = numChars;
+ char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
+ System.arraycopy(zzBuffer, 0, newZzBuffer, 0, Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
+ zzBuffer = newZzBuffer;
+ }
/**
* Creates a new scanner
- * There is also a java.io.InputStream version of this constructor.
*
* @param in the java.io.Reader to read input from.
*/
- StandardTokenizerImpl(java.io.Reader in) {
+ public StandardTokenizerImpl(java.io.Reader in) {
this.zzReader = in;
}
- /**
- * Creates a new scanner.
- * There is also java.io.Reader version of this constructor.
- *
- * @param in the java.io.Inputstream to read input from.
- */
- StandardTokenizerImpl(java.io.InputStream in) {
- this(new java.io.InputStreamReader(in));
- }
/**
* Unpacks the compressed character translation table.
@@ -396,10 +483,10 @@
* @return the unpacked character translation table
*/
private static char [] zzUnpackCMap(String packed) {
- char [] map = new char[0x10000];
+ char [] map = new char[0x110000];
int i = 0; /* index in packed string */
int j = 0; /* index in unpacked array */
- while (i < 1154) {
+ while (i < 2836) {
int count = packed.charAt(i++);
char value = packed.charAt(i++);
do map[j++] = value; while (--count > 0);
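The zzUnpackCMap hunk above decodes the packed character-class table: the packed string is a run-length encoding where chars alternate as (count, value) pairs. A minimal sketch of the same decode loop, parameterized by output size (`CMapUnpackSketch` is an illustrative name, not part of the generated scanner):

```java
// Run-length decoding as used by zzUnpackCMap above: each (count, value)
// char pair in the packed string expands to `count` copies of `value`.
class CMapUnpackSketch {
    static char[] unpack(String packed, int size) {
        char[] map = new char[size];
        int i = 0;  // index into the packed string
        int j = 0;  // index into the unpacked map
        while (i < packed.length()) {
            int count = packed.charAt(i++);
            char value = packed.charAt(i++);
            do { map[j++] = value; } while (--count > 0);
        }
        return map;
    }
}
```

The diff's change from `new char[0x10000]` to `new char[0x110000]` reflects the move from a char-indexed to a code-point-indexed map, covering all 17 Unicode planes.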
@@ -419,6 +506,8 @@
/* first: make room (if you can) */
if (zzStartRead > 0) {
+ zzEndRead += zzFinalHighSurrogate;
+ zzFinalHighSurrogate = 0;
System.arraycopy(zzBuffer, zzStartRead,
zzBuffer, 0,
zzEndRead-zzStartRead);
@@ -427,29 +516,35 @@
zzEndRead-= zzStartRead;
zzCurrentPos-= zzStartRead;
zzMarkedPos-= zzStartRead;
- zzPushbackPos-= zzStartRead;
zzStartRead = 0;
}
- /* is the buffer big enough? */
- if (zzCurrentPos >= zzBuffer.length) {
- /* if not: blow it up */
- char newBuffer[] = new char[zzCurrentPos*2];
- System.arraycopy(zzBuffer, 0, newBuffer, 0, zzBuffer.length);
- zzBuffer = newBuffer;
- }
- /* finally: fill the buffer with new input */
- int numRead = zzReader.read(zzBuffer, zzEndRead,
- zzBuffer.length-zzEndRead);
-
- if (numRead < 0) {
- return true;
+ /* fill the buffer with new input */
+ int requested = zzBuffer.length - zzEndRead - zzFinalHighSurrogate;
+ int totalRead = 0;
+ while (totalRead < requested) {
+ int numRead = zzReader.read(zzBuffer, zzEndRead + totalRead, requested - totalRead);
+ if (numRead == -1) {
+ break;
+ }
+ totalRead += numRead;
}
- else {
- zzEndRead+= numRead;
+
+ if (totalRead > 0) {
+ zzEndRead += totalRead;
+ if (totalRead == requested) { /* possibly more input available */
+ if (Character.isHighSurrogate(zzBuffer[zzEndRead - 1])) {
+ --zzEndRead;
+ zzFinalHighSurrogate = 1;
+ if (totalRead == 1) { return true; }
+ }
+ }
return false;
}
+
+ // totalRead = 0: End of stream
+ return true;
}
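The zzFinalHighSurrogate logic added to zzRefill above guards against a read that ends exactly on a high (lead) surrogate: that char is held back so the scan loop's `Character.codePointAt` never sees half a surrogate pair. A simplified sketch of just that guard (`SurrogateGuardSketch` is my name for illustration):

```java
// Sketch of the trailing-surrogate guard in zzRefill above: if the filled
// region ends on a high surrogate, defer that char to the next refill so
// code-point reads always see complete surrogate pairs.
class SurrogateGuardSketch {
    // Returns how many chars of the filled region are safe to scan now.
    static int safeLength(char[] buffer, int filled) {
        if (filled > 0 && Character.isHighSurrogate(buffer[filled - 1])) {
            return filled - 1;  // hold the lead surrogate back
        }
        return filled;
    }
}
```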
@@ -473,16 +568,22 @@
* cannot be reused (internal buffer is discarded and lost).
* Lexical state is set to ZZ_INITIAL .
*
+ * Internal scan buffer is resized down to its initial length, if it has grown.
+ *
* @param reader the new input stream
*/
public final void yyreset(java.io.Reader reader) {
zzReader = reader;
zzAtBOL = true;
zzAtEOF = false;
+ zzEOFDone = false;
zzEndRead = zzStartRead = 0;
- zzCurrentPos = zzMarkedPos = zzPushbackPos = 0;
+ zzCurrentPos = zzMarkedPos = 0;
+ zzFinalHighSurrogate = 0;
yyline = yychar = yycolumn = 0;
zzLexicalState = YYINITIAL;
+ if (zzBuffer.length > ZZ_BUFFERSIZE)
+ zzBuffer = new char[ZZ_BUFFERSIZE];
}
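The yyreset hunk above adds a shrink-on-reset step: a scan buffer that grew to hold one oversized token is trimmed back to ZZ_BUFFERSIZE so a reused tokenizer does not pin a large array between documents. A hypothetical sketch of that policy in isolation (class and method names are mine):

```java
// Sketch of the shrink-on-reset policy added to yyreset above: grow the
// buffer on demand, trim it back to the initial size on reset.
class ShrinkOnResetSketch {
    private static final int INITIAL = 255;  // mirrors ZZ_BUFFERSIZE
    private char[] buffer = new char[INITIAL];

    void grow(int needed) {
        if (needed > buffer.length) buffer = new char[needed];
    }

    void reset() {
        if (buffer.length > INITIAL) buffer = new char[INITIAL];
    }

    int capacity() { return buffer.length; }
}
```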
@@ -610,14 +711,22 @@
zzCurrentPosL = zzCurrentPos = zzStartRead = zzMarkedPosL;
- zzState = zzLexicalState;
+ zzState = ZZ_LEXSTATE[zzLexicalState];
+ // set up zzAction for empty match case:
+ int zzAttributes = zzAttrL[zzState];
+ if ( (zzAttributes & 1) == 1 ) {
+ zzAction = zzState;
+ }
+
zzForAction: {
while (true) {
- if (zzCurrentPosL < zzEndReadL)
- zzInput = zzBufferL[zzCurrentPosL++];
+ if (zzCurrentPosL < zzEndReadL) {
+ zzInput = Character.codePointAt(zzBufferL, zzCurrentPosL, zzEndReadL);
+ zzCurrentPosL += Character.charCount(zzInput);
+ }
else if (zzAtEOF) {
zzInput = YYEOF;
break zzForAction;
@@ -637,14 +746,15 @@
break zzForAction;
}
else {
- zzInput = zzBufferL[zzCurrentPosL++];
+ zzInput = Character.codePointAt(zzBufferL, zzCurrentPosL, zzEndReadL);
+ zzCurrentPosL += Character.charCount(zzInput);
}
}
int zzNext = zzTransL[ zzRowMapL[zzState] + zzCMapL[zzInput] ];
if (zzNext == -1) break zzForAction;
zzState = zzNext;
- int zzAttributes = zzAttrL[zzState];
+ zzAttributes = zzAttrL[zzState];
if ( (zzAttributes & 1) == 1 ) {
zzAction = zzState;
zzMarkedPosL = zzCurrentPosL;
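The scan-loop hunk above replaces per-char reads with code-point reads: `Character.codePointAt` decodes a full code point (one or two chars) and `Character.charCount` advances the cursor accordingly. A self-contained sketch of the same iteration pattern (`CodePointIterSketch` is an illustrative name):

```java
// The code-point iteration pattern used in the scan loop above:
// codePointAt reads a full code point, charCount advances by 1 or 2.
class CodePointIterSketch {
    static int countCodePoints(char[] buffer, int end) {
        int n = 0;
        for (int pos = 0; pos < end; ) {
            int cp = Character.codePointAt(buffer, pos, end);
            pos += Character.charCount(cp);  // 2 for supplementary chars
            n++;
        }
        return n;
    }
}
```

Paired with the `0x110000`-entry character map elsewhere in this diff, this is what lets the generated DFA classify supplementary-plane characters directly.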
@@ -658,50 +768,44 @@
zzMarkedPos = zzMarkedPosL;
switch (zzAction < 0 ? zzAction : ZZ_ACTION[zzAction]) {
- case 4:
- { return HOST;
+ case 1:
+ { /* Break so we don't hit fall-through warning: */ break; /* Not numeric, word, ideographic, hiragana, or SE Asian -- ignore it. */
}
+ case 9: break;
+ case 2:
+ { return WORD_TYPE;
+ }
+ case 10: break;
+ case 3:
+ { return HANGUL_TYPE;
+ }
case 11: break;
- case 9:
- { return ACRONYM;
+ case 4:
+ { return NUMERIC_TYPE;
}
case 12: break;
- case 8:
- { return ACRONYM_DEP;
+ case 5:
+ { return KATAKANA_TYPE;
}
case 13: break;
- case 1:
- { /* ignore */
+ case 6:
+ { return IDEOGRAPHIC_TYPE;
}
case 14: break;
- case 5:
- { return NUM;
+ case 7:
+ { return HIRAGANA_TYPE;
}
case 15: break;
- case 3:
- { return CJ;
+ case 8:
+ { return SOUTH_EAST_ASIAN_TYPE;
}
case 16: break;
- case 2:
- { return ALPHANUM;
- }
- case 17: break;
- case 7:
- { return COMPANY;
- }
- case 18: break;
- case 6:
- { return APOSTROPHE;
- }
- case 19: break;
- case 10:
- { return EMAIL;
- }
- case 20: break;
default:
if (zzInput == YYEOF && zzStartRead == zzCurrentPos) {
zzAtEOF = true;
- return YYEOF;
+ {
+ return StandardTokenizerInterface.YYEOF;
+ }
}
else {
zzScanError(ZZ_NO_MATCH);
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex 17 Aug 2012 14:55:14 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex 16 Dec 2014 11:32:10 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.analysis.standard;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,120 +17,186 @@
* limitations under the License.
*/
-/*
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
-NOTE: if you change StandardTokenizerImpl.jflex and need to regenerate
- the tokenizer, remember to use JRE 1.4 to run jflex (before
- Lucene 3.0). This grammar now uses constructs (eg :digit:,
- :letter:) whose meaning can vary according to the JRE used to
- run jflex. See
- https://issues.apache.org/jira/browse/LUCENE-1126 for details.
-
-*/
-
-import org.apache.lucene.analysis.Token;
-
+/**
+ * This class implements Word Break rules from the Unicode Text Segmentation
+ * algorithm, as specified in
+ * Unicode Standard Annex #29.
+ *
+ * Tokens produced are of the following types:
+ *
+ * <ALPHANUM>: A sequence of alphabetic and numeric characters
+ * <NUM>: A number
+ * <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast
+ * Asian languages, including Thai, Lao, Myanmar, and Khmer
+ * <IDEOGRAPHIC>: A single CJKV ideographic character
+ * <HIRAGANA>: A single hiragana character
+ * <KATAKANA>: A sequence of katakana characters
+ * <HANGUL>: A sequence of Hangul characters
+ *
+ */
%%
-%class StandardTokenizerImpl
-%unicode
+%unicode 6.3
%integer
+%final
+%public
+%class StandardTokenizerImpl
+%implements StandardTokenizerInterface
%function getNextToken
-%pack
%char
+%buffer 255
+// UAX#29 WB4. X (Extend | Format)* --> X
+//
+HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
+HebrewOrALetterEx = [\p{WB:HebrewLetter}\p{WB:ALetter}] [\p{WB:Format}\p{WB:Extend}]*
+NumericEx = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]] [\p{WB:Format}\p{WB:Extend}]*
+KatakanaEx = \p{WB:Katakana} [\p{WB:Format}\p{WB:Extend}]*
+MidLetterEx = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}] [\p{WB:Format}\p{WB:Extend}]*
+MidNumericEx = [\p{WB:MidNum}\p{WB:MidNumLet}\p{WB:SingleQuote}] [\p{WB:Format}\p{WB:Extend}]*
+ExtendNumLetEx = \p{WB:ExtendNumLet} [\p{WB:Format}\p{WB:Extend}]*
+HanEx = \p{Script:Han} [\p{WB:Format}\p{WB:Extend}]*
+HiraganaEx = \p{Script:Hiragana} [\p{WB:Format}\p{WB:Extend}]*
+SingleQuoteEx = \p{WB:Single_Quote} [\p{WB:Format}\p{WB:Extend}]*
+DoubleQuoteEx = \p{WB:Double_Quote} [\p{WB:Format}\p{WB:Extend}]*
+HebrewLetterEx = \p{WB:Hebrew_Letter} [\p{WB:Format}\p{WB:Extend}]*
+RegionalIndicatorEx = \p{WB:RegionalIndicator} [\p{WB:Format}\p{WB:Extend}]*
+ComplexContextEx = \p{LB:Complex_Context} [\p{WB:Format}\p{WB:Extend}]*
+
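The `*Ex` macros above all apply UAX#29 rule WB4, `X (Extend | Format)* --> X`: trailing Format and Extend characters are absorbed into the preceding grapheme class. A simplified Java-regex illustration of the same absorption, using `\p{Mn}` (combining marks) as a stand-in because `java.util.regex` cannot name the `WB:Extend` property directly:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified analogue of the WB4 macros above: a base letter absorbs any
// trailing combining marks, so "e" + U+0301 matches as one unit, not two.
class Wb4Sketch {
    static final Pattern LETTER_EX = Pattern.compile("\\p{L}\\p{Mn}*");

    static int unitCount(String s) {
        Matcher m = LETTER_EX.matcher(s);
        int n = 0;
        while (m.find()) n++;
        return n;
    }
}
```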
%{
+ /** Alphanumeric sequences */
+ public static final int WORD_TYPE = StandardTokenizer.ALPHANUM;
+
+ /** Numbers */
+ public static final int NUMERIC_TYPE = StandardTokenizer.NUM;
+
+ /**
+ * Chars in class \p{Line_Break = Complex_Context} are from South East Asian
+ * scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
+ * together as a single token rather than broken up, because the logic
+ * required to break them at word boundaries is too complex for UAX#29.
+ *
+ * See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
+ */
+ public static final int SOUTH_EAST_ASIAN_TYPE = StandardTokenizer.SOUTHEAST_ASIAN;
+
+ public static final int IDEOGRAPHIC_TYPE = StandardTokenizer.IDEOGRAPHIC;
+
+ public static final int HIRAGANA_TYPE = StandardTokenizer.HIRAGANA;
+
+ public static final int KATAKANA_TYPE = StandardTokenizer.KATAKANA;
+
+ public static final int HANGUL_TYPE = StandardTokenizer.HANGUL;
-public static final int ALPHANUM = StandardTokenizer.ALPHANUM;
-public static final int APOSTROPHE = StandardTokenizer.APOSTROPHE;
-public static final int ACRONYM = StandardTokenizer.ACRONYM;
-public static final int COMPANY = StandardTokenizer.COMPANY;
-public static final int EMAIL = StandardTokenizer.EMAIL;
-public static final int HOST = StandardTokenizer.HOST;
-public static final int NUM = StandardTokenizer.NUM;
-public static final int CJ = StandardTokenizer.CJ;
-/**
- * @deprecated this solves a bug where HOSTs that end with '.' are identified
- * as ACRONYMs. It is deprecated and will be removed in the next
- * release.
- */
-public static final int ACRONYM_DEP = StandardTokenizer.ACRONYM_DEP;
-
-public static final String [] TOKEN_TYPES = StandardTokenizer.TOKEN_TYPES;
-
-public final int yychar()
-{
+ public final int yychar()
+ {
return yychar;
-}
+ }
-/**
- * Fills Lucene token with the current token text.
- */
-final void getText(Token t) {
- t.setTermBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
-}
+ /**
+ * Fills CharTermAttribute with the current token text.
+ */
+ public final void getText(CharTermAttribute t) {
+ t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
+ }
+
+ /**
+ * Sets the scanner buffer size in chars
+ */
+ public final void setBufferSize(int numChars) {
+ ZZ_BUFFERSIZE = numChars;
+ char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
+ System.arraycopy(zzBuffer, 0, newZzBuffer, 0, Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
+ zzBuffer = newZzBuffer;
+ }
%}
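As an aside, the `setBufferSize` helper above uses the standard copy-on-resize idiom: allocate the new array, then copy the overlapping prefix. A minimal standalone sketch of that idiom (class and method names here are illustrative, not part of the generated scanner):

```java
// Standalone sketch of the copy-on-resize idiom used by setBufferSize above:
// allocate the new buffer, then copy over whatever fits.
public class BufferResize {
  public static char[] resize(char[] buffer, int newSize) {
    char[] newBuffer = new char[newSize];
    // Copy the overlapping prefix; truncates when shrinking, zero-pads when growing.
    System.arraycopy(buffer, 0, newBuffer, 0, Math.min(buffer.length, newSize));
    return newBuffer;
  }
}
```

Note that shrinking silently truncates buffered characters, which is why the scanner only honors `setBufferSize` between tokens.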
-THAI = [\u0E00-\u0E59]
+%%
-// basic word: a sequence of digits & letters (includes Thai to enable ThaiAnalyzer to function)
-ALPHANUM = ({LETTER}|{THAI}|[:digit:])+
+// UAX#29 WB1. sot ÷
+// WB2. ÷ eot
+//
+<<EOF>> { return StandardTokenizerInterface.YYEOF; }
-// internal apostrophes: O'Reilly, you're, O'Reilly's
-// use a post-filter to remove possesives
-APOSTROPHE = {ALPHA} ("'" {ALPHA})+
+// UAX#29 WB8. Numeric × Numeric
+// WB11. Numeric (MidNum | MidNumLet | Single_Quote) × Numeric
+// WB12. Numeric × (MidNum | MidNumLet | Single_Quote) Numeric
+// WB13a. (ALetter | Hebrew_Letter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet
+// WB13b. ExtendNumLet × (ALetter | Hebrew_Letter | Numeric | Katakana)
+//
+{ExtendNumLetEx}* {NumericEx} ( ( {ExtendNumLetEx}* | {MidNumericEx} ) {NumericEx} )* {ExtendNumLetEx}*
+ { return NUMERIC_TYPE; }
-// acronyms: U.S.A., I.B.M., etc.
-// use a post-filter to remove dots
-ACRONYM = {LETTER} "." ({LETTER} ".")+
+// subset of the below for typing purposes only!
+{HangulEx}+
+ { return HANGUL_TYPE; }
+
+{KatakanaEx}+
+ { return KATAKANA_TYPE; }
-ACRONYM_DEP = {ALPHANUM} "." ({ALPHANUM} ".")+
+// UAX#29 WB5. (ALetter | Hebrew_Letter) × (ALetter | Hebrew_Letter)
+// WB6. (ALetter | Hebrew_Letter) × (MidLetter | MidNumLet | Single_Quote) (ALetter | Hebrew_Letter)
+// WB7. (ALetter | Hebrew_Letter) (MidLetter | MidNumLet | Single_Quote) × (ALetter | Hebrew_Letter)
+// WB7a. Hebrew_Letter × Single_Quote
+// WB7b. Hebrew_Letter × Double_Quote Hebrew_Letter
+// WB7c. Hebrew_Letter Double_Quote × Hebrew_Letter
+// WB9. (ALetter | Hebrew_Letter) × Numeric
+// WB10. Numeric × (ALetter | Hebrew_Letter)
+// WB13. Katakana × Katakana
+// WB13a. (ALetter | Hebrew_Letter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet
+// WB13b. ExtendNumLet × (ALetter | Hebrew_Letter | Numeric | Katakana)
+//
+{ExtendNumLetEx}* ( {KatakanaEx} ( {ExtendNumLetEx}* {KatakanaEx} )*
+ | ( {HebrewLetterEx} ( {SingleQuoteEx} | {DoubleQuoteEx} {HebrewLetterEx} )
+ | {NumericEx} ( ( {ExtendNumLetEx}* | {MidNumericEx} ) {NumericEx} )*
+ | {HebrewOrALetterEx} ( ( {ExtendNumLetEx}* | {MidLetterEx} ) {HebrewOrALetterEx} )*
+ )+
+ )
+({ExtendNumLetEx}+ ( {KatakanaEx} ( {ExtendNumLetEx}* {KatakanaEx} )*
+ | ( {HebrewLetterEx} ( {SingleQuoteEx} | {DoubleQuoteEx} {HebrewLetterEx} )
+ | {NumericEx} ( ( {ExtendNumLetEx}* | {MidNumericEx} ) {NumericEx} )*
+ | {HebrewOrALetterEx} ( ( {ExtendNumLetEx}* | {MidLetterEx} ) {HebrewOrALetterEx} )*
+ )+
+ )
+)*
+{ExtendNumLetEx}*
+ { return WORD_TYPE; }
-// company names like AT&T and Excite@Home.
-COMPANY = {ALPHA} ("&"|"@") {ALPHA}
-// email addresses
-EMAIL = {ALPHANUM} (("."|"-"|"_") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+
+// From UAX #29:
+//
+// [C]haracters with the Line_Break property values of Contingent_Break (CB),
+// Complex_Context (SA/South East Asian), and XX (Unknown) are assigned word
+// boundary property values based on criteria outside of the scope of this
+// annex. That means that satisfactory treatment of languages like Chinese
+// or Thai requires special handling.
+//
+// In Unicode 6.3, only one character has the \p{Line_Break = Contingent_Break}
+// property: U+FFFC OBJECT REPLACEMENT CHARACTER.
+//
+// In the ICU implementation of UAX#29, \p{Line_Break = Complex_Context}
+// character sequences (from South East Asian scripts like Thai, Myanmar, Khmer,
+// Lao, etc.) are kept together. This grammar does the same below.
+//
+// See also the Unicode Line Breaking Algorithm:
+//
+// http://www.unicode.org/reports/tr14/#SA
+//
+{ComplexContextEx}+ { return SOUTH_EAST_ASIAN_TYPE; }
-// hostname
-HOST = {ALPHANUM} ((".") {ALPHANUM})+
+// UAX#29 WB14. Any ÷ Any
+//
+{HanEx} { return IDEOGRAPHIC_TYPE; }
+{HiraganaEx} { return HIRAGANA_TYPE; }
-// floating point, serial, model numbers, ip addresses, etc.
-// every other segment must have at least one digit
-NUM = ({ALPHANUM} {P} {HAS_DIGIT}
- | {HAS_DIGIT} {P} {ALPHANUM}
- | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
- | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
- | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
- | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
-// punctuation
-P = ("_"|"-"|"/"|"."|",")
-
-// at least one digit
-HAS_DIGIT = ({LETTER}|[:digit:])* [:digit:] ({LETTER}|[:digit:])*
-
-ALPHA = ({LETTER})+
-
-// From the JFlex manual: "the expression that matches everything of <a> not matched by <b> is !(!<a>|<b>)"
-LETTER = !(![:letter:]|{CJ})
-
-// Chinese and Japanese (but NOT Korean, which is included in [:letter:])
-CJ = [\u3100-\u312f\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF\u3300-\u337f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff65-\uff9f]
-
-WHITESPACE = \r\n | [ \r\n\t\f]
-
-%%
-
-{ALPHANUM} { return ALPHANUM; }
-{APOSTROPHE} { return APOSTROPHE; }
-{ACRONYM} { return ACRONYM; }
-{COMPANY} { return COMPANY; }
-{EMAIL} { return EMAIL; }
-{HOST} { return HOST; }
-{NUM} { return NUM; }
-{CJ} { return CJ; }
-{ACRONYM_DEP} { return ACRONYM_DEP; }
-
-/** Ignore the rest */
-. | {WHITESPACE} { /* ignore */ }
+// UAX#29 WB3. CR × LF
+// WB3a. (Newline | CR | LF) ÷
+// WB3b. ÷ (Newline | CR | LF)
+// WB13c. Regional_Indicator × Regional_Indicator
+// WB14. Any ÷ Any
+//
+{RegionalIndicatorEx} {RegionalIndicatorEx}+ | [^]
+ { /* Break so we don't hit fall-through warning: */ break; /* Not numeric, word, ideographic, hiragana, or SE Asian -- ignore it. */ }
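The grammar above hand-encodes the UAX#29 word break rules (WB3 through WB14). As an illustrative aside, not Lucene's scanner, the JDK's `java.text.BreakIterator` word instance implements the same UAX#29 segmentation, so a small self-contained sketch can show the behavior the rules produce, e.g. WB11/WB12 keeping a decimal number like "3.14" together as one numeric token (class and method names here are hypothetical):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: java.text.BreakIterator also implements UAX#29 word
// segmentation, so it can be used to observe the effect of rules like
// WB11/WB12 (MidNum keeps "3.14" together) that the grammar above encodes.
public class WordBreakDemo {
  public static List<String> tokenize(String text) {
    BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
    wb.setText(text);
    List<String> tokens = new ArrayList<>();
    int start = wb.first();
    for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
      String segment = text.substring(start, end);
      // Keep only segments containing a letter or digit, mirroring the
      // grammar's final rule that silently discards everything else.
      if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
        tokens.add(segment);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("Pi is roughly 3.14"));
  }
}
```

BreakIterator reports every boundary, including around whitespace and punctuation, which is why the filter step is needed; the JFlex grammar instead matches only the token-forming rules and drops the rest in its catch-all rule.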
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/StandardTokenizerInterface.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/UAX29URLEmailAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Index: 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/package.html
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/analysis/standard/package.html,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/package.html 17 Aug 2012 14:55:14 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/analysis/standard/package.html 16 Dec 2014 11:32:10 -0000 1.1.2.1
@@ -17,10 +17,53 @@
-->
-
-
+
-A fast grammar-based tokenizer constructed with JFlex.
+
+Fast, general-purpose grammar-based tokenizers.
+
+The <code>org.apache.lucene.analysis.standard</code> package contains three
+ fast grammar-based tokenizers constructed with JFlex:
+
+ {@link org.apache.lucene.analysis.standard.StandardTokenizer}:
+ as of Lucene 3.1, implements the Word Break rules from the Unicode Text
+ Segmentation algorithm, as specified in
+ <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.
+ Unlike <code>UAX29URLEmailTokenizer</code>, URLs and email addresses are
+ not tokenized as single tokens, but are instead split up into
+ tokens according to the UAX#29 word break rules.
+
+ {@link org.apache.lucene.analysis.standard.StandardAnalyzer StandardAnalyzer} includes
+ {@link org.apache.lucene.analysis.standard.StandardTokenizer StandardTokenizer},
+ {@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter},
+ {@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter}
+ and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
+ When the <code>Version</code> specified in the constructor is lower than
+ 3.1, the {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer}
+ implementation is invoked.
+ {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer}:
+ this class was formerly (prior to Lucene 3.1) named
+ <code>StandardTokenizer</code>. (Its tokenization rules are not
+ based on the Unicode Text Segmentation algorithm.)
+ {@link org.apache.lucene.analysis.standard.ClassicAnalyzer ClassicAnalyzer} includes
+ {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer},
+ {@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter},
+ {@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter}
+ and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
+
+ {@link org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer}:
+ implements the Word Break rules from the Unicode Text Segmentation
+ algorithm, as specified in
+ <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.
+ URLs and email addresses are also tokenized according to the relevant RFCs.
+
+ {@link org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer UAX29URLEmailAnalyzer} includes
+ {@link org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer},
+ {@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter},
+ {@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter}
+ and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
+
+
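The package documentation above describes each analyzer as a fixed chain: a tokenizer followed by StandardFilter, LowerCaseFilter, and StopFilter. A hedged sketch of that pipeline shape, with no Lucene dependencies and purely hypothetical class and method names, might look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the analyzer chains described above: tokens from a
// tokenizer flow through a lowercasing stage and then a stop-word stage.
// The names here are illustrative, not Lucene's actual API.
public class MiniAnalyzer {
  private final Set<String> stopWords;

  public MiniAnalyzer(Set<String> stopWords) {
    this.stopWords = stopWords;
  }

  public List<String> analyze(List<String> tokens) {
    List<String> out = new ArrayList<>();
    for (String token : tokens) {
      String lowered = token.toLowerCase(Locale.ROOT); // LowerCaseFilter stage
      if (!stopWords.contains(lowered)) {              // StopFilter stage
        out.add(lowered);
      }
    }
    return out;
  }
}
```

In Lucene itself the stages are streaming TokenFilters chained over a Tokenizer rather than list transformations, but the order of operations (tokenize, normalize case, then drop stop words) is the same as described above.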
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std31/ASCIITLD.jflex-macro'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std31/SUPPLEMENTARY.jflex-macro'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std31/StandardTokenizerImpl31.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std31/StandardTokenizerImpl31.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std31/UAX29URLEmailTokenizerImpl31.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std31/UAX29URLEmailTokenizerImpl31.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std31/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std34/ASCIITLD.jflex-macro'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std34/SUPPLEMENTARY.jflex-macro'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std34/StandardTokenizerImpl34.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std34/StandardTokenizerImpl34.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std34/UAX29URLEmailTokenizerImpl34.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std34/UAX29URLEmailTokenizerImpl34.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std34/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std36/ASCIITLD.jflex-macro'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std36/SUPPLEMENTARY.jflex-macro'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std36/UAX29URLEmailTokenizerImpl36.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std36/UAX29URLEmailTokenizerImpl36.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std36/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std40/ASCIITLD.jflex-macro'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std40/SUPPLEMENTARY.jflex-macro'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std40/StandardTokenizerImpl40.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std40/StandardTokenizerImpl40.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std40/UAX29URLEmailTokenizerImpl40.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std40/UAX29URLEmailTokenizerImpl40.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/standard/std40/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sv/SwedishAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sv/SwedishLightStemFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sv/SwedishLightStemFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sv/SwedishLightStemmer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/sv/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/FSTSynonymFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/SlowSynonymFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/SlowSynonymFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/SlowSynonymMap.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/SolrSynonymParser.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/SynonymFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/SynonymFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/SynonymMap.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/WordnetSynonymParser.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/synonym/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/th/ThaiAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/th/ThaiTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/th/ThaiTokenizerFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/th/ThaiWordFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/th/ThaiWordFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/th/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/CharTermAttribute.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/CharTermAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/FlagsAttribute.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/FlagsAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/KeywordAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/OffsetAttribute.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/OffsetAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/PackedTokenAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/PayloadAttribute.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/PayloadAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/PositionLengthAttribute.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/PositionLengthAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/TermToBytesRefAttribute.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/TypeAttribute.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/TypeAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tokenattributes/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tr/ApostropheFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tr/ApostropheFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tr/TurkishAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tr/TurkishLowerCaseFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/tr/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/AbstractAnalysisFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/AnalysisSPILoader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/CharArrayIterator.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/CharArrayMap.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/CharArraySet.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/CharFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/CharTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/CharacterUtils.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/ClasspathResourceLoader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/ElisionFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/ElisionFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/FilesystemResourceLoader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/FilteringTokenFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/MultiTermAwareComponent.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/OpenStringBuilder.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/ResourceLoader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/ResourceLoaderAware.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/RollingCharBuffer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/SegmentingTokenizerBase.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/StemmerUtil.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/StopwordAnalyzerBase.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/TokenFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/TokenizerFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/WordlistLoader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/util/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/wikipedia/WikipediaTokenizer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/wikipedia/WikipediaTokenizerFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/wikipedia/WikipediaTokenizerImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/wikipedia/WikipediaTokenizerImpl.jflex'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/analysis/wikipedia/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/BlockTermState.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/Codec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/CodecUtil.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/DocValuesConsumer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/DocValuesFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/DocValuesProducer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/FieldInfosFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/FieldInfosReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/FieldInfosWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/FieldsConsumer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/FieldsProducer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/FilterCodec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/LiveDocsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/MappingMultiDocsAndPositionsEnum.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/MappingMultiDocsEnum.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/MultiLevelSkipListReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/MultiLevelSkipListWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/NormsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/PostingsBaseFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/PostingsConsumer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/PostingsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/PostingsReaderBase.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/PostingsWriterBase.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/SegmentInfoFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/SegmentInfoReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/SegmentInfoWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/StoredFieldsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/StoredFieldsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/StoredFieldsWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/TermStats.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/TermVectorsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/TermVectorsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/TermVectorsWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/TermsConsumer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/blocktree/BlockTreeTermsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/blocktree/FieldReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/blocktree/IntersectTermsEnum.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/blocktree/IntersectTermsEnumFrame.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/blocktree/SegmentTermsEnumFrame.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/blocktree/Stats.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/blocktree/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/CompressingStoredFieldsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/CompressingStoredFieldsIndexReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/CompressingStoredFieldsIndexWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/CompressingStoredFieldsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/CompressingStoredFieldsWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/CompressingTermVectorsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/CompressingTermVectorsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/CompressingTermVectorsWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/CompressionMode.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/Compressor.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/Decompressor.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/LZ4.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/compressing/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xCodec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xFieldInfosFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xFieldInfosReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xFields.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xNormsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xNormsProducer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xPostingsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xSegmentInfoFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xSegmentInfoReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xSkipListReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xStoredFieldsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xStoredFieldsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xTermVectorsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/Lucene3xTermVectorsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/SegmentTermDocs.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/SegmentTermEnum.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/SegmentTermPositions.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/TermBuffer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/TermInfo.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/TermInfosReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/TermInfosReaderIndex.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene3x/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/BitVector.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40Codec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40DocValuesFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40DocValuesReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40FieldInfosFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40FieldInfosReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40LiveDocsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40NormsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40PostingsBaseFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40PostingsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40SegmentInfoFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40SegmentInfoReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40SegmentInfoWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40SkipListReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40StoredFieldsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40StoredFieldsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40StoredFieldsWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40TermVectorsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40TermVectorsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/Lucene40TermVectorsWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene40/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/ForUtil.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/Lucene41Codec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/Lucene41PostingsBaseFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/Lucene41PostingsReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/Lucene41PostingsWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/Lucene41SkipReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/Lucene41SkipWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/Lucene41StoredFieldsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene41/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene410/Lucene410Codec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene410/Lucene410DocValuesConsumer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene410/Lucene410DocValuesFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene410/Lucene410DocValuesProducer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene410/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene42/Lucene42Codec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene42/Lucene42DocValuesFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene42/Lucene42DocValuesProducer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene42/Lucene42FieldInfosFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene42/Lucene42FieldInfosReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene42/Lucene42NormsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene42/Lucene42TermVectorsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene42/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene45/Lucene45Codec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene45/Lucene45DocValuesConsumer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene45/Lucene45DocValuesFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene45/Lucene45DocValuesProducer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene45/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene46/Lucene46Codec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene46/Lucene46FieldInfosFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene46/Lucene46FieldInfosReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene46/Lucene46FieldInfosWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene46/Lucene46SegmentInfoFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene46/Lucene46SegmentInfoReader.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene46/Lucene46SegmentInfoWriter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene46/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene49/Lucene49Codec.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene49/Lucene49DocValuesConsumer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene49/Lucene49DocValuesFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene49/Lucene49DocValuesProducer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene49/Lucene49NormsConsumer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene49/Lucene49NormsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene49/Lucene49NormsProducer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/lucene49/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/perfield/PerFieldDocValuesFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/perfield/PerFieldPostingsFormat.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/codecs/perfield/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/collation/CollationAttributeFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/collation/CollationKeyAnalyzer.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/collation/CollationKeyFilter.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/collation/CollationKeyFilterFactory.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/collation/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/collation/tokenattributes/CollatedTermAttributeImpl.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/collation/tokenattributes/package.html'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/AbstractField.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/BinaryDocValuesField.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/ByteDocValuesField.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/CompressionTools.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/DateField.java'.
Index: 3rdParty_sources/lucene/org/apache/lucene/document/DateTools.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/document/DateTools.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/document/DateTools.java 17 Aug 2012 14:54:53 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/document/DateTools.java 16 Dec 2014 11:31:59 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.document;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,10 +17,16 @@
* limitations under the License.
*/
+import org.apache.lucene.search.NumericRangeQuery; // for javadocs
+import org.apache.lucene.search.PrefixQuery;
+import org.apache.lucene.search.TermRangeQuery;
+import org.apache.lucene.util.NumericUtils; // for javadocs
+
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
+import java.util.Locale;
import java.util.TimeZone;
/**
@@ -31,35 +37,40 @@
*
* This class also helps you to limit the resolution of your dates. Do not
* save dates with a finer resolution than you really need, as then
- * RangeQuery and PrefixQuery will require more memory and become slower.
+ * {@link TermRangeQuery} and {@link PrefixQuery} will require more memory and become slower.
*
- * <P>Compared to {@link DateField} the strings generated by the methods
- * in this class take slightly more space, unless your selected resolution
- * is set to <code>Resolution.DAY</code> or lower.
+ *
+ * Another approach is {@link NumericUtils}, which provides
+ * a sortable binary representation (prefix encoded) of numeric values, which
+ * date/time are.
+ * For indexing a {@link Date} or {@link Calendar}, just get the unix timestamp as
+ * <code>long</code> using {@link Date#getTime} or {@link Calendar#getTimeInMillis} and
+ * index this as a numeric value with {@link LongField}
+ * and use {@link NumericRangeQuery} to query it.
*/
public class DateTools {
- private final static TimeZone GMT = TimeZone.getTimeZone("GMT");
+ final static TimeZone GMT = TimeZone.getTimeZone("GMT");
- private static final SimpleDateFormat YEAR_FORMAT = new SimpleDateFormat("yyyy");
- private static final SimpleDateFormat MONTH_FORMAT = new SimpleDateFormat("yyyyMM");
- private static final SimpleDateFormat DAY_FORMAT = new SimpleDateFormat("yyyyMMdd");
- private static final SimpleDateFormat HOUR_FORMAT = new SimpleDateFormat("yyyyMMddHH");
- private static final SimpleDateFormat MINUTE_FORMAT = new SimpleDateFormat("yyyyMMddHHmm");
- private static final SimpleDateFormat SECOND_FORMAT = new SimpleDateFormat("yyyyMMddHHmmss");
- private static final SimpleDateFormat MILLISECOND_FORMAT = new SimpleDateFormat("yyyyMMddHHmmssSSS");
- static {
- // times need to be normalized so the value doesn't depend on the
- // location the index is created/used:
- YEAR_FORMAT.setTimeZone(GMT);
- MONTH_FORMAT.setTimeZone(GMT);
- DAY_FORMAT.setTimeZone(GMT);
- HOUR_FORMAT.setTimeZone(GMT);
- MINUTE_FORMAT.setTimeZone(GMT);
- SECOND_FORMAT.setTimeZone(GMT);
- MILLISECOND_FORMAT.setTimeZone(GMT);
- }
+ private static final ThreadLocal<Calendar> TL_CAL = new ThreadLocal<Calendar>() {
+ @Override
+ protected Calendar initialValue() {
+ return Calendar.getInstance(GMT, Locale.ROOT);
+ }
+ };
+ //indexed by format length
+ private static final ThreadLocal<SimpleDateFormat[]> TL_FORMATS = new ThreadLocal<SimpleDateFormat[]>() {
+ @Override
+ protected SimpleDateFormat[] initialValue() {
+ SimpleDateFormat[] arr = new SimpleDateFormat[Resolution.MILLISECOND.formatLen+1];
+ for (Resolution resolution : Resolution.values()) {
+ arr[resolution.formatLen] = (SimpleDateFormat)resolution.format.clone();
+ }
+ return arr;
+ }
+ };
+
// cannot create, the class has static methods only
private DateTools() {}
@@ -70,7 +81,7 @@
* @param resolution the desired resolution, see
* {@link #round(Date, DateTools.Resolution)}
* @return a string in format <code>yyyyMMddHHmmssSSS</code> or shorter,
- * depeding on <code>resolution</code>; using GMT as timezone
+ * depending on <code>resolution</code>; using GMT as timezone
*/
public static String dateToString(Date date, Resolution resolution) {
return timeToString(date.getTime(), resolution);
@@ -83,49 +94,11 @@
* @param resolution the desired resolution, see
* {@link #round(long, DateTools.Resolution)}
* @return a string in format <code>yyyyMMddHHmmssSSS</code> or shorter,
- * depeding on <code>resolution</code>; using GMT as timezone
+ * depending on <code>resolution</code>; using GMT as timezone
*/
public static String timeToString(long time, Resolution resolution) {
- Calendar cal = Calendar.getInstance(GMT);
-
- //protected in JDK's prior to 1.4
- //cal.setTimeInMillis(round(time, resolution));
-
- cal.setTime(new Date(round(time, resolution)));
-
- String result;
- if (resolution == Resolution.YEAR) {
- synchronized (YEAR_FORMAT) {
- result = YEAR_FORMAT.format(cal.getTime());
- }
- } else if (resolution == Resolution.MONTH) {
- synchronized (MONTH_FORMAT) {
- result = MONTH_FORMAT.format(cal.getTime());
- }
- } else if (resolution == Resolution.DAY) {
- synchronized (DAY_FORMAT) {
- result = DAY_FORMAT.format(cal.getTime());
- }
- } else if (resolution == Resolution.HOUR) {
- synchronized (HOUR_FORMAT) {
- result = HOUR_FORMAT.format(cal.getTime());
- }
- } else if (resolution == Resolution.MINUTE) {
- synchronized (MINUTE_FORMAT) {
- result = MINUTE_FORMAT.format(cal.getTime());
- }
- } else if (resolution == Resolution.SECOND) {
- synchronized (SECOND_FORMAT) {
- result = SECOND_FORMAT.format(cal.getTime());
- }
- } else if (resolution == Resolution.MILLISECOND) {
- synchronized (MILLISECOND_FORMAT) {
- result = MILLISECOND_FORMAT.format(cal.getTime());
- }
- } else {
- throw new IllegalArgumentException("unknown resolution " + resolution);
- }
- return result;
+ final Date date = new Date(round(time, resolution));
+ return TL_FORMATS.get()[resolution.formatLen].format(date);
}
/**
@@ -153,39 +126,11 @@
* expected format
*/
public static Date stringToDate(String dateString) throws ParseException {
- Date date;
- if (dateString.length() == 4) {
- synchronized (YEAR_FORMAT) {
- date = YEAR_FORMAT.parse(dateString);
- }
- } else if (dateString.length() == 6) {
- synchronized (MONTH_FORMAT) {
- date = MONTH_FORMAT.parse(dateString);
- }
- } else if (dateString.length() == 8) {
- synchronized (DAY_FORMAT) {
- date = DAY_FORMAT.parse(dateString);
- }
- } else if (dateString.length() == 10) {
- synchronized (HOUR_FORMAT) {
- date = HOUR_FORMAT.parse(dateString);
- }
- } else if (dateString.length() == 12) {
- synchronized (MINUTE_FORMAT) {
- date = MINUTE_FORMAT.parse(dateString);
- }
- } else if (dateString.length() == 14) {
- synchronized (SECOND_FORMAT) {
- date = SECOND_FORMAT.parse(dateString);
- }
- } else if (dateString.length() == 17) {
- synchronized (MILLISECOND_FORMAT) {
- date = MILLISECOND_FORMAT.parse(dateString);
- }
- } else {
- throw new ParseException("Input is not valid date string: " + dateString, 0);
+ try {
+ return TL_FORMATS.get()[dateString.length()].parse(dateString);
+ } catch (Exception e) {
+ throw new ParseException("Input is not a valid date string: " + dateString, 0);
}
- return date;
}
/**
@@ -211,71 +156,68 @@
* @return the date with all values more precise than resolution
* set to 0 or 1, expressed as milliseconds since January 1, 1970, 00:00:00 GMT
*/
+ @SuppressWarnings("fallthrough")
public static long round(long time, Resolution resolution) {
- Calendar cal = Calendar.getInstance(GMT);
-
- // protected in JDK's prior to 1.4
- //cal.setTimeInMillis(time);
+ final Calendar calInstance = TL_CAL.get();
+ calInstance.setTimeInMillis(time);
- cal.setTime(new Date(time));
-
- if (resolution == Resolution.YEAR) {
- cal.set(Calendar.MONTH, 0);
- cal.set(Calendar.DAY_OF_MONTH, 1);
- cal.set(Calendar.HOUR_OF_DAY, 0);
- cal.set(Calendar.MINUTE, 0);
- cal.set(Calendar.SECOND, 0);
- cal.set(Calendar.MILLISECOND, 0);
- } else if (resolution == Resolution.MONTH) {
- cal.set(Calendar.DAY_OF_MONTH, 1);
- cal.set(Calendar.HOUR_OF_DAY, 0);
- cal.set(Calendar.MINUTE, 0);
- cal.set(Calendar.SECOND, 0);
- cal.set(Calendar.MILLISECOND, 0);
- } else if (resolution == Resolution.DAY) {
- cal.set(Calendar.HOUR_OF_DAY, 0);
- cal.set(Calendar.MINUTE, 0);
- cal.set(Calendar.SECOND, 0);
- cal.set(Calendar.MILLISECOND, 0);
- } else if (resolution == Resolution.HOUR) {
- cal.set(Calendar.MINUTE, 0);
- cal.set(Calendar.SECOND, 0);
- cal.set(Calendar.MILLISECOND, 0);
- } else if (resolution == Resolution.MINUTE) {
- cal.set(Calendar.SECOND, 0);
- cal.set(Calendar.MILLISECOND, 0);
- } else if (resolution == Resolution.SECOND) {
- cal.set(Calendar.MILLISECOND, 0);
- } else if (resolution == Resolution.MILLISECOND) {
- // don't cut off anything
- } else {
- throw new IllegalArgumentException("unknown resolution " + resolution);
+ switch (resolution) {
+ //NOTE: switch statement fall-through is deliberate
+ case YEAR:
+ calInstance.set(Calendar.MONTH, 0);
+ case MONTH:
+ calInstance.set(Calendar.DAY_OF_MONTH, 1);
+ case DAY:
+ calInstance.set(Calendar.HOUR_OF_DAY, 0);
+ case HOUR:
+ calInstance.set(Calendar.MINUTE, 0);
+ case MINUTE:
+ calInstance.set(Calendar.SECOND, 0);
+ case SECOND:
+ calInstance.set(Calendar.MILLISECOND, 0);
+ case MILLISECOND:
+ // don't cut off anything
+ break;
+ default:
+ throw new IllegalArgumentException("unknown resolution " + resolution);
}
- return cal.getTime().getTime();
+ return calInstance.getTimeInMillis();
}
/** Specifies the time granularity. */
- public static class Resolution {
+ public static enum Resolution {
- public static final Resolution YEAR = new Resolution("year");
- public static final Resolution MONTH = new Resolution("month");
- public static final Resolution DAY = new Resolution("day");
- public static final Resolution HOUR = new Resolution("hour");
- public static final Resolution MINUTE = new Resolution("minute");
- public static final Resolution SECOND = new Resolution("second");
- public static final Resolution MILLISECOND = new Resolution("millisecond");
+ /** Limit a date's resolution to year granularity. */
+ YEAR(4),
+ /** Limit a date's resolution to month granularity. */
+ MONTH(6),
+ /** Limit a date's resolution to day granularity. */
+ DAY(8),
+ /** Limit a date's resolution to hour granularity. */
+ HOUR(10),
+ /** Limit a date's resolution to minute granularity. */
+ MINUTE(12),
+ /** Limit a date's resolution to second granularity. */
+ SECOND(14),
+ /** Limit a date's resolution to millisecond granularity. */
+ MILLISECOND(17);
- private String resolution;
+ final int formatLen;
+ final SimpleDateFormat format;//should be cloned before use, since it's not threadsafe
- private Resolution() {
+ Resolution(int formatLen) {
+ this.formatLen = formatLen;
+ // formatLen 10's place: 11111111
+ // formatLen 1's place: 12345678901234567
+ this.format = new SimpleDateFormat("yyyyMMddHHmmssSSS".substring(0,formatLen),Locale.ROOT);
+ this.format.setTimeZone(GMT);
}
-
- private Resolution(String resolution) {
- this.resolution = resolution;
- }
-
+
+ /** this method returns the name of the resolution
+ * in lowercase (for backwards compatibility) */
+ @Override
public String toString() {
- return resolution;
+ return super.toString().toLowerCase(Locale.ROOT);
}
}
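The rewritten `round` above leans on deliberate `switch` fall-through: each coarser resolution zeroes one calendar field and then falls into every finer case below it. A standalone sketch of that technique using only `java.util.Calendar` (the `RoundSketch` class name is illustrative, not part of Lucene):

```java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

// Standalone sketch of the DateTools.round() fall-through pattern:
// each case zeroes one field, then falls into the finer cases below it.
public class RoundSketch {
  enum Res { YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND }

  @SuppressWarnings("fallthrough")
  static long round(long time, Res res) {
    Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("GMT"), Locale.ROOT);
    cal.setTimeInMillis(time);
    switch (res) {
      // NOTE: fall-through is deliberate
      case YEAR:        cal.set(Calendar.MONTH, 0);
      case MONTH:       cal.set(Calendar.DAY_OF_MONTH, 1);
      case DAY:         cal.set(Calendar.HOUR_OF_DAY, 0);
      case HOUR:        cal.set(Calendar.MINUTE, 0);
      case MINUTE:      cal.set(Calendar.SECOND, 0);
      case SECOND:      cal.set(Calendar.MILLISECOND, 0);
      case MILLISECOND: break; // keep full precision
    }
    return cal.getTimeInMillis();
  }

  public static void main(String[] args) {
    long t = 1234567890123L;                  // 2009-02-13T23:31:30.123Z
    System.out.println(round(t, Res.SECOND)); // 1234567890000
    System.out.println(round(t, Res.DAY));    // 1234483200000
  }
}
```

Calendar arithmetic is timezone-sensitive, which is why both the patch and this sketch pin the calendar to GMT and `Locale.ROOT` so rounded values do not depend on where the index is built.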
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/DerefBytesDocValuesField.java'.
Index: 3rdParty_sources/lucene/org/apache/lucene/document/Document.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/document/Document.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/document/Document.java 17 Aug 2012 14:54:53 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/document/Document.java 16 Dec 2014 11:31:58 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.document;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,64 +17,39 @@
* limitations under the License.
*/
-import java.util.*; // for javadoc
-import org.apache.lucene.search.ScoreDoc; // for javadoc
-import org.apache.lucene.search.Searcher; // for javadoc
+import java.util.*;
+
import org.apache.lucene.index.IndexReader; // for javadoc
+import org.apache.lucene.index.IndexableField;
+import org.apache.lucene.search.IndexSearcher; // for javadoc
+import org.apache.lucene.search.ScoreDoc; // for javadoc
+import org.apache.lucene.util.BytesRef;
/** Documents are the unit of indexing and search.
*
* A Document is a set of fields. Each field has a name and a textual value.
- * A field may be {@link Fieldable#isStored() stored} with the document, in which
+ * A field may be {@link org.apache.lucene.index.IndexableFieldType#stored() stored} with the document, in which
* case it is returned with search hits on the document. Thus each document
* should typically contain one or more stored fields which uniquely identify
* it.
*
- * Note that fields which are not {@link Fieldable#isStored() stored} are
+ * <p>Note that fields which are not {@link org.apache.lucene.index.IndexableFieldType#stored() stored} are
* not available in documents retrieved from the index, e.g. with {@link
- * ScoreDoc#doc}, {@link Searcher#doc(int)} or {@link
- * IndexReader#document(int)}.
+ * ScoreDoc#doc} or {@link IndexReader#document(int)}.
*/
-public final class Document implements java.io.Serializable {
- List fields = new ArrayList();
- private float boost = 1.0f;
+public final class Document implements Iterable<IndexableField> {
+ private final List<IndexableField> fields = new ArrayList<>();
+
/** Constructs a new document with no fields. */
public Document() {}
-
- /** Sets a boost factor for hits on any field of this document. This value
- * will be multiplied into the score of all hits on this document.
- *
- * The default value is 1.0.
- *
- * <p>Values are multiplied into the value of {@link Fieldable#getBoost()} of
- * each field in this document. Thus, this method in effect sets a default
- * boost for the fields of this document.
- *
- * @see Fieldable#setBoost(float)
- */
- public void setBoost(float boost) {
- this.boost = boost;
+ @Override
+ public Iterator<IndexableField> iterator() {
+ return fields.iterator();
}
- /** Returns, at indexing time, the boost factor as set by {@link #setBoost(float)}.
- *
- * Note that once a document is indexed this value is no longer available
- * from the index. At search time, for retrieved documents, this method always
- * returns 1. This however does not mean that the boost value set at indexing
- * time was ignored - it was just combined with other indexing time factors and
- * stored elsewhere, for better indexing and search performance. (For more
- * information see the "norm(t,d)" part of the scoring formula in
- * {@link org.apache.lucene.search.Similarity Similarity}.)
- *
- * @see #setBoost(float)
- */
- public float getBoost() {
- return boost;
- }
-
/**
* <p>Adds a field to a document. Several fields may be added with
* the same name. In this case, if the fields are indexed, their text is
@@ -85,7 +60,7 @@
* a document has to be deleted from an index and a new changed version of that
* document has to be added.
*/
- public final void add(Fieldable field) {
+ public final void add(IndexableField field) {
fields.add(field);
}
@@ -100,9 +75,9 @@
* document has to be added.
*/
public final void removeField(String name) {
- Iterator it = fields.iterator();
+ Iterator<IndexableField> it = fields.iterator();
while (it.hasNext()) {
- Fieldable field = (Fieldable)it.next();
+ IndexableField field = it.next();
if (field.name().equals(name)) {
it.remove();
return;
@@ -120,210 +95,157 @@
* document has to be added.
*/
public final void removeFields(String name) {
- Iterator it = fields.iterator();
+ Iterator<IndexableField> it = fields.iterator();
while (it.hasNext()) {
- Fieldable field = (Fieldable)it.next();
+ IndexableField field = it.next();
if (field.name().equals(name)) {
it.remove();
}
}
}
- /** Returns a field with the given name if any exist in this document, or
- * null. If multiple fields exists with this name, this method returns the
- * first value added.
- * Do not use this method with lazy loaded fields.
- */
- public final Field getField(String name) {
- for (int i = 0; i < fields.size(); i++) {
- Field field = (Field)fields.get(i);
- if (field.name().equals(name))
- return field;
+
+ /**
+ * Returns an array of byte arrays for of the fields that have the name specified
+ * as the method parameter. This method returns an empty
+ * array when there are no matching fields. It never
+ * returns null.
+ *
+ * @param name the name of the field
+ * @return a <code>BytesRef[]</code> of binary field values
+ */
+ public final BytesRef[] getBinaryValues(String name) {
+ final List<BytesRef> result = new ArrayList<>();
+ for (IndexableField field : fields) {
+ if (field.name().equals(name)) {
+ final BytesRef bytes = field.binaryValue();
+ if (bytes != null) {
+ result.add(bytes);
+ }
+ }
}
+
+ return result.toArray(new BytesRef[result.size()]);
+ }
+
+ /**
+ * Returns an array of bytes for the first (or only) field that has the name
+ * specified as the method parameter. This method will return null
+ * if no binary fields with the specified name are available.
+ * There may be non-binary fields with the same name.
+ *
+ * @param name the name of the field.
+ * @return a <code>BytesRef</code> containing the binary field value or null
+ */
+ public final BytesRef getBinaryValue(String name) {
+ for (IndexableField field : fields) {
+ if (field.name().equals(name)) {
+ final BytesRef bytes = field.binaryValue();
+ if (bytes != null) {
+ return bytes;
+ }
+ }
+ }
return null;
}
-
- /** Returns a field with the given name if any exist in this document, or
+ /** Returns a field with the given name if any exist in this document, or
* null. If multiple fields exists with this name, this method returns the
* first value added.
*/
- public Fieldable getFieldable(String name) {
- for (int i = 0; i < fields.size(); i++) {
- Fieldable field = (Fieldable)fields.get(i);
- if (field.name().equals(name))
- return field;
- }
- return null;
- }
-
- /** Returns the string value of the field with the given name if any exist in
- * this document, or null. If multiple fields exist with this name, this
- * method returns the first value added. If only binary fields with this name
- * exist, returns null.
- */
- public final String get(String name) {
- for (int i = 0; i < fields.size(); i++) {
- Fieldable field = (Fieldable)fields.get(i);
- if (field.name().equals(name) && (!field.isBinary()))
- return field.stringValue();
+ public final IndexableField getField(String name) {
+ for (IndexableField field : fields) {
+ if (field.name().equals(name)) {
+ return field;
+ }
}
return null;
}
- /** Returns an Enumeration of all the fields in a document.
- * @deprecated use {@link #getFields()} instead
+ /**
+ * Returns an array of {@link IndexableField}s with the given name.
+ * This method returns an empty array when there are no
+ * matching fields. It never returns null.
+ *
+ * @param name the name of the field
+ * @return a <code>IndexableField[]</code> array
*/
- public final Enumeration fields() {
- return new Enumeration() {
- final Iterator iter = fields.iterator();
- public boolean hasMoreElements() {
- return iter.hasNext();
+ public IndexableField[] getFields(String name) {
+ List<IndexableField> result = new ArrayList<>();
+ for (IndexableField field : fields) {
+ if (field.name().equals(name)) {
+ result.add(field);
}
- public Object nextElement() {
- return iter.next();
- }
- };
- }
+ }
+ return result.toArray(new IndexableField[result.size()]);
+ }
+
/** Returns a List of all the fields in a document.
- * Note that fields which are not {@link Fieldable#isStored() stored} are
+ * <p>Note that fields which are not stored are
* not available in documents retrieved from the
- * index, e.g. {@link Searcher#doc(int)} or {@link
+ * index, e.g. {@link IndexSearcher#doc(int)} or {@link
* IndexReader#document(int)}.
*/
- public final List getFields() {
+ public final List<IndexableField> getFields() {
return fields;
}
-
- private final static Field[] NO_FIELDS = new Field[0];
- /**
- * Returns an array of {@link Field}s with the given name.
- * Do not use with lazy loaded fields.
- * This method returns an empty array when there are no
- * matching fields. It never returns null.
- *
- * @param name the name of the field
- * @return a <code>Field[]</code> array
- */
- public final Field[] getFields(String name) {
- List result = new ArrayList();
- for (int i = 0; i < fields.size(); i++) {
- Field field = (Field)fields.get(i);
- if (field.name().equals(name)) {
- result.add(field);
- }
- }
-
- if (result.size() == 0)
- return NO_FIELDS;
-
- return (Field[])result.toArray(new Field[result.size()]);
- }
-
-
- private final static Fieldable[] NO_FIELDABLES = new Fieldable[0];
-
- /**
- * Returns an array of {@link Fieldable}s with the given name.
- * This method returns an empty array when there are no
- * matching fields. It never returns null.
- *
- * @param name the name of the field
- * @return a <code>Fieldable[]</code> array
- */
- public Fieldable[] getFieldables(String name) {
- List result = new ArrayList();
- for (int i = 0; i < fields.size(); i++) {
- Fieldable field = (Fieldable)fields.get(i);
- if (field.name().equals(name)) {
- result.add(field);
- }
- }
-
- if (result.size() == 0)
- return NO_FIELDABLES;
-
- return (Fieldable[])result.toArray(new Fieldable[result.size()]);
- }
-
-
private final static String[] NO_STRINGS = new String[0];
/**
* Returns an array of values of the field specified as the method parameter.
* This method returns an empty array when there are no
* matching fields. It never returns null.
+ * For {@link IntField}, {@link LongField}, {@link
+ * FloatField} and {@link DoubleField} it returns the string value of the number. If you want
+ * the actual numeric field instances back, use {@link #getFields}.
* @param name the name of the field
* @return a <code>String[]</code> of field values
*/
public final String[] getValues(String name) {
- List result = new ArrayList();
- for (int i = 0; i < fields.size(); i++) {
- Fieldable field = (Fieldable)fields.get(i);
- if (field.name().equals(name) && (!field.isBinary()))
+ List<String> result = new ArrayList<>();
+ for (IndexableField field : fields) {
+ if (field.name().equals(name) && field.stringValue() != null) {
result.add(field.stringValue());
+ }
}
- if (result.size() == 0)
+ if (result.size() == 0) {
return NO_STRINGS;
+ }
- return (String[])result.toArray(new String[result.size()]);
+ return result.toArray(new String[result.size()]);
}
- private final static byte[][] NO_BYTES = new byte[0][];
-
- /**
- * Returns an array of byte arrays for of the fields that have the name specified
- * as the method parameter. This method returns an empty
- * array when there are no matching fields. It never
- * returns null.
- *
- * @param name the name of the field
- * @return a <code>byte[][]</code> of binary field values
- */
- public final byte[][] getBinaryValues(String name) {
- List result = new ArrayList();
- for (int i = 0; i < fields.size(); i++) {
- Fieldable field = (Fieldable)fields.get(i);
- if (field.name().equals(name) && (field.isBinary()))
- result.add(field.binaryValue());
+ /** Returns the string value of the field with the given name if any exist in
+ * this document, or null. If multiple fields exist with this name, this
+ * method returns the first value added. If only binary fields with this name
+ * exist, returns null.
+ * For {@link IntField}, {@link LongField}, {@link
+ * FloatField} and {@link DoubleField} it returns the string value of the number. If you want
+ * the actual numeric field instance back, use {@link #getField}.
+ */
+ public final String get(String name) {
+ for (IndexableField field : fields) {
+ if (field.name().equals(name) && field.stringValue() != null) {
+ return field.stringValue();
+ }
}
-
- if (result.size() == 0)
- return NO_BYTES;
-
- return (byte[][])result.toArray(new byte[result.size()][]);
- }
-
- /**
- * Returns an array of bytes for the first (or only) field that has the name
- * specified as the method parameter. This method will return null
- * if no binary fields with the specified name are available.
- * There may be non-binary fields with the same name.
- *
- * @param name the name of the field.
- * @return a <code>byte[]</code> containing the binary field value or null
- */
- public final byte[] getBinaryValue(String name) {
- for (int i=0; i < fields.size(); i++) {
- Fieldable field = (Fieldable)fields.get(i);
- if (field.name().equals(name) && (field.isBinary()))
- return field.binaryValue();
- }
return null;
}
/** Prints the fields of a document for human consumption. */
+ @Override
public final String toString() {
- StringBuffer buffer = new StringBuffer();
+ StringBuilder buffer = new StringBuilder();
buffer.append("Document<");
for (int i = 0; i < fields.size(); i++) {
- Fieldable field = (Fieldable)fields.get(i);
+ IndexableField field = fields.get(i);
buffer.append(field.toString());
- if (i != fields.size()-1)
+ if (i != fields.size()-1) {
buffer.append(" ");
+ }
}
buffer.append(">");
return buffer.toString();
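The reworked Document API above is, at heart, a name-filtered scan over an insertion-ordered field list: `get` returns the first match, `getValues` all matches, `removeField` removes only the first. A minimal stdlib-only sketch of that access pattern (`SimpleDoc`/`SimpleField` are hypothetical names; the real class stores `IndexableField`s and also exposes binary values):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Standalone sketch of the Document lookup pattern: fields live in
// insertion order in a List; lookups filter linearly by field name.
public class SimpleDoc {
  public static final class SimpleField {
    final String name; final String value;
    public SimpleField(String name, String value) { this.name = name; this.value = value; }
  }

  private final List<SimpleField> fields = new ArrayList<>();

  public void add(SimpleField f) { fields.add(f); }

  /** First value added under this name, or null (mirrors Document.get). */
  public String get(String name) {
    for (SimpleField f : fields) {
      if (f.name.equals(name)) return f.value;
    }
    return null;
  }

  /** All values under this name, never null (mirrors Document.getValues). */
  public String[] getValues(String name) {
    List<String> out = new ArrayList<>();
    for (SimpleField f : fields) {
      if (f.name.equals(name)) out.add(f.value);
    }
    return out.toArray(new String[0]);
  }

  /** Removes only the first field with this name (mirrors removeField). */
  public void removeField(String name) {
    Iterator<SimpleField> it = fields.iterator();
    while (it.hasNext()) {
      if (it.next().name.equals(name)) { it.remove(); return; }
    }
  }
}
```

As in the patched `removeField`, a single call leaves any later same-named fields in place, which is why the javadoc warns that removing a whole document requires deleting it from the index rather than stripping fields.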
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/DocumentStoredFieldVisitor.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/DoubleDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/DoubleField.java'.
Index: 3rdParty_sources/lucene/org/apache/lucene/document/Field.java
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/document/Field.java,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/document/Field.java 17 Aug 2012 14:54:53 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/document/Field.java 16 Dec 2014 11:31:59 -0000 1.1.2.1
@@ -1,6 +1,6 @@
package org.apache.lucene.document;
-/**
+/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
@@ -17,79 +17,660 @@
* limitations under the License.
*/
-import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.index.IndexWriter; // for javadoc
-import org.apache.lucene.util.Parameter;
-
+import java.io.IOException;
import java.io.Reader;
-import java.io.Serializable;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.NumericTokenStream;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
+import org.apache.lucene.document.FieldType.NumericType;
+import org.apache.lucene.index.IndexWriter; // javadocs
+import org.apache.lucene.index.IndexableField;
+import org.apache.lucene.index.IndexableFieldType;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.index.FieldInvertState; // javadocs
+
/**
- A field is a section of a Document. Each field has two parts, a name and a
- value. Values may be free text, provided as a String or as a Reader, or they
- may be atomic keywords, which are not further processed. Such keywords may
- be used to represent dates, urls, etc. Fields are optionally stored in the
- index, so that they may be returned with hits on the document.
- */
+ * Expert: directly create a field for a document. Most
+ * users should use one of the sugar subclasses: {@link
+ * IntField}, {@link LongField}, {@link FloatField}, {@link
+ * DoubleField}, {@link BinaryDocValuesField}, {@link
+ * NumericDocValuesField}, {@link SortedDocValuesField}, {@link
+ * StringField}, {@link TextField}, {@link StoredField}.
+ *
+ * <p>A field is a section of a Document. Each field has three
+ * parts: name, type and value. Values may be text
+ * (String, Reader or pre-analyzed TokenStream), binary
+ * (byte[]), or numeric (a Number). Fields are optionally stored in the
+ * index, so that they may be returned with hits on the document.
+ *
+ *
+ * NOTE: the field type is an {@link IndexableFieldType}. Making changes
+ * to the state of the IndexableFieldType will impact any
+ * Field it is used in. It is strongly recommended that no
+ * changes be made after Field instantiation.
+ */
+public class Field implements IndexableField {
-public final class Field extends AbstractField implements Fieldable, Serializable {
+ /**
+ * Field's type
+ */
+ protected final FieldType type;
+ /**
+ * Field's name
+ */
+ protected final String name;
+
+ /** Field's value */
+ protected Object fieldsData;
+
+ /** Pre-analyzed tokenStream for indexed fields; this is
+ * separate from fieldsData because you are allowed to
+ * have both; eg maybe field has a String value but you
+ * customize how it's tokenized */
+ protected TokenStream tokenStream;
+
+ /**
+ * Field's boost
+ * @see #boost()
+ */
+ protected float boost = 1.0f;
+
+ /**
+ * Expert: creates a field with no initial value.
+ * Intended only for custom Field subclasses.
+ * @param name field name
+ * @param type field type
+ * @throws IllegalArgumentException if either the name or type
+ * is null.
+ */
+ protected Field(String name, FieldType type) {
+ if (name == null) {
+ throw new IllegalArgumentException("name cannot be null");
+ }
+ this.name = name;
+ if (type == null) {
+ throw new IllegalArgumentException("type cannot be null");
+ }
+ this.type = type;
+ }
+
+ /**
+ * Create field with Reader value.
+ * @param name field name
+ * @param reader reader value
+ * @param type field type
+ * @throws IllegalArgumentException if either the name or type
+ * is null, or if the field's type is stored(), or
+ * if tokenized() is false.
+ * @throws NullPointerException if the reader is null
+ */
+ public Field(String name, Reader reader, FieldType type) {
+ if (name == null) {
+ throw new IllegalArgumentException("name cannot be null");
+ }
+ if (type == null) {
+ throw new IllegalArgumentException("type cannot be null");
+ }
+ if (reader == null) {
+ throw new NullPointerException("reader cannot be null");
+ }
+ if (type.stored()) {
+ throw new IllegalArgumentException("fields with a Reader value cannot be stored");
+ }
+ if (type.indexed() && !type.tokenized()) {
+ throw new IllegalArgumentException("non-tokenized fields must use String values");
+ }
+
+ this.name = name;
+ this.fieldsData = reader;
+ this.type = type;
+ }
+
+ /**
+ * Create field with TokenStream value.
+ * @param name field name
+ * @param tokenStream TokenStream value
+ * @param type field type
+ * @throws IllegalArgumentException if either the name or type
+ * is null, or if the field's type is stored(), or
+ * if tokenized() is false, or if indexed() is false.
+ * @throws NullPointerException if the tokenStream is null
+ */
+ public Field(String name, TokenStream tokenStream, FieldType type) {
+ if (name == null) {
+ throw new IllegalArgumentException("name cannot be null");
+ }
+ if (tokenStream == null) {
+ throw new NullPointerException("tokenStream cannot be null");
+ }
+ if (!type.indexed() || !type.tokenized()) {
+ throw new IllegalArgumentException("TokenStream fields must be indexed and tokenized");
+ }
+ if (type.stored()) {
+ throw new IllegalArgumentException("TokenStream fields cannot be stored");
+ }
+
+ this.name = name;
+ this.fieldsData = null;
+ this.tokenStream = tokenStream;
+ this.type = type;
+ }
- /** Specifies whether and how a field should be stored. */
- public static final class Store extends Parameter implements Serializable {
+ /**
+ * Create field with binary value.
+ *
+ * NOTE: the provided byte[] is not copied so be sure
+ * not to change it until you're done with this field.
+ * @param name field name
+ * @param value byte array pointing to binary content (not copied)
+ * @param type field type
+ * @throws IllegalArgumentException if the field name is null,
+ * or the field's type is indexed()
+ * @throws NullPointerException if the type is null
+ */
+ public Field(String name, byte[] value, FieldType type) {
+ this(name, value, 0, value.length, type);
+ }
- private Store(String name) {
- super(name);
+ /**
+ * Create field with binary value.
+ *
+ * NOTE: the provided byte[] is not copied so be sure
+ * not to change it until you're done with this field.
+ * @param name field name
+ * @param value byte array pointing to binary content (not copied)
+ * @param offset starting position of the byte array
+ * @param length valid length of the byte array
+ * @param type field type
+ * @throws IllegalArgumentException if the field name is null,
+ * or the field's type is indexed()
+ * @throws NullPointerException if the type is null
+ */
+ public Field(String name, byte[] value, int offset, int length, FieldType type) {
+ this(name, new BytesRef(value, offset, length), type);
+ }
+
+ /**
+ * Create field with binary value.
+ *
+ * NOTE: the provided BytesRef is not copied so be sure
+ * not to change it until you're done with this field.
+ * @param name field name
+ * @param bytes BytesRef pointing to binary content (not copied)
+ * @param type field type
+ * @throws IllegalArgumentException if the field name is null,
+ * or the field's type is indexed()
+ * @throws NullPointerException if the type is null
+ */
+ public Field(String name, BytesRef bytes, FieldType type) {
+ if (name == null) {
+ throw new IllegalArgumentException("name cannot be null");
}
+ if (bytes == null) {
+ throw new IllegalArgumentException("bytes cannot be null");
+ }
+ if (type.indexed()) {
+ throw new IllegalArgumentException("Fields with BytesRef values cannot be indexed");
+ }
+ this.fieldsData = bytes;
+ this.type = type;
+ this.name = name;
+ }
- /** Store the original field value in the index in a compressed form. This is
- * useful for long documents and for binary valued fields.
+ // TODO: allow direct construction of int, long, float, double value too..?
+
+ /**
+ * Create field with String value.
+ * @param name field name
+ * @param value string value
+ * @param type field type
+ * @throws IllegalArgumentException if either the name or value
+ * is null, or if the field's type is neither indexed() nor stored(),
+ * or if indexed() is false but storeTermVectors() is true.
+ * @throws NullPointerException if the type is null
+ */
+ public Field(String name, String value, FieldType type) {
+ if (name == null) {
+ throw new IllegalArgumentException("name cannot be null");
+ }
+ if (value == null) {
+ throw new IllegalArgumentException("value cannot be null");
+ }
+ if (!type.stored() && !type.indexed()) {
+ throw new IllegalArgumentException("it doesn't make sense to have a field that "
+ + "is neither indexed nor stored");
+ }
+ this.type = type;
+ this.name = name;
+ this.fieldsData = value;
+ }
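The constructors above replace the pre-4.0 Store/Index flags with an explicit FieldType. A minimal usage sketch of that style, assuming the Lucene 4.x `document` API (the field name and value are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;

public class FieldTypeExample {
  public static void main(String[] args) {
    // Sketch: a stored, analyzed text field built with the FieldType API.
    FieldType ft = new FieldType();
    ft.setStored(true);    // keep the original value for retrieval
    ft.setIndexed(true);   // make the field searchable
    ft.setTokenized(true); // analyze the value into tokens
    ft.freeze();           // reject further changes, as indexing expects

    Document doc = new Document();
    doc.add(new Field("title", "Lucene in Action", ft));
  }
}
```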
+
+ /**
+ * The value of the field as a String, or null. If null, the Reader value or
+ * binary value is used. Exactly one of stringValue(), readerValue(), and
+ * getBinaryValue() must be set.
+ */
+ @Override
+ public String stringValue() {
+ if (fieldsData instanceof String || fieldsData instanceof Number) {
+ return fieldsData.toString();
+ } else {
+ return null;
+ }
+ }
+
+ /**
+ * The value of the field as a Reader, or null. If null, the String value or
+ * binary value is used. Exactly one of stringValue(), readerValue(), and
+ * getBinaryValue() must be set.
+ */
+ @Override
+ public Reader readerValue() {
+ return fieldsData instanceof Reader ? (Reader) fieldsData : null;
+ }
+
+ /**
+ * The TokenStream for this field to be used when indexing, or null. If null,
+ * the Reader value or String value is analyzed to produce the indexed tokens.
+ */
+ public TokenStream tokenStreamValue() {
+ return tokenStream;
+ }
+
+ /**
+ *
+ * Expert: change the value of this field. This can be used during indexing to
+ * re-use a single Field instance to improve indexing speed by avoiding GC
+ * cost of new'ing and reclaiming Field instances. Typically a single
+ * {@link Document} instance is re-used as well. This helps most on small
+ * documents.
+ *
+ * Each Field instance should only be used once within a single
+ * {@link Document} instance. See ImproveIndexingSpeed for details.
+ *
+ */
+ public void setStringValue(String value) {
+ if (!(fieldsData instanceof String)) {
+ throw new IllegalArgumentException("cannot change value type from " + fieldsData.getClass().getSimpleName() + " to String");
+ }
+ if (value == null) {
+ throw new IllegalArgumentException("value cannot be null");
+ }
+ fieldsData = value;
+ }
+
+ /**
+ * Expert: change the value of this field. See
+ * {@link #setStringValue(String)}.
+ */
+ public void setReaderValue(Reader value) {
+ if (!(fieldsData instanceof Reader)) {
+ throw new IllegalArgumentException("cannot change value type from " + fieldsData.getClass().getSimpleName() + " to Reader");
+ }
+ fieldsData = value;
+ }
+
+ /**
+ * Expert: change the value of this field. See
+ * {@link #setStringValue(String)}.
+ */
+ public void setBytesValue(byte[] value) {
+ setBytesValue(new BytesRef(value));
+ }
+
+ /**
+ * Expert: change the value of this field. See
+ * {@link #setStringValue(String)}.
+ *
+ * NOTE: the provided BytesRef is not copied so be sure
+ * not to change it until you're done with this field.
+ */
+ public void setBytesValue(BytesRef value) {
+ if (!(fieldsData instanceof BytesRef)) {
+ throw new IllegalArgumentException("cannot change value type from " + fieldsData.getClass().getSimpleName() + " to BytesRef");
+ }
+ if (type.indexed()) {
+ throw new IllegalArgumentException("cannot set a BytesRef value on an indexed field");
+ }
+ if (value == null) {
+ throw new IllegalArgumentException("value cannot be null");
+ }
+ fieldsData = value;
+ }
+
+ /**
+ * Expert: change the value of this field. See
+ * {@link #setStringValue(String)}.
+ */
+ public void setByteValue(byte value) {
+ if (!(fieldsData instanceof Byte)) {
+ throw new IllegalArgumentException("cannot change value type from " + fieldsData.getClass().getSimpleName() + " to Byte");
+ }
+ fieldsData = Byte.valueOf(value);
+ }
+
+ /**
+ * Expert: change the value of this field. See
+ * {@link #setStringValue(String)}.
+ */
+ public void setShortValue(short value) {
+ if (!(fieldsData instanceof Short)) {
+ throw new IllegalArgumentException("cannot change value type from " + fieldsData.getClass().getSimpleName() + " to Short");
+ }
+ fieldsData = Short.valueOf(value);
+ }
+
+ /**
+ * Expert: change the value of this field. See
+ * {@link #setStringValue(String)}.
+ */
+ public void setIntValue(int value) {
+ if (!(fieldsData instanceof Integer)) {
+ throw new IllegalArgumentException("cannot change value type from " + fieldsData.getClass().getSimpleName() + " to Integer");
+ }
+ fieldsData = Integer.valueOf(value);
+ }
+
+ /**
+ * Expert: change the value of this field. See
+ * {@link #setStringValue(String)}.
+ */
+ public void setLongValue(long value) {
+ if (!(fieldsData instanceof Long)) {
+ throw new IllegalArgumentException("cannot change value type from " + fieldsData.getClass().getSimpleName() + " to Long");
+ }
+ fieldsData = Long.valueOf(value);
+ }
+
+ /**
+ * Expert: change the value of this field. See
+ * {@link #setStringValue(String)}.
+ */
+ public void setFloatValue(float value) {
+ if (!(fieldsData instanceof Float)) {
+ throw new IllegalArgumentException("cannot change value type from " + fieldsData.getClass().getSimpleName() + " to Float");
+ }
+ fieldsData = Float.valueOf(value);
+ }
+
+ /**
+ * Expert: change the value of this field. See
+ * {@link #setStringValue(String)}.
+ */
+ public void setDoubleValue(double value) {
+ if (!(fieldsData instanceof Double)) {
+ throw new IllegalArgumentException("cannot change value type from " + fieldsData.getClass().getSimpleName() + " to Double");
+ }
+ fieldsData = Double.valueOf(value);
+ }
+
+ /**
+ * Expert: sets the token stream to be used for indexing and causes
+ * isIndexed() and isTokenized() to return true. May be combined with stored
+ * values from stringValue() or getBinaryValue()
+ */
+ public void setTokenStream(TokenStream tokenStream) {
+ if (!type.indexed() || !type.tokenized()) {
+ throw new IllegalArgumentException("TokenStream fields must be indexed and tokenized");
+ }
+ if (type.numericType() != null) {
+ throw new IllegalArgumentException("cannot set private TokenStream on numeric fields");
+ }
+ this.tokenStream = tokenStream;
+ }
+
+ @Override
+ public String name() {
+ return name;
+ }
+
+ /**
+ * {@inheritDoc}
+ *
+ * The default value is 1.0f (no boost).
+ * @see #setBoost(float)
+ */
+ @Override
+ public float boost() {
+ return boost;
+ }
+
+ /**
+ * Sets the boost factor on this field.
+ * @throws IllegalArgumentException if this field is not indexed,
+ * or if it omits norms.
+ * @see #boost()
+ */
+ public void setBoost(float boost) {
+ if (boost != 1.0f) {
+ if (type.indexed() == false || type.omitNorms()) {
+ throw new IllegalArgumentException("You cannot set an index-time boost on an unindexed field, or one that omits norms");
+ }
+ }
+ this.boost = boost;
+ }
+
+ @Override
+ public Number numericValue() {
+ if (fieldsData instanceof Number) {
+ return (Number) fieldsData;
+ } else {
+ return null;
+ }
+ }
+
+ @Override
+ public BytesRef binaryValue() {
+ if (fieldsData instanceof BytesRef) {
+ return (BytesRef) fieldsData;
+ } else {
+ return null;
+ }
+ }
+
+ /** Prints a Field for human consumption. */
+ @Override
+ public String toString() {
+ StringBuilder result = new StringBuilder();
+ result.append(type.toString());
+ result.append('<');
+ result.append(name);
+ result.append(':');
+
+ if (fieldsData != null) {
+ result.append(fieldsData);
+ }
+
+ result.append('>');
+ return result.toString();
+ }
+
+ /** Returns the {@link FieldType} for this field. */
+ @Override
+ public FieldType fieldType() {
+ return type;
+ }
+
+ @Override
+ public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) throws IOException {
+ if (!fieldType().indexed()) {
+ return null;
+ }
+
+ final NumericType numericType = fieldType().numericType();
+ if (numericType != null) {
+ if (!(reuse instanceof NumericTokenStream && ((NumericTokenStream)reuse).getPrecisionStep() == type.numericPrecisionStep())) {
+ // lazy init the TokenStream as it is heavy to instantiate
+ // (attributes,...) if not needed (stored field loading)
+ reuse = new NumericTokenStream(type.numericPrecisionStep());
+ }
+ final NumericTokenStream nts = (NumericTokenStream) reuse;
+ // initialize value in TokenStream
+ final Number val = (Number) fieldsData;
+ switch (numericType) {
+ case INT:
+ nts.setIntValue(val.intValue());
+ break;
+ case LONG:
+ nts.setLongValue(val.longValue());
+ break;
+ case FLOAT:
+ nts.setFloatValue(val.floatValue());
+ break;
+ case DOUBLE:
+ nts.setDoubleValue(val.doubleValue());
+ break;
+ default:
+ throw new AssertionError("Should never get here");
+ }
+ return reuse;
+ }
+
+ if (!fieldType().tokenized()) {
+ if (stringValue() == null) {
+ throw new IllegalArgumentException("Non-Tokenized Fields must have a String value");
+ }
+ if (!(reuse instanceof StringTokenStream)) {
+ // lazy init the TokenStream as it is heavy to instantiate
+ // (attributes,...) if not needed (stored field loading)
+ reuse = new StringTokenStream();
+ }
+ ((StringTokenStream) reuse).setValue(stringValue());
+ return reuse;
+ }
+
+ if (tokenStream != null) {
+ return tokenStream;
+ } else if (readerValue() != null) {
+ return analyzer.tokenStream(name(), readerValue());
+ } else if (stringValue() != null) {
+ return analyzer.tokenStream(name(), stringValue());
+ }
+
+ throw new IllegalArgumentException("Field must have either TokenStream, String, Reader or Number value; got " + this);
+ }
+
+ static final class StringTokenStream extends TokenStream {
+ private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
+ private final OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
+ private boolean used = false;
+ private String value = null;
+
+ /** Creates a new TokenStream that returns a String as single token.
+ * Warning: Does not initialize the value, you must call
+ * {@link #setValue(String)} afterwards!
*/
- public static final Store COMPRESS = new Store("COMPRESS");
+ StringTokenStream() {
+ }
+
+ /** Sets the string value. */
+ void setValue(String value) {
+ this.value = value;
+ }
+ @Override
+ public boolean incrementToken() {
+ if (used) {
+ return false;
+ }
+ clearAttributes();
+ termAttribute.append(value);
+ offsetAttribute.setOffset(0, value.length());
+ used = true;
+ return true;
+ }
+
+ @Override
+ public void end() throws IOException {
+ super.end();
+ final int finalOffset = value.length();
+ offsetAttribute.setOffset(finalOffset, finalOffset);
+ }
+
+ @Override
+ public void reset() {
+ used = false;
+ }
+
+ @Override
+ public void close() {
+ value = null;
+ }
+ }
+
+ /** Specifies whether and how a field should be stored. */
+ public static enum Store {
+
/** Store the original field value in the index. This is useful for short texts
* like a document's title which should be displayed with the results. The
* value is stored in its original form, i.e. no analyzer is used before it is
* stored.
*/
- public static final Store YES = new Store("YES");
+ YES,
- /** Do not store the field value in the index. */
- public static final Store NO = new Store("NO");
+ /** Do not store the field's value in the index. */
+ NO
}
- /** Specifies whether and how a field should be indexed. */
- public static final class Index extends Parameter implements Serializable {
+ //
+ // Deprecated transition API below:
+ //
- private Index(String name) {
- super(name);
- }
+ /** Specifies whether and how a field should be indexed.
+ *
+ * @deprecated This is here only to ease transition from
+ * the pre-4.0 APIs. */
+ @Deprecated
+ public static enum Index {
/** Do not index the field value. This field can thus not be searched,
* but one can still access its contents provided it is
* {@link Field.Store stored}. */
- public static final Index NO = new Index("NO");
+ NO {
+ @Override
+ public boolean isIndexed() { return false; }
+ @Override
+ public boolean isAnalyzed() { return false; }
+ @Override
+ public boolean omitNorms() { return true; }
+ },
/** Index the tokens produced by running the field's
* value through an Analyzer. This is useful for
* common text. */
- public static final Index ANALYZED = new Index("ANALYZED");
+ ANALYZED {
+ @Override
+ public boolean isIndexed() { return true; }
+ @Override
+ public boolean isAnalyzed() { return true; }
+ @Override
+ public boolean omitNorms() { return false; }
+ },
- /** @deprecated this has been renamed to {@link #ANALYZED} */
- public static final Index TOKENIZED = ANALYZED;
-
/** Index the field's value without using an Analyzer, so it can be searched.
* As no analyzer is used the value will be stored as a single term. This is
* useful for unique Ids like product numbers.
*/
- public static final Index NOT_ANALYZED = new Index("NOT_ANALYZED");
+ NOT_ANALYZED {
+ @Override
+ public boolean isIndexed() { return true; }
+ @Override
+ public boolean isAnalyzed() { return false; }
+ @Override
+ public boolean omitNorms() { return false; }
+ },
- /** @deprecated This has been renamed to {@link #NOT_ANALYZED} */
- public static final Index UN_TOKENIZED = NOT_ANALYZED;
-
/** Expert: Index the field's value without an Analyzer,
- * and also disable the storing of norms. Note that you
+ * and also disable the indexing of norms. Note that you
* can also separately enable/disable norms by calling
- * {@link #setOmitNorms}. No norms means that
+ * {@link FieldType#setOmitNorms}. No norms means that
* index-time field and document boosting and field
* length normalization are disabled. The benefit is
* less memory usage as norms take up one byte of RAM
@@ -100,48 +681,118 @@
* above described effect on a field, all instances of
* that field must be indexed with NOT_ANALYZED_NO_NORMS
* from the beginning. */
- public static final Index NOT_ANALYZED_NO_NORMS = new Index("NOT_ANALYZED_NO_NORMS");
+ NOT_ANALYZED_NO_NORMS {
+ @Override
+ public boolean isIndexed() { return true; }
+ @Override
+ public boolean isAnalyzed() { return false; }
+ @Override
+ public boolean omitNorms() { return true; }
+ },
- /** @deprecated This has been renamed to
- * {@link #NOT_ANALYZED_NO_NORMS} */
- public static final Index NO_NORMS = NOT_ANALYZED_NO_NORMS;
-
/** Expert: Index the tokens produced by running the
* field's value through an Analyzer, and also
* separately disable the storing of norms. See
* {@link #NOT_ANALYZED_NO_NORMS} for what norms are
* and why you may want to disable them. */
- public static final Index ANALYZED_NO_NORMS = new Index("ANALYZED_NO_NORMS");
+ ANALYZED_NO_NORMS {
+ @Override
+ public boolean isIndexed() { return true; }
+ @Override
+ public boolean isAnalyzed() { return true; }
+ @Override
+ public boolean omitNorms() { return true; }
+ };
+
+ /** Get the best representation of the index given the flags. */
+ public static Index toIndex(boolean indexed, boolean analyzed) {
+ return toIndex(indexed, analyzed, false);
+ }
+
+ /** Expert: Get the best representation of the index given the flags. */
+ public static Index toIndex(boolean indexed, boolean analyzed, boolean omitNorms) {
+
+ // If it is not indexed nothing else matters
+ if (!indexed) {
+ return Index.NO;
+ }
+
+ // typical, non-expert
+ if (!omitNorms) {
+ if (analyzed) {
+ return Index.ANALYZED;
+ }
+ return Index.NOT_ANALYZED;
+ }
+
+ // Expert: Norms omitted
+ if (analyzed) {
+ return Index.ANALYZED_NO_NORMS;
+ }
+ return Index.NOT_ANALYZED_NO_NORMS;
+ }
+
+ public abstract boolean isIndexed();
+ public abstract boolean isAnalyzed();
+ public abstract boolean omitNorms();
}
- /** Specifies whether and how a field should have term vectors. */
- public static final class TermVector extends Parameter implements Serializable {
+ /** Specifies whether and how a field should have term vectors.
+ *
+ * @deprecated This is here only to ease transition from
+ * the pre-4.0 APIs. */
+ @Deprecated
+ public static enum TermVector {
- private TermVector(String name) {
- super(name);
- }
-
/** Do not store term vectors.
*/
- public static final TermVector NO = new TermVector("NO");
+ NO {
+ @Override
+ public boolean isStored() { return false; }
+ @Override
+ public boolean withPositions() { return false; }
+ @Override
+ public boolean withOffsets() { return false; }
+ },
/** Store the term vectors of each document. A term vector is a list
- * of the document's terms and their number of occurences in that document. */
- public static final TermVector YES = new TermVector("YES");
+ * of the document's terms and their number of occurrences in that document. */
+ YES {
+ @Override
+ public boolean isStored() { return true; }
+ @Override
+ public boolean withPositions() { return false; }
+ @Override
+ public boolean withOffsets() { return false; }
+ },
/**
* Store the term vector + token position information
*
* @see #YES
*/
- public static final TermVector WITH_POSITIONS = new TermVector("WITH_POSITIONS");
+ WITH_POSITIONS {
+ @Override
+ public boolean isStored() { return true; }
+ @Override
+ public boolean withPositions() { return true; }
+ @Override
+ public boolean withOffsets() { return false; }
+ },
/**
* Store the term vector + Token offset information
*
* @see #YES
*/
- public static final TermVector WITH_OFFSETS = new TermVector("WITH_OFFSETS");
+ WITH_OFFSETS {
+ @Override
+ public boolean isStored() { return true; }
+ @Override
+ public boolean withPositions() { return false; }
+ @Override
+ public boolean withOffsets() { return true; }
+ },
/**
* Store the term vector + Token position and offset information
@@ -150,90 +801,100 @@
* @see #WITH_POSITIONS
* @see #WITH_OFFSETS
*/
- public static final TermVector WITH_POSITIONS_OFFSETS = new TermVector("WITH_POSITIONS_OFFSETS");
- }
-
-
- /** The value of the field as a String, or null. If null, the Reader value,
- * binary value, or TokenStream value is used. Exactly one of stringValue(),
- * readerValue(), getBinaryValue(), and tokenStreamValue() must be set. */
- public String stringValue() { return fieldsData instanceof String ? (String)fieldsData : null; }
-
- /** The value of the field as a Reader, or null. If null, the String value,
- * binary value, or TokenStream value is used. Exactly one of stringValue(),
- * readerValue(), getBinaryValue(), and tokenStreamValue() must be set. */
- public Reader readerValue() { return fieldsData instanceof Reader ? (Reader)fieldsData : null; }
-
- /** The value of the field in Binary, or null. If null, the Reader value,
- * String value, or TokenStream value is used. Exactly one of stringValue(),
- * readerValue(), getBinaryValue(), and tokenStreamValue() must be set.
- * @deprecated This method must allocate a new byte[] if
- * the {@link AbstractField#getBinaryOffset()} is non-zero
- * or {@link AbstractField#getBinaryLength()} is not the
- * full length of the byte[]. Please use {@link
- * AbstractField#getBinaryValue()} instead, which simply
- * returns the byte[].
- */
- public byte[] binaryValue() {
- if (!isBinary)
- return null;
- final byte[] data = (byte[]) fieldsData;
- if (binaryOffset == 0 && data.length == binaryLength)
- return data; //Optimization
-
- final byte[] ret = new byte[binaryLength];
- System.arraycopy(data, binaryOffset, ret, 0, binaryLength);
- return ret;
- }
-
- /** The value of the field as a TokesStream, or null. If null, the Reader value,
- * String value, or binary value is used. Exactly one of stringValue(),
- * readerValue(), getBinaryValue(), and tokenStreamValue() must be set. */
- public TokenStream tokenStreamValue() { return fieldsData instanceof TokenStream ? (TokenStream)fieldsData : null; }
-
+ WITH_POSITIONS_OFFSETS {
+ @Override
+ public boolean isStored() { return true; }
+ @Override
+ public boolean withPositions() { return true; }
+ @Override
+ public boolean withOffsets() { return true; }
+ };
- /** Expert: change the value of this field. This can
- * be used during indexing to re-use a single Field
- * instance to improve indexing speed by avoiding GC cost
- * of new'ing and reclaiming Field instances. Typically
- * a single {@link Document} instance is re-used as
- * well. This helps most on small documents.
- *
- * Note that you should only use this method after the
- * Field has been consumed (ie, the {@link Document}
- * containing this Field has been added to the index).
- * Also, each Field instance should only be used once
- * within a single {@link Document} instance. See ImproveIndexingSpeed
- * for details.
*/
- public void setValue(String value) {
- fieldsData = value;
- }
+ /** Get the best representation of a TermVector given the flags. */
+ public static TermVector toTermVector(boolean stored, boolean withOffsets, boolean withPositions) {
- /** Expert: change the value of this field. See setValue(String) . */
- public void setValue(Reader value) {
- fieldsData = value;
- }
+ // If it is not stored, nothing else matters.
+ if (!stored) {
+ return TermVector.NO;
+ }
- /** Expert: change the value of this field. See setValue(String) . */
- public void setValue(byte[] value) {
- fieldsData = value;
- binaryLength = value.length;
- binaryOffset = 0;
+ if (withOffsets) {
+ if (withPositions) {
+ return Field.TermVector.WITH_POSITIONS_OFFSETS;
+ }
+ return Field.TermVector.WITH_OFFSETS;
+ }
+
+ if (withPositions) {
+ return Field.TermVector.WITH_POSITIONS;
+ }
+ return Field.TermVector.YES;
+ }
+
+ public abstract boolean isStored();
+ public abstract boolean withPositions();
+ public abstract boolean withOffsets();
}
- /** Expert: change the value of this field. See setValue(String) . */
- public void setValue(byte[] value, int offset, int length) {
- fieldsData = value;
- binaryLength = length;
- binaryOffset = offset;
+ /** Translates the pre-4.0 enums for specifying how a
+ * field should be indexed into the 4.0 {@link FieldType}
+ * approach.
+ *
+ * @deprecated This is here only to ease transition from
+ * the pre-4.0 APIs.
+ */
+ @Deprecated
+ public static final FieldType translateFieldType(Store store, Index index, TermVector termVector) {
+ final FieldType ft = new FieldType();
+
+ ft.setStored(store == Store.YES);
+
+ switch(index) {
+ case ANALYZED:
+ ft.setIndexed(true);
+ ft.setTokenized(true);
+ break;
+ case ANALYZED_NO_NORMS:
+ ft.setIndexed(true);
+ ft.setTokenized(true);
+ ft.setOmitNorms(true);
+ break;
+ case NOT_ANALYZED:
+ ft.setIndexed(true);
+ ft.setTokenized(false);
+ break;
+ case NOT_ANALYZED_NO_NORMS:
+ ft.setIndexed(true);
+ ft.setTokenized(false);
+ ft.setOmitNorms(true);
+ break;
+ case NO:
+ break;
+ }
+
+ switch(termVector) {
+ case NO:
+ break;
+ case YES:
+ ft.setStoreTermVectors(true);
+ break;
+ case WITH_POSITIONS:
+ ft.setStoreTermVectors(true);
+ ft.setStoreTermVectorPositions(true);
+ break;
+ case WITH_OFFSETS:
+ ft.setStoreTermVectors(true);
+ ft.setStoreTermVectorOffsets(true);
+ break;
+ case WITH_POSITIONS_OFFSETS:
+ ft.setStoreTermVectors(true);
+ ft.setStoreTermVectorPositions(true);
+ ft.setStoreTermVectorOffsets(true);
+ break;
+ }
+ ft.freeze();
+ return ft;
}
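As a sanity check on the translation above, the legacy triple (Store.YES, Index.NOT_ANALYZED_NO_NORMS, TermVector.NO) should map to a stored, indexed, untokenized, norms-omitting FieldType. A sketch, assuming the Lucene 4.x API and JVM assertions enabled with `-ea`:

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;

public class TranslateExample {
  public static void main(String[] args) {
    // Sketch: the deprecated flags map onto an equivalent frozen FieldType.
    FieldType ft = Field.translateFieldType(
        Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS, Field.TermVector.NO);
    assert ft.stored();      // Store.YES
    assert ft.indexed();     // NOT_ANALYZED_NO_NORMS is indexed
    assert !ft.tokenized();  // ...but not analyzed
    assert ft.omitNorms();   // ...and omits norms
  }
}
```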
-
-
- /** Expert: change the value of this field. See setValue(String) . */
- public void setValue(TokenStream value) {
- fieldsData = value;
- }
/**
* Create a field by specifying its name, value and how it will
@@ -246,11 +907,13 @@
* be tokenized before indexing
* @throws NullPointerException if name or value is null
* @throws IllegalArgumentException if the field is neither stored nor indexed
- */
+ *
+ * @deprecated Use {@link StringField}, {@link TextField} instead. */
+ @Deprecated
public Field(String name, String value, Store store, Index index) {
- this(name, value, store, index, TermVector.NO);
+ this(name, value, translateFieldType(store, index, TermVector.NO));
}
-
+
/**
* Create a field by specifying its name, value and how it will
* be saved in the index.
@@ -267,168 +930,96 @@
* the field is neither stored nor indexed
* the field is not indexed but termVector is TermVector.YES
*
- */
- public Field(String name, String value, Store store, Index index, TermVector termVector) {
- if (name == null)
- throw new NullPointerException("name cannot be null");
- if (value == null)
- throw new NullPointerException("value cannot be null");
- if (name.length() == 0 && value.length() == 0)
- throw new IllegalArgumentException("name and value cannot both be empty");
- if (index == Index.NO && store == Store.NO)
- throw new IllegalArgumentException("it doesn't make sense to have a field that "
- + "is neither indexed nor stored");
- if (index == Index.NO && termVector != TermVector.NO)
- throw new IllegalArgumentException("cannot store term vector information "
- + "for a field that is not indexed");
-
- this.name = name.intern(); // field names are interned
- this.fieldsData = value;
-
- if (store == Store.YES){
- this.isStored = true;
- this.isCompressed = false;
- }
- else if (store == Store.COMPRESS) {
- this.isStored = true;
- this.isCompressed = true;
- }
- else if (store == Store.NO){
- this.isStored = false;
- this.isCompressed = false;
- }
- else
- throw new IllegalArgumentException("unknown store parameter " + store);
-
- if (index == Index.NO) {
- this.isIndexed = false;
- this.isTokenized = false;
- } else if (index == Index.ANALYZED) {
- this.isIndexed = true;
- this.isTokenized = true;
- } else if (index == Index.NOT_ANALYZED) {
- this.isIndexed = true;
- this.isTokenized = false;
- } else if (index == Index.NOT_ANALYZED_NO_NORMS) {
- this.isIndexed = true;
- this.isTokenized = false;
- this.omitNorms = true;
- } else if (index == Index.ANALYZED_NO_NORMS) {
- this.isIndexed = true;
- this.isTokenized = true;
- this.omitNorms = true;
- } else {
- throw new IllegalArgumentException("unknown index parameter " + index);
- }
-
- this.isBinary = false;
-
- setStoreTermVector(termVector);
+ *
+ * @deprecated Use {@link StringField}, {@link TextField} instead. */
+ @Deprecated
+ public Field(String name, String value, Store store, Index index, TermVector termVector) {
+ this(name, value, translateFieldType(store, index, termVector));
}
/**
* Create a tokenized and indexed field that is not stored. Term vectors will
* not be stored. The Reader is read only when the Document is added to the index,
- * i.e. you may not close the Reader until {@link IndexWriter#addDocument(Document)}
+ * i.e. you may not close the Reader until {@link IndexWriter#addDocument}
* has been called.
*
* @param name The name of the field
* @param reader The reader with the content
* @throws NullPointerException if name or reader is null
+ *
+ * @deprecated Use {@link TextField} instead.
*/
+ @Deprecated
public Field(String name, Reader reader) {
this(name, reader, TermVector.NO);
}
/**
* Create a tokenized and indexed field that is not stored, optionally with
* storing term vectors. The Reader is read only when the Document is added to the index,
- * i.e. you may not close the Reader until {@link IndexWriter#addDocument(Document)}
+ * i.e. you may not close the Reader until {@link IndexWriter#addDocument}
* has been called.
*
* @param name The name of the field
* @param reader The reader with the content
* @param termVector Whether term vector should be stored
* @throws NullPointerException if name or reader is null
+ *
+ * @deprecated Use {@link TextField} instead.
*/
+ @Deprecated
public Field(String name, Reader reader, TermVector termVector) {
- if (name == null)
- throw new NullPointerException("name cannot be null");
- if (reader == null)
- throw new NullPointerException("reader cannot be null");
-
- this.name = name.intern(); // field names are interned
- this.fieldsData = reader;
-
- this.isStored = false;
- this.isCompressed = false;
-
- this.isIndexed = true;
- this.isTokenized = true;
-
- this.isBinary = false;
-
- setStoreTermVector(termVector);
+ this(name, reader, translateFieldType(Store.NO, Index.ANALYZED, termVector));
}
/**
* Create a tokenized and indexed field that is not stored. Term vectors will
* not be stored. This is useful for pre-analyzed fields.
* The TokenStream is read only when the Document is added to the index,
- * i.e. you may not close the TokenStream until {@link IndexWriter#addDocument(Document)}
+ * i.e. you may not close the TokenStream until {@link IndexWriter#addDocument}
* has been called.
*
* @param name The name of the field
* @param tokenStream The TokenStream with the content
* @throws NullPointerException if name or tokenStream is null
+ *
+ * @deprecated Use {@link TextField} instead
*/
+ @Deprecated
public Field(String name, TokenStream tokenStream) {
this(name, tokenStream, TermVector.NO);
}
-
+
/**
* Create a tokenized and indexed field that is not stored, optionally with
* storing term vectors. This is useful for pre-analyzed fields.
* The TokenStream is read only when the Document is added to the index,
- * i.e. you may not close the TokenStream until {@link IndexWriter#addDocument(Document)}
+ * i.e. you may not close the TokenStream until {@link IndexWriter#addDocument}
* has been called.
*
* @param name The name of the field
* @param tokenStream The TokenStream with the content
* @param termVector Whether term vector should be stored
* @throws NullPointerException if name or tokenStream is null
+ *
+ * @deprecated Use {@link TextField} instead
*/
+ @Deprecated
public Field(String name, TokenStream tokenStream, TermVector termVector) {
- if (name == null)
- throw new NullPointerException("name cannot be null");
- if (tokenStream == null)
- throw new NullPointerException("tokenStream cannot be null");
-
- this.name = name.intern(); // field names are interned
- this.fieldsData = tokenStream;
-
- this.isStored = false;
- this.isCompressed = false;
-
- this.isIndexed = true;
- this.isTokenized = true;
-
- this.isBinary = false;
-
- setStoreTermVector(termVector);
+ this(name, tokenStream, translateFieldType(Store.NO, Index.ANALYZED, termVector));
}
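For the pre-analyzed case, TextField likewise accepts a TokenStream. A hedged sketch, assuming a Lucene 4.x StandardAnalyzer (field name and sample text are made up); as the javadoc above warns, the stream must stay open until IndexWriter.addDocument has consumed it:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.Version;

public class TokenStreamFieldExample {
  public static void main(String[] args) throws IOException {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

    // Produce the token stream up front, e.g. to inspect or filter it
    // before handing it to the field.
    TokenStream stream =
        analyzer.tokenStream("body", new StringReader("pre-analyzed text"));

    // Deprecated form: new Field("body", stream, Field.TermVector.NO)
    Field body = new TextField("body", stream);

    Document doc = new Document();
    doc.add(body);
    // Do not close the stream here: IndexWriter.addDocument(doc) reads it.
  }
}
```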
-
/**
* Create a stored field with binary value. Optionally the value may be compressed.
*
* @param name The name of the field
* @param value The binary value
- * @param store How value should be stored (compressed or not)
- * @throws IllegalArgumentException if store is Store.NO
+ *
+ * @deprecated Use {@link StoredField} instead.
*/
- public Field(String name, byte[] value, Store store) {
- this(name, value, 0, value.length, store);
+ @Deprecated
+ public Field(String name, byte[] value) {
+ this(name, value, translateFieldType(Store.YES, Index.NO, TermVector.NO));
}
/**
@@ -438,39 +1029,11 @@
* @param value The binary value
* @param offset Starting offset in value where this Field's bytes are
* @param length Number of bytes to use for this Field, starting at offset
- * @param store How value should be stored (compressed or not)
- * @throws IllegalArgumentException if store is Store.NO
+ *
+ * @deprecated Use {@link StoredField} instead.
*/
- public Field(String name, byte[] value, int offset, int length, Store store) {
-
- if (name == null)
- throw new IllegalArgumentException("name cannot be null");
- if (value == null)
- throw new IllegalArgumentException("value cannot be null");
-
- this.name = name.intern();
- fieldsData = value;
-
- if (store == Store.YES) {
- isStored = true;
- isCompressed = false;
- }
- else if (store == Store.COMPRESS) {
- isStored = true;
- isCompressed = true;
- }
- else if (store == Store.NO)
- throw new IllegalArgumentException("binary values can't be unstored");
- else
- throw new IllegalArgumentException("unknown store parameter " + store);
-
- isIndexed = false;
- isTokenized = false;
-
- isBinary = true;
- binaryLength = length;
- binaryOffset = offset;
-
- setStoreTermVector(TermVector.NO);
+ @Deprecated
+ public Field(String name, byte[] value, int offset, int length) {
+ this(name, value, offset, length, translateFieldType(Store.YES, Index.NO, TermVector.NO));
}
}
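The deleted binary constructors map onto StoredField, which (in the Lucene 4.x API) offers both whole-array and offset/length variants. A minimal sketch of the migration (field names are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;

public class BinaryFieldExample {
  public static void main(String[] args) {
    byte[] payload = {0x10, 0x20, 0x30, 0x40};

    Document doc = new Document();

    // Deprecated form: new Field("blob", payload) -- stored, not indexed.
    doc.add(new StoredField("blob", payload));

    // Deprecated form: new Field("slice", payload, 1, 2)
    // Stores two bytes of the array, starting at offset 1.
    doc.add(new StoredField("slice", payload, 1, 2));

    // Note: the old Store.COMPRESS option is gone; callers compress
    // values themselves before storing, if needed.
  }
}
```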
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/FieldSelector.java'.
Fisheye: No comparison available. Pass `N' to diff?
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/FieldSelectorResult.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/FieldType.java'.
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/Fieldable.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/FloatDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/FloatField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/IntDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/IntField.java'.
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/LoadFirstFieldSelector.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/LongDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/LongField.java'.
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/MapFieldSelector.java'.
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/NumberTools.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/NumericDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/PackedLongDocValuesField.java'.
Fisheye: Tag 1.1.2.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/SetBasedFieldSelector.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/ShortDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/SortedBytesDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/SortedDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/SortedNumericDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/SortedSetDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/StoredField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/StraightBytesDocValuesField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/StringField.java'.
Fisheye: Tag 1.1 refers to a dead (removed) revision in file `3rdParty_sources/lucene/org/apache/lucene/document/TextField.java'.
Index: 3rdParty_sources/lucene/org/apache/lucene/document/package.html
===================================================================
RCS file: /usr/local/cvsroot/3rdParty_sources/lucene/org/apache/lucene/document/package.html,v
diff -u -r1.1 -r1.1.2.1
--- 3rdParty_sources/lucene/org/apache/lucene/document/package.html 17 Aug 2012 14:54:53 -0000 1.1
+++ 3rdParty_sources/lucene/org/apache/lucene/document/package.html 16 Dec 2014 11:31:58 -0000 1.1.2.1
@@ -22,33 +22,26 @@
The logical representation of a {@link org.apache.lucene.document.Document} for indexing and searching.
The document package provides the user level logical representation of content to be indexed and searched. The
-package also provides utilities for working with {@link org.apache.lucene.document.Document}s and {@link org.apache.lucene.document.Fieldable}s.
-Document and Fieldable
-A {@link org.apache.lucene.document.Document} is a collection of {@link org.apache.lucene.document.Fieldable}s. A
- {@link org.apache.lucene.document.Fieldable} is a logical representation of a user's content that needs to be indexed or stored.
- {@link org.apache.lucene.document.Fieldable}s have a number of properties that tell Lucene how to treat the content (like indexed, tokenized,
- stored, etc.) See the {@link org.apache.lucene.document.Field} implementation of {@link org.apache.lucene.document.Fieldable}
+package also provides utilities for working with {@link org.apache.lucene.document.Document}s and {@link org.apache.lucene.index.IndexableField}s.
+Document and IndexableField
+A {@link org.apache.lucene.document.Document} is a collection of {@link org.apache.lucene.index.IndexableField}s. A
+ {@link org.apache.lucene.index.IndexableField} is a logical representation of a user's content that needs to be indexed or stored.
+ {@link org.apache.lucene.index.IndexableField}s have a number of properties that tell Lucene how to treat the content (like indexed, tokenized,
+ stored, etc.) See the {@link org.apache.lucene.document.Field} implementation of {@link org.apache.lucene.index.IndexableField}
for specifics on these properties.
Note: it is common to refer to {@link org.apache.lucene.document.Document}s having {@link org.apache.lucene.document.Field}s, even though technically they have
-{@link org.apache.lucene.document.Fieldable}s.
+{@link org.apache.lucene.index.IndexableField}s.
Working with Documents
First and foremost, a {@link org.apache.lucene.document.Document} is something created by the user application. It is your job
to create Documents based on the content of the files you are working with in your application (Word, txt, PDF, Excel or any other format.)
How this is done is completely up to you. That being said, there are many tools available in other projects that can make
- the process of taking a file and converting it into a Lucene {@link org.apache.lucene.document.Document}. To see an example of this,
- take a look at the Lucene demo and the associated source code
- for extracting content from HTML.
+ the process of taking a file and converting it into a Lucene {@link org.apache.lucene.document.Document}.
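Assembling a Document from IndexableFields, as described above, can be sketched as follows; this assumes the Lucene 4.x field classes, and the field names and values are invented for illustration:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class DocumentExample {
  public static void main(String[] args) {
    Document doc = new Document();

    // Atomic, non-tokenized identifier: indexed as a single term.
    doc.add(new StringField("id", "doc-42", Field.Store.YES));

    // Free text: tokenized and indexed, stored for display in results.
    doc.add(new TextField("title", "Working with Documents", Field.Store.YES));

    // Large original content: stored only, never searched directly.
    doc.add(new StoredField("raw", "...original file contents..."));
  }
}
```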
-The {@link org.apache.lucene.document.DateTools} and {@link org.apache.lucene.document.NumberTools} classes are utility
-classes to make dates, times and longs searchable (remember, Lucene only searches text).
-The {@link org.apache.lucene.document.FieldSelector} class provides a mechanism to tell Lucene how to load Documents from
-storage. If no FieldSelector is used, all Fieldables on a Document will be loaded. As an example of the FieldSelector usage, consider
- the common use case of
-displaying search results on a web page and then having users click through to see the full document. In this scenario, it is often
- the case that there are many small fields and one or two large fields (containing the contents of the original file). Before the FieldSelector,
-the full Document had to be loaded, including the large fields, in order to display the results. Now, using the FieldSelector, one
-can {@link org.apache.lucene.document.FieldSelectorResult#LAZY_LOAD} the large fields, thus only loading the large fields
-when a user clicks on the actual link to view the original content.
+The {@link org.apache.lucene.document.DateTools} is a utility class to make dates and times searchable
+(remember, Lucene only searches text). {@link org.apache.lucene.document.IntField}, {@link org.apache.lucene.document.LongField},
+{@link org.apache.lucene.document.FloatField} and {@link org.apache.lucene.document.DoubleField} are special helper classes
+to simplify indexing of numeric values (and also dates) for fast range queries with {@link org.apache.lucene.search.NumericRangeQuery}
+(using a special sortable string representation of numeric values).
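The numeric helper classes pair with NumericRangeQuery as described above; a hedged sketch, assuming the Lucene 4.x API (field name and bounds are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class NumericFieldExample {
  public static void main(String[] args) {
    // Index side: IntField encodes the value in the sortable trie
    // representation that NumericRangeQuery expects.
    Document doc = new Document();
    doc.add(new IntField("year", 1999, Field.Store.YES));

    // Search side: match documents whose "year" lies in [1990, 2000],
    // both endpoints inclusive.
    Query q = NumericRangeQuery.newIntRange("year", 1990, 2000, true, true);
  }
}
```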