Using Lucene to Search Java Source Code
Lucene has four different types of fields, which can be specified for optimal index creation: Keyword, UnIndexed, UnStored, and Text.
- Keyword fields are not parsed by the analyzer, but are indexed and stored in the index. JavaSourceCodeIndexer uses this field to store import declarations.
- UnIndexed fields are neither analyzed nor indexed, but their values are stored in the index, word for word. The Java file name is stored with this field, as we want to keep the location of the file but would rarely search for keywords in the file name.
- UnStored fields are the opposite of UnIndexed fields. Fields of this type are analyzed and indexed, but are not stored in the index. The source code of the method is indexed as an UnStored code field, as storing every line of code would require a large amount of space. The source code of a method can be retrieved directly from the original Java file, which keeps the index small.
- Text fields are analyzed, indexed, and stored in the index. The class name is stored as a Text field.
The fields used by JavaSourceCodeIndexer are summarized in the following table:
| Field | Type |
| Class Name | Text |
| Import Declarations | Keyword |
| Method Name | Text |
| Method Block (Code) | UnStored |
| File Name | UnIndexed |
| Method Parameter Type | Text |
| Return Type | Text |
| Comments | UnStored |
| Extends Class | Text |
| Implements | Text |
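The table can also be read in terms of the three properties that separate the four types: whether a value is analyzed, indexed, and stored. The following is a minimal sketch in plain Java rather than Lucene's own API; the names FieldPlanSketch, Kind, and plan() are hypothetical and exist only to illustrate the mapping described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the field types described above; these are NOT Lucene classes. */
final class FieldPlanSketch {

    /** Each kind records whether a value is analyzed, indexed, and stored. */
    enum Kind {
        KEYWORD(false, true, true),     // not analyzed, indexed, stored
        UNINDEXED(false, false, true),  // not analyzed, not indexed, stored
        UNSTORED(true, true, false),    // analyzed, indexed, not stored
        TEXT(true, true, true);         // analyzed, indexed, stored

        final boolean analyzed;
        final boolean indexed;
        final boolean stored;

        Kind(boolean analyzed, boolean indexed, boolean stored) {
            this.analyzed = analyzed;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    /** The field-to-type assignment from the table, in the same order. */
    static Map<String, Kind> plan() {
        Map<String, Kind> plan = new LinkedHashMap<>();
        plan.put("Class Name", Kind.TEXT);
        plan.put("Import Declarations", Kind.KEYWORD);
        plan.put("Method Name", Kind.TEXT);
        plan.put("Method Block (Code)", Kind.UNSTORED);
        plan.put("File Name", Kind.UNINDEXED);
        plan.put("Method Parameter Type", Kind.TEXT);
        plan.put("Return Type", Kind.TEXT);
        plan.put("Comments", Kind.UNSTORED);
        plan.put("Extends Class", Kind.TEXT);
        plan.put("Implements", Kind.TEXT);
        return plan;
    }

    public static void main(String[] args) {
        // Print one row per field with the three properties of its type.
        for (Map.Entry<String, Kind> entry : plan().entrySet()) {
            Kind kind = entry.getValue();
            System.out.printf("%-22s %-9s analyzed=%-5b indexed=%-5b stored=%b%n",
                    entry.getKey(), kind, kind.analyzed, kind.indexed, kind.stored);
        }
    }
}
```

Read this way, a field is searchable only if it is indexed, and its original value can be returned with a search hit only if it is stored. That is why the file name, which is only needed to locate the source file, can be UnIndexed, while the method block, which is searched but read back from the original file, can be UnStored.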
The indexes created by Lucene can be viewed and modified using Luke, a useful open source tool for understanding indexes. Luke's snapshot of the indexes created by JavaSourceCodeIndexer is shown in Figure 1.

Figure 1. Snapshot of indexes in Luke
As you can see, the import declarations are stored as is, without tokenizing or analyzing. The class names and method names are converted to lower case and stored.
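The difference between the two treatments can be illustrated with a small stand-in written in plain Java. This is only a sketch: the class TermSketch and its methods are hypothetical, and the simple split-and-lower-case rule below is an assumption standing in for whatever analyzer the indexer actually uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy illustration of keyword-style versus text-style terms; not Lucene code. */
final class TermSketch {

    /** Keyword-style: the whole value becomes a single term, unchanged. */
    static List<String> keywordTerms(String value) {
        return Collections.singletonList(value);
    }

    /** Text-style: split on non-alphanumeric characters, then lower-case each token. */
    static List<String> textTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // An import declaration kept as is, as a keyword field would keep it.
        System.out.println(keywordTerms("org.apache.lucene.document.Field"));
        // A class name lower-cased, as a text field would after analysis.
        System.out.println(textTerms("JavaSourceCodeIndexer"));
    }
}
```

With this stand-in, the import declaration stays a single term with its case and dots intact, while the class name becomes the lower-case term javasourcecodeindexer.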