File input plugin

  • Plugin version: v4.1.10
  • Released on: 2019-03-12
  • Changelog

For other versions, see the Versioned plugin docs.

Getting Help

For questions about the plugin, open a topic in the Discuss forums. For bugs or feature requests, open an issue in GitHub. For the list of Elastic supported plugins, please consult the Elastic Support Matrix.

Description

Stream events from files, normally by tailing them in a manner similar to tail -0F but optionally reading them from the beginning.

Normally, logging will add a newline to the end of each line written. By default, each event is assumed to be one line, and a line is taken to be the text before a newline character. If you would like to join multiple log lines into one event, you’ll want to use the multiline codec.

The plugin loops between discovering new files and processing each discovered file. Discovered files have a lifecycle: they start off in the "watched" or "ignored" state. Other states in the lifecycle are "active", "closed" and "unwatched".

By default, a window of 4095 files is used to limit the number of file handles in use. The processing phase has a number of stages:

  • Checks whether "closed" or "ignored" files have changed in size since last time and, if so, puts them in the "watched" state.
  • Selects enough "watched" files to fill the available space in the window; these files are made "active".
  • The active files are opened and read; each file is read from the last known position to the end of current content (EOF) by default.

In some cases it is useful to be able to control which files are read first, sorting, and whether files are read completely or banded/striped. Complete reading is all of file A, then file B, then file C, and so on. Banded or striped reading is some of file A, then file B, then file C, and so on, looping around to file A again until all files are read. Banded reading is specified by changing file_chunk_count and perhaps file_chunk_size. Banding and sorting may be useful if you want some events from all files to appear in Kibana as early as possible.
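For example, banded reading of roughly 1MB per file at a time, with the oldest files processed first, might be sketched like this (the path is illustrative):

```text
input {
  file {
    path => "/var/log/app/*.log"
    file_chunk_count => 32       # read 32 chunks per file before moving on...
    file_chunk_size => 32768     # ...of 32KB each, i.e. 1MB per band
    file_sort_by => "last_modified"
    file_sort_direction => "asc"
  }
}
```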

The plugin has two modes of operation, Tail mode and Read mode.

Tail mode

In this mode the plugin aims to track changing files and emit new content as it’s appended to each file. In this mode, files are seen as a never-ending stream of content and EOF has no special significance. The plugin always assumes that there will be more content. When files are rotated, the smaller or zero size is detected, the current position is reset to zero, and streaming continues. A delimiter must be seen before the accumulated characters can be emitted as a line.

Read mode

In this mode the plugin treats each file as if it is content complete, that is, a finite stream of lines, and now EOF is significant. A last delimiter is not needed because EOF means that the accumulated characters can be emitted as a line. Further, EOF here means that the file can be closed and put in the "unwatched" state - this automatically frees up space in the active window. This mode also makes it possible to process compressed files, as they are content complete. Read mode also allows for an action to take place after processing the file completely.
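A minimal Read mode configuration might look like the following sketch (the paths are illustrative):

```text
input {
  file {
    path => "/var/data/archive/*.gz"    # compressed files are supported in read mode
    mode => "read"
    file_completed_action => "log"      # keep the file, record its path when done
    file_completed_log_path => "/var/log/completed_files.log"
  }
}
```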

In the past, attempts to simulate a Read mode while still assuming infinite streams were not ideal, and a dedicated Read mode is an improvement.

Tracking of current position in watched files

The plugin keeps track of the current position in each file by recording it in a separate file named sincedb. This makes it possible to stop and restart Logstash and have it pick up where it left off without missing the lines that were added to the file while Logstash was stopped.

By default, the sincedb file is placed in the data directory of Logstash with a filename based on the filename patterns being watched (i.e. the path option). Thus, changing the filename patterns will result in a new sincedb file being used and any existing current position state will be lost. If you change your patterns with any frequency, it might make sense to explicitly choose a sincedb path with the sincedb_path option.

A different sincedb_path must be used for each input. Using the same path will cause issues. The read checkpoints for each input must be stored in a different path so the information is not overwritten.
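For example, two file inputs in the same pipeline should each point at their own sincedb file (the paths shown are illustrative):

```text
input {
  file {
    path => "/var/log/app/*.log"
    sincedb_path => "/var/lib/logstash/sincedb_app"
  }
  file {
    path => "/var/log/nginx/*.log"
    sincedb_path => "/var/lib/logstash/sincedb_nginx"
  }
}
```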

Files are tracked via an identifier. This identifier is made up of the inode, major device number and minor device number. On Windows, a different identifier is taken from a kernel32 API call.

Sincedb records can now be expired, meaning that read positions of older files will not be remembered after a certain time period. File systems may need to reuse inodes for new content. Ideally, we would not use the read position of old content, but we have no reliable way to detect that inode reuse has occurred. This is more relevant to Read mode, where a great many files are tracked in the sincedb. Bear in mind, though, that if a record has expired, a previously seen file will be read again.

Sincedb files are text files with four (< v5.0.0), five or six columns:

  1. The inode number (or equivalent).
  2. The major device number of the file system (or equivalent).
  3. The minor device number of the file system (or equivalent).
  4. The current byte offset within the file.
  5. The last active timestamp (a floating point number).
  6. The last known path that this record was matched to (for old sincedb records converted to the new format, this is blank).

On non-Windows systems you can obtain the inode number of a file with e.g. ls -li.

Reading from remote network volumes

The file input is not thoroughly tested on remote filesystems such as NFS, Samba, s3fs-fuse, etc.; however, NFS is occasionally tested. The file size as given by the remote FS client is used to govern how much data to read at any given time, to prevent reading into allocated but as-yet unfilled memory. Because we use the device major and minor numbers in the identifier to track "last read" positions of files, and on remount the device major and minor can change, the sincedb records may not match across remounts. Read mode might not be suitable for remote filesystems, as the file size at discovery on the client side may not be the same as the file size on the remote side, due to latency in the remote-to-client copy process.

File rotation in Tail mode

File rotation is detected and handled by this input, regardless of whether the file is rotated via a rename or a copy operation. To support programs that write to the rotated file for some time after the rotation has taken place, include both the original filename and the rotated filename (e.g. /var/log/syslog and /var/log/syslog.1) in the filename patterns to watch (the path option).

For a rename, the inode will be detected as having moved from /var/log/syslog to /var/log/syslog.1, and so the "state" is moved internally too; the old content will not be reread, but any new content on the renamed file will be read.

For copy/truncate, the content copied into a new file path, if discovered, will be treated as a new discovery and be read from the beginning. The copied file paths should therefore not be in the filename patterns to watch (the path option). The truncation will be detected and the "last read" position updated to zero.
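Following the advice above, a Tail mode configuration that watches both the live file and its most recent rotated sibling could be sketched as:

```text
input {
  file {
    path => ["/var/log/syslog", "/var/log/syslog.1"]
    mode => "tail"
  }
}
```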

File Input Configuration Options

This plugin supports the following configuration options plus the Common Options described later.

Note

Duration settings can be specified in text form, e.g. "250 ms"; this string will be converted into decimal seconds. There are quite a few supported natural and abbreviated durations; see string_duration for the details.

Setting                   Input type                                           Required
close_older               number or string_duration                            No
delimiter                 string                                               No
discover_interval         number                                               No
exclude                   array                                                No
file_chunk_count          number                                               No
file_chunk_size           number                                               No
file_completed_action     string, one of ["delete", "log", "log_and_delete"]   No
file_completed_log_path   string                                               No
file_sort_by              string, one of ["last_modified", "path"]             No
file_sort_direction       string, one of ["asc", "desc"]                       No
ignore_older              number or string_duration                            No
max_open_files            number                                               No
mode                      string, one of ["tail", "read"]                      No
path                      array                                                Yes
sincedb_clean_after       number or string_duration                            No
sincedb_path              string                                               No
sincedb_write_interval    number or string_duration                            No
start_position            string, one of ["beginning", "end"]                  No
stat_interval             number or string_duration                            No

Also see Common Options for a list of options supported by all input plugins.

 

close_older

The file input closes any files that were last read the specified duration (seconds if a number is specified) ago. This has different implications depending on whether a file is being tailed or read. If tailing, and there is a large time gap in incoming data, the file can be closed (allowing other files to be opened) but will be queued for reopening when new data is detected. If reading, the file will be closed after close_older seconds from when the last bytes were read. This setting is retained for backward compatibility if you upgrade the plugin to 4.1.0+, are reading (not tailing), and do not switch to using Read mode.
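For instance, to close files that have seen no reads for five minutes (a value chosen purely for illustration):

```text
input {
  file {
    path => "/var/log/app/*.log"
    close_older => "5 min"   # accepts a number of seconds or a string duration
  }
}
```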

delimiter

  • Value type is string
  • Default value is "\n"

Set the line delimiter; the default is "\n". Note that when reading compressed files this setting is not used; instead, the standard Windows or Unix line endings are used.
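As an illustration, events separated by a pipe character rather than newlines could be read with (the path is illustrative):

```text
input {
  file {
    path => "/var/data/records.txt"
    delimiter => "|"
  }
}
```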

discover_interval

  • Value type is number
  • Default value is 15

How often we expand the filename patterns in the path option to discover new files to watch. This value is a multiple of stat_interval; e.g. if stat_interval is "500 ms", then new files could be discovered every 15 × 500 milliseconds, i.e. every 7.5 seconds. In practice, this is the best case, because the time taken to read new content needs to be factored in.

exclude

  • Value type is array
  • There is no default value for this setting.

Exclusions (matched against the filename, not the full path). Filename patterns are valid here, too. For example, if you have

path => "/var/log/*"

In Tail mode, you might want to exclude gzipped files:

exclude => "*.gz"

file_chunk_count

  • Value type is number
  • Default value is 4611686018427387903

When combined with file_chunk_size, this option sets how many chunks (bands or stripes) are read from each file before moving to the next active file. For example, a file_chunk_count of 32 and a file_chunk_size of 32KB will process the next 1MB from each active file. As the default is very large, the file is effectively read to EOF before moving to the next active file.

file_chunk_size

  • Value type is number
  • Default value is 32768 (32KB)

File content is read off disk in blocks or chunks, and lines are extracted from the chunk. See file_chunk_count for why and when to change this setting from the default.

file_completed_action

  • Value can be any of: delete, log, log_and_delete
  • The default is delete.

When in Read mode, this determines what action should be carried out when a file is completely read. If delete is specified, the file will be deleted. If log is specified, the full path of the file is logged to the file specified in the file_completed_log_path setting. If log_and_delete is specified, both of the above actions take place.

file_completed_log_path

  • Value type is string
  • There is no default value for this setting.

The file to which the paths of completely read files should be appended. Only specify this path when file_completed_action is log or log_and_delete. IMPORTANT: this file is only ever appended to - it could become very large. You are responsible for file rotation.

file_sort_by

  • Value can be any of: last_modified, path
  • The default is last_modified.

Which attribute of a "watched" file should be used to sort files by. Files can be sorted by modified date or alphabetically by full path. Previously, the processing order of the discovered and therefore "watched" files was OS dependent.

file_sort_direction

  • Value can be any of: asc, desc
  • The default is asc.

Select between ascending and descending order when sorting "watched" files. If oldest data first is important, then the defaults of last_modified + asc are good. If newest data first is more important, then opt for last_modified + desc. If you use special naming conventions for the file full paths, then perhaps path + asc will help to control the order of file processing.

ignore_older

When the file input discovers a file that was last modified before the specified duration (seconds if a number is specified), the file is ignored. After its discovery, if an ignored file is modified, it is no longer ignored and any new data is read. By default, this option is disabled. Note this unit is in seconds.
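For example, to skip files untouched for more than a day while still importing recent ones from the beginning (the path is illustrative):

```text
input {
  file {
    path => "/var/log/app/*.log"
    ignore_older => "1 day"
    start_position => "beginning"
  }
}
```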

max_open_files

  • Value type is number
  • There is no default value for this setting.

The maximum number of file handles that this input consumes at any one time. Use close_older to close some files if you need to process more files than this number. This should not be set to the maximum the OS can handle, because file handles are needed for other LS plugins and OS processes. A default of 4095 is set internally.

mode

  • Value can be either tail or read.
  • The default value is tail.

What mode you want the file input to operate in: tail a few files or read many content-complete files. Read mode now supports gzip file processing. If "read" is specified, then the following other settings are ignored:

  1. start_position (files are always read from the beginning)
  2. close_older (files are automatically closed when EOF is reached)

If "read" is specified then the following settings are heeded:

  1. ignore_older (older files are not processed)
  2. file_completed_action (what action should be taken when the file is processed)
  3. file_completed_log_path (which file should the completed file path be logged to)

path

  • This is a required setting.
  • Value type is array
  • There is no default value for this setting.

The path(s) to the file(s) to use as an input. You can use filename patterns here, such as /var/log/*.log. If you use a pattern like /var/log/**/*.log, a recursive search of /var/log will be done for all *.log files. Paths must be absolute and cannot be relative.

You may also configure multiple paths. See an example on the Logstash configuration page.
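A sketch of a multi-path configuration, mixing a literal path with a recursive pattern (both illustrative):

```text
input {
  file {
    path => ["/var/log/messages", "/var/log/app/**/*.log"]
  }
}
```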

sincedb_clean_after

  • Value type is number or string_duration
  • The default value for this setting is "2 weeks".
  • If a number is specified then it is interpreted as days and can be decimal e.g. 0.5 is 12 hours.

The sincedb record now has a last active timestamp associated with it. If no changes are detected in a tracked file in the last N days, its sincedb tracking record expires and will not be persisted. This option helps protect against the inode recycling problem. Filebeat has a FAQ about inode recycling.
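For example, to expire tracking records after four days instead of the two-week default (the path is illustrative):

```text
input {
  file {
    path => "/var/log/app/*.log"
    sincedb_clean_after => "4 days"   # or a plain number, interpreted as days
  }
}
```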

sincedb_path

  • Value type is string
  • There is no default value for this setting.

Path of the sincedb database file (which keeps track of the current position of monitored log files) that will be written to disk. The default will write sincedb files to <path.data>/plugins/inputs/file. NOTE: it must be a file path and not a directory path.

sincedb_write_interval

How often (in seconds) to write the sincedb database with the current position of monitored log files.

start_position

  • Value can be any of: beginning, end
  • Default value is "end"

Choose where Logstash starts initially reading files: at the beginning or at the end. The default behavior treats files like live streams and thus starts at the end. If you have old data you want to import, set this to beginning.

This option only modifies "first contact" situations where a file is new and not seen before, i.e. files that don’t have a current position recorded in a sincedb file read by Logstash. If a file has already been seen before, this option has no effect and the position recorded in the sincedb file will be used.
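A common pattern for importing pre-existing data combines start_position with a sincedb_path pointing at the null device, so that read positions are never persisted and the files are re-read on every run (this assumes a Unix-like host; the log path is illustrative):

```text
input {
  file {
    path => "/var/log/old-data/*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"   # discard read positions between runs
  }
}
```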

stat_interval

How often (in seconds) we stat files to see if they have been modified. Increasing this interval will decrease the number of system calls we make, but increase the time to detect new log lines.

Note

Discovering new files and checking whether they have grown or shrunk occurs in a loop. This loop will sleep for stat_interval seconds before looping again. However, if files have grown, the new content is read and lines are enqueued. Reading and enqueuing across all grown files can take time, especially if the pipeline is congested. So the overall loop time is a combination of the stat_interval and the file read time.

Common Options

The following configuration options are supported by all input plugins:

Setting         Input type   Required
add_field       hash         No
codec           codec        No
enable_metric   boolean      No
id              string       No
tags            array        No
type            string       No

Details

 

add_field

  • Value type is hash
  • Default value is {}

Add a field to an event.

codec

  • Value type is codec
  • Default value is "plain"

The codec used for input data. Input codecs are a convenient method for decoding your data before it enters the input, without needing a separate filter in your Logstash pipeline.

enable_metric

  • Value type is boolean
  • Default value is true

Disable or enable metric logging for this specific plugin instance. By default we record all the metrics we can, but you can disable metrics collection for a specific plugin.

id

  • Value type is string
  • There is no default value for this setting.

Add a unique ID to the plugin configuration. If no ID is specified, Logstash will generate one. It is strongly recommended to set this ID in your configuration. This is particularly useful when you have two or more plugins of the same type, for example, if you have 2 file inputs. Adding a named ID in this case will help in monitoring Logstash when using the monitoring APIs.

input {
  file {
    id => "my_plugin_id"
  }
}

tags

  • Value type is array
  • There is no default value for this setting.

Add any number of arbitrary tags to your event.

This can help with processing later.

type

  • Value type is string
  • There is no default value for this setting.

Add a type field to all events handled by this input.

Types are used mainly for filter activation.

The type is stored as part of the event itself, so you can also use the type to search for it in Kibana.

If you try to set a type on an event that already has one (for example when you send an event from a shipper to an indexer), then a new input will not override the existing type. A type set at the shipper stays with that event for its life, even when sent to another Logstash server.

String Durations

The format is a number followed by a unit string, and the space between them is optional. So "45s" and "45 s" are both valid.

Tip

Use the most suitable duration, for example, "3 days" rather than "72 hours".
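String durations like these can be used directly in the duration-typed settings described earlier, for example (the path is illustrative):

```text
input {
  file {
    path => "/var/log/app/*.log"
    stat_interval => "250 ms"
    close_older => "1 hour"
    ignore_older => "3 days"
  }
}
```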

Weeks

Supported values: w week weeks, e.g. "2 w", "1 week", "4 weeks".

Days

Supported values: d day days, e.g. "2 d", "1 day", "2.5 days".

Hours

Supported values: h hour hours, e.g. "4 h", "1 hour", "0.5 hours".

Minutes

Supported values: m min minute minutes, e.g. "45 m", "35 min", "1 minute", "6 minutes".

Seconds

Supported values: s sec second seconds, e.g. "45 s", "15 sec", "1 second", "2.5 seconds".

Milliseconds

Supported values: ms msec msecs, e.g. "500 ms", "750 msec", "50 msecs".

Note

milli, millis and milliseconds are not supported.

Microseconds

Supported values: us usec usecs, e.g. "600 us", "800 usec", "900 usecs".

Note

micro, micros and microseconds are not supported.