Elasticsearch Performance and Tuning

A dedicated performance course run by Matt Gregory from Elastic, an absolute legend with deep Elasticsearch expert. Contents Cool takeaways Tuning for Index Speed Increase the refresh interval Index architecting Bulk Hardware settings to improve performance Disable swapping Indexing Buffer size Best practices and scaling Disable replics for initial loads Use auto-generated IDs Use Cross Cluster Replication Thread Pools Memory Locking Transforms Tuning for search API settings and data modelling to improve search performance Search as few fields as possible One big copy_to field as opposed to individual text multi field Consider mapping identifiers as keywords Document modeling Consider mapping numeric fields as keyword Hardware settings to improve search Warm Up Global Ordinals Warm up filesystem Cache Use index sorting to speed up search Ways to improve searches must and should clauses filter and must not clauses node query cache shard request cache Aggregation performance Search rounded dates Force merge read only indices Search profiler and Explain API Search profiler Search profiler API ID Query section Timing breakdown Collection section Collectors reasons Rewrite section Explain and Tasks API Explain API Score Field length normalization and coordindation Other Query Parameters API Settings to improve indexing performance Hardware settings to improve performance Best Practices and scaling Transforms Cool takeaways Increase the refresh_interval from default 1s to something higher, like 10s. Index typings should be set to strict (default is dynamic) The took param measures raw cluster operation speed, kibana will also reveal a roundtrip time which includes the HTTP layer. Auto generated id’s are always faster One of Matt’s favourite APIs _cluster/allocation/explain Ensure the heap is beefed up a must clause is the first line of defence for scoring, should is then used as the second pass of scoring always format queries as a ‘bool’ Configuration management everywhere (Ansible, etc) dedicated monitoring cluster Tuning for Index Speed Cheatsheet: ...

June 7, 2024 · 14 min

Elasticsearch Engineer 8.1

Revised 2024 edition based on Elasticsearch 8.1. Recently the opportunity to attend the latest revision of the 4-day Elasticsearch engineer course, which I did in-person about 5 years ago in Sydney. Elasticsearch has often been an integral part of the data solutions I’ve been involved with and I’m quite fond of it. This time round the course only runs in a virtual class room format (using strigo.io) with our awesome trainers Krishna Shah and Kiju Kim. ...

June 2, 2024 · 60 min

Kustomize

Kustomize is built into kubectl with -k. Great samples on kubernetes.io/docs Kustomize provides a template-free way to customize kubernetes manifests Contents: Generating resources Setting cross cutting fields Composing and customizing resources Composing Customizing Patches Images Replacements Reference In a nutshell provides 3 key features: generating resources from other sources setting cross-cutting fields for resources composing and customizing collections of resources Generating resources To generate a ConfigMap from an .env file, add an entry to the envs list in configMapGenerator. Kustomize supports other formats such as .properties. ...

May 3, 2024 · 3 min

Python Packaging (2024)

sysconfig Project Structure Setup Script Distribution Archives Distribution sysconfig The process of bundling Python code into a format that eases distribution and sharing. First up, I find it helps to get a concrete understanding of how the specific python distro I’m working with is configured, of paricular interest are the various system paths that will be visited for package dependencies. The built-in sysconfig module neatly manages and surfaces this information. python -m sysconfig On a Windows system: ...

December 29, 2023 · 3 min

Kubernetes Certified Administrator (CKA) 2024

CKA topics Kubernetes in a nutshell Lab environment kubeadm init sample output Buliding kubernetes clusters Networking kubeadm kubectl Contexts Resources CKA topics Cluster Architecture, Installation & Configuration: How to set up and configure a Kubernetes cluster, including how to install and configure a Kubernetes cluster using kubeadm, how to upgrade your cluster version, how to backup and restore an etcd cluster, and how to configure a pod to use secrets Workloads & Scheduling: How to deploy a Kubernetes application, create daemonsets, scale the application, configure health checks, use multi-container pods, and use config maps and secrets in a pod. You’ll also need to know how to expose your application using services Services & Networking: How to expose applications within the cluster or outside the cluster, how to manage networking policies, and how to configure ingress controllers Storage: How to create and configure persistent volumes, how to create and configure persistent volume claims, and how to expand persistent volumes Troubleshooting: How to troubleshoot common issues in a Kubernetes environment, including how to diagnose and resolve issues with pods, nodes, and network traffic Kubernetes in a nutshell Control plane management components that mother-hen nodes and pods. Key components: ...

December 22, 2023 · 7 min

Python PR Checklist

A modified version of excellent original checklist by Paul Wolf General Code is blackened with black ruff has been run with no errors mypy has been run with no errors Function complexity problems have been resolved using the default complexity index of flake8. Important core code can be loaded in iPython, ipdb easily. There is no dead code Comprehensions or generator expressions are used in place of for loops where appropriate Comprehensions and generator expressions produce state but they do not have side effects within the expression. Use zip(), any(), all(), filter(), etc. instead of for loops where appropriate Functions that take as parameters and mutate mutable variables don’t return these variables. They return None. Return immutable copies of mutable types instead of mutating the instances themselves when mutable types are passed as parameters with the intention of returning a mutated version of that variable. Avoid method cascading on objects with methods that return self. Function and method parameters never use empty collection or sequence instances like list [] or dict {}. Instead they must use None to indicate missing input Variables in a function body are initialised with empty sequences or collections by callables, list(), dict(), instead of [], {}, etc. Always use the Final type hint for class instance parameters that will not change. Context-dependent variables are not unnecessarily passed between functions or methods View functions either implement the business rules the view is repsonsible for or it passes data downstream to have this done by services and receives non-context dependent data back. View functions don’t pass request to called functions Functions including class methods don’t have too many local parameters or instance variables. Especially a class’ __init__() should not have too many parameters. Profiling code is minimal Logging is the minimum required for production use There are no home-brewed solutions for things that already exist in the PSL (python standard library) n00b habbits Bare except clause, Python uses exceptions to flag system level interupts such as sigkills. Don’t do this. Argument default mutatable arguments such as def foo(bar=[]) are defined when the function is defined, not when its run, and will result in a all function calls sharing the same instance of bar Checking for equality using ==. Due to inheritance this is not desirable as it pins to a concrete type and not potentially it descendents. In other words the Liskov substitution principle. Instead isinstance(p, tuple) Explicit bool or length checks, such as if bool(x) or if len(x) > 0 is redundant, as Python has sane truthy evaluation. Use of range over the for in idiom If you really need the index, always use enumerate Not using items() on a dict for k, v in dict.items()) Using time.time to measure code performance. Use time.perf_counter instead. Using print statements over logging Using import * will normally liter the namespace with variable. Dont be lazy, be specific. from itertools import count Imports and modules Imports are sorted by isort or according to some standard that is consistent within the team Import packages or modules to qualify the use of functions or classes so that unqualified function calls can be assumed to be to functions in the current module Documentation Modules have docstrings Classes have docstrings unless their purpose is immediately obvious Methods and functions have docstrings Comments and docstrings add non-obvious and helpful information that is not already present in the naming of functions and variables General Complexity Functions as complex as they need to be but no more (as defined by flake8\ ’s default complexity threshold) Classes have only as many methods as required and have a simple hierarchy Context Freedom All important functionality can be loaded easily in ipython without having to construct dummy requests, etc. All important functionality can be loaded in pdb (or a variant, ipdb, etc.) Types Use immutable types ()tuple, frozenset, Enum, etc) over mutable types whenever possible Functions Functions are pure wherever possible, i.e. they take input and provide a return value with no side-effects or reliance on hidden state. Modules Module level variables do not take context-dependent values like connection clients to remote systems unless the client is used immediately for another module level variable and not used again Classes Every class has a single well-defined purpose. That is, the class does not mix up different tasks, like remote state acquisition, web sockets notification, data formatting, etc. Classes manage state and do not just represent the encapsulation of behaviour All methods access either cls or self in the body. If a method does not access cls or self, it should be a function at module level. @classmethod is used in preference to @staticmethod but only if the method body accesses cls otherwise the method should be a module level function. Constants are declared at module level not in methods or class level Constants are always upper case Abstract classes are derived from abc: from abc import ABC Abstract methods use the @abstractmethod decorator Abstract class properties use both @abstractmethod and @property decorators Classes do not use multiple inheritance Classes do not use mixins (use composition instead) except in rare cases Class names do not use the word “Base” to signal they are the single ancestor, like “BaseWhatever” Decorators are not used to replace classes as a design pattern __init__() does not define too many local variables. Use the Parameter Consolidation pattern instead. A factory class or function at module level is used for complex class construction (see Design Patterns) to achieve composition Classes are not dynamically created from strings except where forward reference requires this Design Patterns Do not use designs that cause a typical Python developer to have to learn new semantics that are unexpected in Python Classes primarily use composition in preference to inheritance Beyond a very small number of simple variables, a class’ purpose is to acquire state for another class or it uses another class to acquire state in particular if the state is from a remote service. If you use the Context Parameter pattern, it is critical that the state of the context does not change after calling its __init__(), i.e. it should be immutable If a class’ purpose is to represent an external integration, you probably want numerous classes to compose the service: RemoteDataClient, DomainManager, ContextManager, Factory, NotificationController, DomainResponse, DataFormatter and so on.

October 30, 2023 · 5 min

Async Python

Background Using asyncio will not make your code multi-threaded. That is, it will not cause multiple Python instructions to be executed at the same time, and it will not in any way allow you to side step the so-called “global interpreter lock” (GIL). Some processes are CPU-bound: they consist of a series of instructions which need to be executed one after another until the result has been computed. Most of their time is spent making heavy use of the processor. ...

August 9, 2023 · 9 min

Python Type Annotations

Start with the docs and the Type hints cheat sheet Topics for consideration: syntax shorthands e.g. | for Union or Optional Self If you are using the typing library then there is an abstract type class provided for asynchronous context managers AsyncContextManager[T], where T is the type of the object which will be bound by the as clause of the async with statement. mypy If you are using typing then there is an abstract class Awaitable which is generic, so that Awaitable[R] for some type R means anything which is awaitable, and when used in an await statement will return something of type R. ...

August 9, 2023 · 1 min

Python Standard Libraries

An important part of becoming “good” at a language is becoming familiar with its library eco-system. The official Python Standard Library reference manual rocks. Module Category Description argparse functions for parsing command line arguments atexit allows you to register functions for your program to call when it exits bisect bisection algorithms for sorting lists (see Chapter 10) calendar a number of date-related functions codecs functions for encoding and decoding data collections a variety of useful data structures concurrent asynchronous computation copy functions for copying data csv functions for reading and writing CSV files datetime classes for handling dates and times fileinput file access iterate over lines from multiple files or input streams fnmatch functions for matching Unix-style filename patterns glob functions for matching Unix-style path patterns io functions for handling I/O streams and StringIO, which allows you to treat strings as files. json functions for reading and writing data in JSON format logging access to Python’s own built-in logging functionality multiprocessing allows you to run multiple subprocesses, while providing an API that makes them look like threads operator functions implementing the basic Python operators, instead of writing your own lambda expressions os swiss army knife access to basic OS functions pprint data types data pretty printer random functions for generating pseudorandom numbers re regular expression functionality sched an event scheduler without using multithreading select access to the select() and poll() functions for creating event loops shutil file access access to high-level file functions signal functions for handling POSIX signals tempfile file access functions for creating temporary files and directories threading access to high-level threading functionality urllib provides functions for handling and parsing URLs uuid allows you to generate Universally Unique Identifiers (UUIDs)

August 4, 2023 · 2 min

Objects in Python

Special methods (dunders) Foundational Iterators Compariable classes Serializable classes Classes with computed attributes Classes that are callable Classes that act like sets Classes that act like dictionaries Classes that act like numbers Classes that can be used in a with block Esoteric behavior Design Patterns As I learn more about Pythons idioms reflect on its unique approach to object based programming. In combination with duck typing its approach to objects feels distrubingly flexible. ...

August 3, 2023 · 9 min