Excerpted from my notes. The project is quite large; I hope this gives fellow readers some inspiration. This is mainly a source-code walkthrough of selected modules.
Architecture
The project structure is organized into several directories and files, each serving a specific purpose. Here's a brief overview of the main components:
- cmd/
  - metric-adapter/: Contains code related to the metric-adapter component
  - craned/: Contains the main application code for the `craned` command
    - app/: Likely contains application-specific logic and configuration
    - `main.go`: The entry point for the `craned` application
  - crane-agent/: Contains code related to the crane-agent component
- deploy/: Deployment scripts and configurations for deploying the application
- docs/: Documentation files for the project
- examples/: Example configurations and usage examples for the project
- hack/: Scripts and utilities for development and maintenance tasks
- overrides/: Possibly contains configuration overrides or custom settings
- pkg/: The core library code, which can be used by other applications or commands
- site/: Likely related to the project's website or documentation site
- tools/: Additional tools and scripts that support the project
- Files:
  - `go.mod` and `go.sum`: Go module files that manage dependencies
  - `README.md` and `README_zh.md`: Documentation providing an overview of and instructions for the project
  - `Makefile`: Build instructions and tasks for the project
  - `Dockerfile`: Instructions for building a Docker image of the application
  - `.gitignore`: Files and directories to be ignored by Git
  - `.golangci.yaml`: Configuration for golangci-lint, a Go linter
  - `CONTRIBUTING.md` and `CODE_OF_CONDUCT.md`: Guidelines for contributing to the project and expected behavior
Manager
- Command Initialization: `NewManagerCommand(ctx context.Context) *cobra.Command`:
  - Creates a new Cobra command for `craned`
  - Initializes options using `options.NewOptions()`
  - Sets up command flags and feature gates
- Command Execution: the `Run` function within the Cobra command is executed:
  - Calls `opts.Complete()` to complete option setup
  - Validates options with `opts.Validate()`
  - Executes `Run(ctx, opts)` to start the manager
- Run Function: `Run(ctx context.Context, opts *options.Options) error`:
  - Configures the controller manager with `ctrl.GetConfigOrDie()`
  - Sets up controller options, including leader election and metrics binding
  - Creates a new manager with `ctrl.NewManager(config, ctrlOptions)`
- Health Checks:
  - Adds health and readiness checks using `mgr.AddHealthzCheck` and `mgr.AddReadyzCheck`
- Data Sources and Predictor Initialization:
  - `initDataSources(mgr, opts)`: initializes real-time and historical data sources
  - `initPredictorManager(opts, realtimeDataSources, historyDataSources)`: sets up the predictor manager using the initialized data sources
- Scheme and Indexer Initialization:
  - `initScheme()`: registers API types with the runtime scheme
  - `initFieldIndexer(mgr)`: sets up field indexers for efficient querying
- Webhooks Initialization:
  - `initWebhooks(mgr, opts)`: configures webhooks if necessary
- Pod OOM Recorder:
  - Initializes `oom.PodOOMRecorder` to track out-of-memory events
  - Sets up the recorder with the manager and runs it in a separate goroutine
- Controller Initialization:
  - `initControllers(ctx, podOOMRecorder, mgr, opts, predictorMgr, historyDataSources[providers.PrometheusDataSource])`: initializes various controllers, including the analytics and recommendation controllers
- Metric Collector Initialization:
  - `initMetricCollector(mgr)`: sets up custom metrics collection
- Running All Components:
  - `runAll(ctx, mgr, predictorMgr, dataSourceProviders[providers.PrometheusDataSource], opts)`: starts all components, ensuring that the manager and predictor manager are running (a minimal sketch of this startup flow follows below)
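To make the flow above concrete, here is a minimal, hypothetical sketch of a `craned`-style entry point built from Cobra and controller-runtime. The `Options` type, flag names, and leader-election ID are placeholders; only `ctrl.GetConfigOrDie`, `ctrl.NewManager`, and the health/readiness check registration mirror the calls listed above, and the data source, predictor, webhook, and controller initialization steps are omitted.

```go
// Hypothetical sketch of the craned startup flow; not the project's actual code.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/spf13/cobra"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// Options stands in for the real options.Options type.
type Options struct {
	LeaderElect bool
}

func (o *Options) Complete() error { return nil }
func (o *Options) Validate() error { return nil }

// NewManagerCommand builds the Cobra command: complete and validate the
// options, then hand off to Run.
func NewManagerCommand(ctx context.Context) *cobra.Command {
	opts := &Options{}
	cmd := &cobra.Command{
		Use: "craned",
		RunE: func(cmd *cobra.Command, args []string) error {
			if err := opts.Complete(); err != nil {
				return err
			}
			if err := opts.Validate(); err != nil {
				return err
			}
			return Run(ctx, opts)
		},
	}
	cmd.Flags().BoolVar(&opts.LeaderElect, "leader-elect", true, "enable leader election")
	return cmd
}

// Run builds the controller-runtime manager, registers health and readiness
// checks, and starts it. Data sources, predictors, webhooks, and controllers
// would be initialized here in the real code.
func Run(ctx context.Context, opts *Options) error {
	config := ctrl.GetConfigOrDie()
	mgr, err := ctrl.NewManager(config, ctrl.Options{
		LeaderElection:   opts.LeaderElect,
		LeaderElectionID: "craned-sketch", // placeholder ID
	})
	if err != nil {
		return err
	}
	if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
		return err
	}
	if err := mgr.AddReadyzCheck("ping", healthz.Ping); err != nil {
		return err
	}
	return mgr.Start(ctx)
}

func main() {
	if err := NewManagerCommand(context.Background()).Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```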
Controllers
PodOOMRecorder
- SetupWithManager: `SetupWithManager(mgr ctrl.Manager) error`:
  - Registers the `PodOOMRecorder` with the controller manager
  - Sets up the reconciler to watch pod events and trigger the `Reconcile` function
- Reconcile Function: `Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)`:
  - Triggered by the controller manager for each pod event
  - Retrieves the pod specified in the request
  - Checks whether the pod was terminated by an OOM event using `IsOOMKilled(pod)`
  - If OOM-killed, iterates over the container statuses to find the terminated container
  - For each OOM-killed container, checks whether memory requests are set
  - Increments the OOM count metric and adds an `OOMRecord` to the queue
- Run Function: `Run(stopCh <-chan struct{}) error`:
  - Continuously processes OOM records from the queue
  - Uses a ticker to periodically clean old OOM records so the ConfigMap storage limit is not exceeded
  - Retrieves OOM records using `GetOOMRecord`
  - Updates OOM records with `updateOOMRecord`
- GetOOMRecord Function: `GetOOMRecord() ([]OOMRecord, error)`:
  - Retrieves OOM records from the cache or from the ConfigMap
  - If the cache is nil, fetches the ConfigMap and unmarshals the OOM records
- cleanOOMRecords Function: `cleanOOMRecords(oomRecords []OOMRecord) []OOMRecord`:
  - Cleans up old OOM records to stay within the maximum number specified by `OOMRecordMaxNumber`
- updateOOMRecord Function: `updateOOMRecord(oomRecord OOMRecord, saved []OOMRecord) error`:
  - Updates the OOM records in the cache or storage with new records
- IsOOMKilled Function: `IsOOMKilled(pod *v1.Pod) bool`:
  - Helper function that determines whether a pod was terminated due to an OOM event
Detailed Flow
- Initialization: the `PodOOMRecorder` is set up with the manager via `SetupWithManager`, which registers it to watch pod events
- Event Handling: for each pod event, `Reconcile` is called to check for OOM events and record them
- Record Management: the `Run` function processes the queue of OOM records, periodically cleans old records, and updates the storage
- Data Retrieval: `GetOOMRecord` fetches the current OOM records, either from the cache or from the ConfigMap

This covers the `PodOOMRecorder`'s functionality and its role in tracking OOM events in a Kubernetes cluster; a simplified sketch of the reconcile step follows below.
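Below is a hypothetical, self-contained sketch of what the reconcile step described above could look like. `OOMRecord`, the in-memory queue, and the field names are stand-ins for the project's real types; only the overall shape (fetch the pod, skip pods that were not OOM-killed, record OOM-killed containers that declare a memory request) follows the walkthrough, and the pod-level check is inlined for brevity.

```go
// Hypothetical sketch of the PodOOMRecorder reconcile step; types and field
// names are stand-ins for the project's own definitions.
package oom

import (
	"context"
	"sync"
	"time"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// OOMRecord captures one observed OOM kill (assumed shape).
type OOMRecord struct {
	Namespace string
	Pod       string
	Container string
	OOMAt     time.Time
}

// PodOOMRecorder watches pods and queues OOM records for later persistence.
type PodOOMRecorder struct {
	client.Client

	mu    sync.Mutex
	queue []OOMRecord
}

func (r *PodOOMRecorder) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Retrieve the pod referenced by the request.
	pod := &v1.Pod{}
	if err := r.Get(ctx, req.NamespacedName, pod); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}

	// Record every container whose last termination was an OOM kill and
	// whose spec declares a memory request.
	for _, cs := range pod.Status.ContainerStatuses {
		term := cs.LastTerminationState.Terminated
		if cs.RestartCount == 0 || term == nil || term.Reason != "OOMKilled" {
			continue
		}
		for _, c := range pod.Spec.Containers {
			if c.Name != cs.Name {
				continue
			}
			if _, ok := c.Resources.Requests[v1.ResourceMemory]; !ok {
				continue
			}
			r.mu.Lock()
			r.queue = append(r.queue, OOMRecord{
				Namespace: pod.Namespace,
				Pod:       pod.Name,
				Container: cs.Name,
				OOMAt:     term.FinishedAt.Time,
			})
			r.mu.Unlock()
		}
	}
	return ctrl.Result{}, nil
}

// SetupWithManager registers the recorder so it is reconciled on pod events.
func (r *PodOOMRecorder) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&v1.Pod{}).Complete(r)
}
```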
GetOOMRecord Function
- Mutex Locking:
  - The function begins by acquiring a lock with `r.mu.Lock()` to ensure thread-safe access to the cache, and releases it with `defer r.mu.Unlock()`
- Cache Check:
  - It checks whether the cache is `nil`. If the cache is not initialized, it proceeds to fetch the OOM records from a Kubernetes ConfigMap
- ConfigMap Retrieval:
  - Uses the Kubernetes client (`r.Client`) to get the ConfigMap that stores OOM records
  - The ConfigMap is identified by its namespace (`known.CraneSystemNamespace`) and name (`ConfigMapOOMRecordName`)
- Error Handling:
  - If the ConfigMap is not found (`apierrors.IsNotFound(err)`), it returns `nil`, indicating no records are available
  - For other errors, it returns the error to the caller
- Data Unmarshalling:
  - If the ConfigMap is found, it retrieves the data field (`ConfigMapDataOOMRecord`) containing the OOM records
  - Uses `json.Unmarshal` to convert the JSON data into a slice of `OOMRecord` structs
- Return Records:
  - Returns the unmarshalled OOM records, or the cached records if the cache is already initialized
Kubernetes Components and Principles
- Kubernetes Client:
  - The function uses the `client.Client` interface from controller-runtime (built on client-go) to interact with the Kubernetes API server
  - This client provides methods to perform CRUD operations on Kubernetes resources, such as ConfigMaps
- ConfigMap:
  - A ConfigMap is a Kubernetes resource used to store non-confidential data in key-value pairs
  - In this context, it is used to persist OOM records across pod or application restarts
- Concurrency:
  - A mutex (`sync.Mutex`) ensures that access to shared resources (like the cache) is synchronized, preventing race conditions
- Error Handling:
  - The function handles errors gracefully, distinguishing not-found errors from other error types to provide appropriate responses

This explains how `GetOOMRecord` retrieves OOM records using Kubernetes primitives and basic concurrency control; a sketch follows below.
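As a rough illustration, here is a hypothetical, self-contained version of the flow just described: lock, return the cache if present, otherwise read the ConfigMap and unmarshal its data. The namespace, ConfigMap name, and data key constants are assumed values, not the project's actual ones.

```go
// Hypothetical sketch of GetOOMRecord; constants and types are assumptions.
package oom

import (
	"context"
	"encoding/json"
	"sync"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	craneSystemNamespace   = "crane-system" // assumed value of known.CraneSystemNamespace
	configMapOOMRecordName = "oom-record"   // assumed value of ConfigMapOOMRecordName
	configMapDataOOMRecord = "oom-records"  // assumed value of ConfigMapDataOOMRecord
)

// OOMRecord is a stand-in for the persisted record type.
type OOMRecord struct {
	Namespace string `json:"namespace"`
	Pod       string `json:"pod"`
	Container string `json:"container"`
}

type PodOOMRecorder struct {
	client.Client

	mu    sync.Mutex
	cache []OOMRecord
}

// GetOOMRecord returns cached records if present, otherwise loads them from
// the ConfigMap that persists OOM records across restarts.
func (r *PodOOMRecorder) GetOOMRecord() ([]OOMRecord, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.cache != nil {
		return r.cache, nil
	}

	cm := &v1.ConfigMap{}
	err := r.Get(context.TODO(), types.NamespacedName{
		Namespace: craneSystemNamespace,
		Name:      configMapOOMRecordName,
	}, cm)
	if err != nil {
		if apierrors.IsNotFound(err) {
			// No ConfigMap yet means nothing has been persisted.
			return nil, nil
		}
		return nil, err
	}

	var records []OOMRecord
	if err := json.Unmarshal([]byte(cm.Data[configMapDataOOMRecord]), &records); err != nil {
		return nil, err
	}
	return records, nil
}
```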
IsOOMKilled Function
- Function Purpose:
  - `IsOOMKilled` checks whether any container within a given pod was terminated due to an OOM event
- Iterating Over Container Statuses:
  - The function iterates over the `ContainerStatuses` slice in the pod's status, which contains the status of each container in the pod
- Checking Conditions:
  - For each `containerStatus`, it checks the following:
    - Restart Count: the `RestartCount` must be greater than 0, indicating that the container has been restarted
    - Termination State: `LastTerminationState.Terminated` must not be `nil`, meaning the container has a recorded termination state
    - Termination Reason: the termination `Reason` must be `"OOMKilled"`, indicating that the container was terminated due to an OOM event
- Return Value:
  - If all of the above conditions are met for any container, the function returns `true`, indicating that the pod was OOM-killed
  - If no container meets these conditions, the function returns `false`
Kubernetes Components and Principles
- Pod Status:
  - The function uses the `v1.Pod` type from the Kubernetes API, which includes the status of the pod and its containers
  - The `ContainerStatuses` field provides detailed information about each container's state, including termination details
- OOMKilled Reason:
  - `"OOMKilled"` is a standard Kubernetes termination reason indicating that a container was terminated for exceeding its memory limit

This explains how `IsOOMKilled` uses pod status information to determine whether a pod was terminated by an OOM event; a compact sketch follows below.
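A compact sketch of a helper with this behavior, built only from the three conditions listed above; it paraphrases the walkthrough rather than reproducing the project's source verbatim.

```go
// Minimal sketch of the IsOOMKilled check described above.
package oom

import v1 "k8s.io/api/core/v1"

// IsOOMKilled reports whether any container in the pod was last terminated
// because it exceeded its memory limit.
func IsOOMKilled(pod *v1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		// The container must have restarted at least once.
		if cs.RestartCount == 0 {
			continue
		}
		// There must be a recorded termination state.
		term := cs.LastTerminationState.Terminated
		if term == nil {
			continue
		}
		// The recorded reason must be the standard OOM-kill reason.
		if term.Reason == "OOMKilled" {
			return true
		}
	}
	return false
}
```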
MetricCollector
CraneMetricCollector Structure
- Purpose:
  - The `CraneMetricCollector` is responsible for collecting and exposing metrics related to autoscaling and predictions in a Kubernetes environment
- Fields:
  - Client: a Kubernetes client for interacting with the cluster
  - Scaler: a `scale.ScalesGetter` for retrieving scale subresources
  - RESTMapper: a `meta.RESTMapper` for mapping resources to their REST paths
  - Prometheus descriptors:
    - `metricAutoScalingCron`: describes external metrics for cron-based HPA
    - `metricAutoScalingPrediction`: describes external metrics for prediction-based HPA
    - `metricPredictionTsp`: describes model metrics for time series prediction (TSP)
    - `metricMetricRuleError`: describes metrics related to metric rule errors
- PredictionMetric Structure:
  - Represents a prediction metric, with fields for the description, target details, algorithm, metric value, and timestamp
Functions
- NewCraneMetricCollector:
  - Initializes a new `CraneMetricCollector` with the provided Kubernetes client, scale client, and REST mapper
- Describe:
  - Implements the `prometheus.Collector` interface
  - Sends the descriptor of each metric to the provided channel
- Collect:
  - Implements the `prometheus.Collector` interface
  - Collects metrics and sends them to the provided channel
  - Calls helper functions to gather specific metrics:
    - `getMetricsTsp`: collects metrics related to time series predictions
    - `getMetricsCron`: collects metrics related to cron-based autoscaling
    - `getMetricsRuleError`: collects metrics related to metric rule errors
- getMetricsTsp:
  - Retrieves metrics from `TimeSeriesPrediction` resources and converts them into `PredictionMetric` objects
- getMetricsCron:
  - Retrieves metrics from `EffectiveHorizontalPodAutoscaler` resources and converts them into Prometheus metrics
- getMetricsRuleError:
  - Collects metrics related to errors in metric rules
- computePredictionMetric:
  - Computes prediction metrics from `TimeSeriesPrediction` resources and maps them to their statuses
- Utility Functions:
  - `AggregateSignalKey`: aggregates signal keys from labels
  - `MetricContains`: checks whether a prediction metric is contained in a list
Prometheus Integration
- The `CraneMetricCollector` integrates with Prometheus by implementing the `prometheus.Collector` interface
- It defines and collects custom metrics, which are exposed to Prometheus for monitoring and alerting

This covers the role of the `CraneMetricCollector` in collecting and exposing metrics for autoscaling and predictions; a minimal collector sketch follows below.
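To show how a collector like this plugs into Prometheus, here is a minimal, hypothetical `prometheus.Collector` implementation. The metric name, labels, and the constant sample value are purely illustrative; the real collector derives its samples from `TimeSeriesPrediction` and `EffectiveHorizontalPodAutoscaler` resources.

```go
// Hypothetical prometheus.Collector sketch; metric name, labels, and the
// constant value are illustrative only.
package metrics

import "github.com/prometheus/client_golang/prometheus"

type craneMetricCollector struct {
	predictionDesc *prometheus.Desc
}

func newCraneMetricCollector() *craneMetricCollector {
	return &craneMetricCollector{
		predictionDesc: prometheus.NewDesc(
			"crane_prediction_value_example", // illustrative metric name
			"Predicted value for a target resource (sketch).",
			[]string{"target", "algorithm"}, nil,
		),
	}
}

// Describe sends every metric descriptor to the channel.
func (c *craneMetricCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.predictionDesc
}

// Collect gathers current values and emits them as gauge samples; the real
// collector builds these from TimeSeriesPrediction and EHPA resources.
func (c *craneMetricCollector) Collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(
		c.predictionDesc, prometheus.GaugeValue, 42.0, "deployment/web", "dsp",
	)
}
```

Registering such a collector, for example with `prometheus.MustRegister(newCraneMetricCollector())` or with the controller manager's metrics registry, is what makes the metrics appear on the `/metrics` endpoint.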
RecommenderManager
- Purpose:
  - The `RecommenderManager` interface is responsible for managing and providing access to different recommenders based on their names and rules
- Methods:
  - GetRecommender: returns a registered recommender by name
  - GetRecommenderWithRule: returns a registered recommender with its configuration merged with a given `RecommendationRule`
Manager Implementation
- NewRecommenderManager Function:
  - Initializes a new `RecommenderManager` with a given recommendation configuration file
  - Calls `loadConfigFile` to load the recommender configurations
  - Starts a goroutine that watches the configuration file for changes using `watchConfigFile`
- Manager Struct:
  - Fields:
    - `recommendationConfiguration`: path to the configuration file
    - `lock`: a mutex for synchronizing access to `recommenderConfigs`
    - `recommenderConfigs`: a map storing the configuration of each recommender
- GetRecommender Method:
  - Calls `GetRecommenderWithRule` with an empty `RecommendationRule` to retrieve a recommender by name
- GetRecommenderWithRule Method:
  - Locks the `recommenderConfigs` map to ensure thread-safe access
  - Checks whether the requested recommender exists in `recommenderConfigs`
  - If found, calls `recommender.GetRecommenderProvider` to get the recommender with the merged configuration
  - Returns an error if the recommender name is unknown
- watchConfigFile Method:
  - Uses `fsnotify.NewWatcher` to monitor the configuration file for changes
  - Reloads the configuration when changes are detected
- loadConfigFile Method:
  - Loads recommender configurations from the specified file
  - Parses the file and populates the `recommenderConfigs` map
Kubernetes Components and Principles
- Concurrency:
  - Uses a mutex (`sync.Mutex`) to ensure thread-safe access to shared resources such as `recommenderConfigs`
- File Watching:
  - Uses `fsnotify` to watch the configuration file for changes, allowing recommender configurations to be updated dynamically
- Configuration Management:
  - Manages recommender configurations through a centralized configuration file, which makes updates and management straightforward

This covers the role of the `RecommenderManager` in managing and providing access to recommenders based on configurations and rules; a sketch of the config loading and watching pattern follows below.
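The pattern of a mutex-guarded config map that is reloaded on file changes can be sketched as follows. The YAML-keyed-by-recommender-name file format, the field names, and the logging calls are assumptions for illustration; `fsnotify.NewWatcher`, `watcher.Add`, and the `Events`/`Errors` channels are the library's real API.

```go
// Hypothetical sketch of config loading and file watching for a
// RecommenderManager-like type; file format and field names are assumed.
package recommender

import (
	"os"
	"sync"

	"github.com/fsnotify/fsnotify"
	"k8s.io/klog/v2"
	"sigs.k8s.io/yaml"
)

type manager struct {
	configPath string

	lock               sync.Mutex
	recommenderConfigs map[string]map[string]string
}

// loadConfigFile reads the file and replaces the in-memory config map.
func (m *manager) loadConfigFile() error {
	data, err := os.ReadFile(m.configPath)
	if err != nil {
		return err
	}
	configs := map[string]map[string]string{}
	if err := yaml.Unmarshal(data, &configs); err != nil {
		return err
	}
	m.lock.Lock()
	defer m.lock.Unlock()
	m.recommenderConfigs = configs
	return nil
}

// watchConfigFile reloads the configuration whenever the file changes.
func (m *manager) watchConfigFile() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		klog.ErrorS(err, "failed to create fsnotify watcher")
		return
	}
	defer watcher.Close()
	if err := watcher.Add(m.configPath); err != nil {
		klog.ErrorS(err, "failed to watch config file", "path", m.configPath)
		return
	}
	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			if event.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				if err := m.loadConfigFile(); err != nil {
					klog.ErrorS(err, "failed to reload recommender config")
				}
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			klog.ErrorS(err, "fsnotify error")
		}
	}
}
```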
Recommendation Process Flow
- Data Collection:
  - Prometheus Integration:
    - The system collects metrics from Prometheus, which serves as the data source for historical and real-time metrics
    - The `providers.History` interface is used to fetch historical data from Prometheus
- Recommendation Rule Controller:
  - Reconcile Function:
    - The `RecommendationRuleController` watches `RecommendationRule` resources
    - It retrieves the `RecommendationRule` and checks whether it needs to be processed based on its `RunInterval` and `LastUpdateTime`
    - If the rule is due for execution, it calls `doReconcile`
- Resource Identification:
  - getIdentities Function:
    - Identifies target resources based on the `ResourceSelectors` in the `RecommendationRule`
    - Uses the dynamic client and listers to fetch resources matching the selectors
- Recommendation Execution:
  - doReconcile Function:
    - Prepares a list of `ObjectIdentity` entries, one per target resource
    - Determines whether a new recommendation round is needed based on the `RunInterval`
    - Executes recommendations concurrently for the identified resources using `executeIdentity`
- Recommender Manager:
  - GetRecommenderWithRule Function:
    - Retrieves the appropriate recommender based on the `RecommendationRule`
    - Merges the rule's configuration with the recommender's default configuration
- Recommendation Calculation:
  - executeIdentity Function:
    - Creates or updates a `Recommendation` object for each target resource
    - Calls the recommender's `Run` method, passing a `RecommendationContext`
    - The recommender uses the context to fetch data from Prometheus and perform its calculations
- Writing to the CRD:
  - Recommendation Object:
    - The calculated recommendation is stored in a `Recommendation` CRD
    - The `executeIdentity` function updates the `Recommendation` object with the computed values and status
    - Uses the Kubernetes client to create or update the `Recommendation` CRD in the cluster
- Status Update:
  - updateRecommendationRuleStatus Function:
    - Updates the status of the `RecommendationRule` with the latest execution results
    - Ensures that `LastUpdateTime` and the other status fields are set correctly
Kubernetes Components and Principles
- Dynamic Client:
  - Used to interact with Kubernetes resources dynamically, allowing the system to handle arbitrary resource types
- Concurrency:
  - Recommendations are executed concurrently for multiple resources to improve efficiency
- CRD Management:
  - Custom Resource Definitions (CRDs) are used to store recommendations, providing a structured way to manage and access recommendation data
- Prometheus Data:
  - Prometheus serves as the key data source, providing the metrics needed to make informed recommendations

This covers the end-to-end process of generating recommendations from Prometheus data and writing them into CRDs; a sketch of the concurrent execution step follows below.
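As a rough illustration of the fan-out in `doReconcile`, here is a hypothetical sketch of running one recommendation per target identity concurrently. `ObjectIdentity`, `Recommendation`, and `executeIdentity` are simplified stand-ins; the real `executeIdentity` builds a `RecommendationContext`, calls the recommender's `Run`, and creates or updates the `Recommendation` CRD through the Kubernetes client.

```go
// Hypothetical sketch of concurrent recommendation execution; all types and
// helpers are simplified placeholders, not the project's real definitions.
package recommendation

import (
	"context"
	"sync"

	"k8s.io/klog/v2"
)

// ObjectIdentity identifies one target workload selected by a
// RecommendationRule's ResourceSelectors (assumed shape).
type ObjectIdentity struct {
	Namespace   string
	Kind        string
	Name        string
	Recommender string
}

// Recommendation is a placeholder for the Recommendation CRD object.
type Recommendation struct {
	TargetRef ObjectIdentity
	Value     string
}

// executeIdentity runs one recommender for one target and returns the
// recommendation that would be written back to the cluster as a CRD.
func executeIdentity(ctx context.Context, id ObjectIdentity) (*Recommendation, error) {
	// In the real controller this builds a RecommendationContext, pulls
	// metrics from Prometheus, and calls the recommender's Run method.
	return &Recommendation{TargetRef: id, Value: "cpu: 500m"}, nil
}

// executeAll fans the identities out to goroutines, mirroring the
// concurrent execution described above.
func executeAll(ctx context.Context, ids []ObjectIdentity) []*Recommendation {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results []*Recommendation
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id ObjectIdentity) {
			defer wg.Done()
			rec, err := executeIdentity(ctx, id)
			if err != nil {
				klog.ErrorS(err, "recommendation failed", "target", id.Name)
				return
			}
			mu.Lock()
			results = append(results, rec)
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return results
}
```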
Prometheus Data Retrieval Process
- Provider Initialization:
  - NewProvider Function:
    - The process begins with the initialization of a Prometheus data provider via the `NewProvider` function
    - It takes a `PromConfig` as input, which contains configuration details such as the Prometheus server address and authentication information
    - It calls `NewPrometheusClient` to create a Prometheus client
- Prometheus Client Creation:
  - NewPrometheusClient Function:
    - Sets up a Prometheus client using the provided configuration
    - Configures HTTP transport settings, including TLS and proxy settings
    - If rate limiting is enabled, creates a `prometheusRateLimitClient`; otherwise, creates a `prometheusAuthClient`
- Query Execution:
  - QueryTimeSeries Function:
    - Responsible for querying time series data from Prometheus
    - Constructs a PromQL query using a `MetricNamer`, which provides a query builder for the specific metric source
    - The query is built via the query builder's `BuildQuery` method
- Context and Timeout Management:
  - A context with a timeout is created using `gocontext.WithTimeout` so the query does not run indefinitely
  - The context is used when executing the query, allowing cancellation once the timeout is reached
- Query Range Execution:
  - QueryRangeSync Method:
    - The `QueryRangeSync` method of the `context` struct is called to execute the query over the specified time range
    - It sends the query to the Prometheus server and retrieves the time series data
- Data Processing:
  - The retrieved data is processed into a slice of `common.TimeSeries`, which represents the time series data in a structured format (a sketch of the underlying range query follows below)
- Error Handling:
  - Throughout the process, errors are logged with `klog` to provide detailed information about issues encountered during query execution
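For context, this is roughly what a range query looks like when issued directly with the official Prometheus Go client (`github.com/prometheus/client_golang/api`); the crane provider wraps an equivalent call behind its `QueryTimeSeries`/`QueryRangeSync` methods and converts the result into `common.TimeSeries`. The server address, PromQL expression, and time range below are illustrative.

```go
// Hypothetical range-query sketch using the official Prometheus Go client;
// address, query, and range are illustrative only.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// Build the HTTP client that talks to the Prometheus server.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Bound the query with a timeout, as the provider does with
	// gocontext.WithTimeout.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// A PromQL query over the last hour at 1-minute resolution.
	query := `sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m]))`
	r := promv1.Range{
		Start: time.Now().Add(-1 * time.Hour),
		End:   time.Now(),
		Step:  time.Minute,
	}

	value, warnings, err := promAPI.QueryRange(ctx, query, r)
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}

	// Each matrix sample stream corresponds to one time series; crane
	// converts these into its common.TimeSeries representation.
	if matrix, ok := value.(model.Matrix); ok {
		for _, stream := range matrix {
			fmt.Println(stream.Metric, len(stream.Values), "samples")
		}
	}
}
```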
Key Components and Principles
- Prometheus Client:
  - Responsible for sending HTTP requests to the Prometheus server and handling responses
  - Supports authentication and rate limiting to manage query load
- PromQL:
  - PromQL is the query language used to specify the metrics and time range for data retrieval
  - The system constructs PromQL queries dynamically based on the metrics needed for analysis
- Concurrency and Rate Limiting:
  - Rate limiting controls the number of concurrent queries sent to Prometheus, preventing overload
  - The `prometheusRateLimitClient` tracks in-flight requests and blocks new requests once the limit is reached

This covers the process of retrieving data from Prometheus, highlighting the function calls and components involved in the workflow.