Excerpted from my notes. The project is quite large; I hope this gives fellow readers some inspiration. This is mainly a source-code walkthrough of selected modules.
Architecture
The project structure is organized into several directories and files, each serving a specific purpose. Here's a brief overview of the main components:
- cmd/
  - metric-adapter/: Contains code related to the metric-adapter component
  - craned/: Contains the main application code for the `craned` command
    - app/: Likely contains application-specific logic and configuration
    - `main.go`: The entry point for the `craned` application
  - crane-agent/: Contains code related to the crane-agent component
- deploy/: Deployment scripts and configurations for deploying the application
- docs/: Documentation files for the project
- examples/: Example configurations and usage examples for the project
- hack/: Scripts and utilities for development and maintenance tasks
- overrides/: Possibly contains configuration overrides or custom settings
- pkg/: The core library code, which can be used by other applications or commands
- site/: Likely related to the project's website or documentation site
- tools/: Additional tools and scripts that support the project
- Files:
  - `go.mod` and `go.sum`: Go module files that manage dependencies
  - `README.md` and `README_zh.md`: Documentation providing an overview of and instructions for the project
  - `Makefile`: Build instructions and tasks for the project
  - `Dockerfile`: Instructions for building a Docker image of the application
  - `.gitignore`: Files and directories to be ignored by Git
  - `.golangci.yaml`: Configuration for golangci-lint, a Go linter
  - `CONTRIBUTING.md` and `CODE_OF_CONDUCT.md`: Guidelines for contributing to the project and expected behavior
Manager
- Command Initialization: `NewManagerCommand(ctx context.Context) *cobra.Command`:
  - Creates a new Cobra command for `craned`
  - Initializes options using `options.NewOptions()`
  - Sets up command flags and feature gates
- Command Execution: the `Run` function within the Cobra command is executed:
  - Calls `opts.Complete()` to complete option setup
  - Validates options with `opts.Validate()`
  - Executes `Run(ctx, opts)` to start the manager
- Run Function: `Run(ctx context.Context, opts *options.Options) error`:
  - Configures the controller manager with `ctrl.GetConfigOrDie()`
  - Sets up controller options, including leader election and metrics binding
  - Creates a new manager with `ctrl.NewManager(config, ctrlOptions)`
- Health Checks:
  - Adds health and readiness checks using `mgr.AddHealthzCheck` and `mgr.AddReadyzCheck`
- Data Sources and Predictor Initialization:
  - `initDataSources(mgr, opts)`: initializes real-time and historical data sources
  - `initPredictorManager(opts, realtimeDataSources, historyDataSources)`: sets up the predictor manager using the initialized data sources
- Scheme and Indexer Initialization:
  - `initScheme()`: registers API types with the runtime scheme
  - `initFieldIndexer(mgr)`: sets up field indexers for efficient querying
- Webhooks Initialization:
  - `initWebhooks(mgr, opts)`: configures webhooks if necessary
- Pod OOM Recorder:
  - Initializes `oom.PodOOMRecorder` to track out-of-memory events
  - Sets up the recorder with the manager and runs it in a separate goroutine
- Controller Initialization:
  - `initControllers(ctx, podOOMRecorder, mgr, opts, predictorMgr, historyDataSources[providers.PrometheusDataSource])`: initializes various controllers, including the analytics and recommendation controllers
- Metric Collector Initialization:
  - `initMetricCollector(mgr)`: sets up custom metrics collection
- Running All Components:
  - `runAll(ctx, mgr, predictorMgr, dataSourceProviders[providers.PrometheusDataSource], opts)`: starts all components, ensuring that the manager and predictor manager are running (a minimal sketch of this startup flow follows below)
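To make the flow above concrete, here is a minimal, hypothetical sketch of a `craned`-style entry point built from Cobra and controller-runtime. The `Options` type, flag names, and leader-election ID are placeholders; only `ctrl.GetConfigOrDie`, `ctrl.NewManager`, and the health/readiness check registration mirror the calls listed above, and the data source, predictor, webhook, and controller initialization steps are omitted.

```go
// Hypothetical sketch of the craned startup flow; not the project's actual code.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/spf13/cobra"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// Options stands in for the real options.Options type.
type Options struct {
	LeaderElect bool
}

func (o *Options) Complete() error { return nil }
func (o *Options) Validate() error { return nil }

// NewManagerCommand builds the Cobra command: complete and validate the
// options, then hand off to Run.
func NewManagerCommand(ctx context.Context) *cobra.Command {
	opts := &Options{}
	cmd := &cobra.Command{
		Use: "craned",
		RunE: func(cmd *cobra.Command, args []string) error {
			if err := opts.Complete(); err != nil {
				return err
			}
			if err := opts.Validate(); err != nil {
				return err
			}
			return Run(ctx, opts)
		},
	}
	cmd.Flags().BoolVar(&opts.LeaderElect, "leader-elect", true, "enable leader election")
	return cmd
}

// Run builds the controller-runtime manager, registers health and readiness
// checks, and starts it. Data sources, predictors, webhooks, and controllers
// would be initialized here in the real code.
func Run(ctx context.Context, opts *Options) error {
	config := ctrl.GetConfigOrDie()
	mgr, err := ctrl.NewManager(config, ctrl.Options{
		LeaderElection:   opts.LeaderElect,
		LeaderElectionID: "craned-sketch", // placeholder ID
	})
	if err != nil {
		return err
	}
	if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
		return err
	}
	if err := mgr.AddReadyzCheck("ping", healthz.Ping); err != nil {
		return err
	}
	return mgr.Start(ctx)
}

func main() {
	if err := NewManagerCommand(context.Background()).Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```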
Controllers
PodOOMRecorder
- SetupWithManager: `SetupWithManager(mgr ctrl.Manager) error`:
  - Registers the `PodOOMRecorder` with the controller manager
  - Sets up the reconciler to watch pod events and trigger the `Reconcile` function
- Reconcile Function: `Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)`:
  - Triggered by the controller manager for each pod event
  - Retrieves the pod specified in the request
  - Checks whether the pod was terminated by an OOM event using `IsOOMKilled(pod)`
  - If OOM-killed, iterates over the container statuses to find the terminated container
  - For each OOM-killed container, checks whether memory requests are set
  - Increments the OOM count metric and adds an `OOMRecord` to the queue
- Run Function: `Run(stopCh <-chan struct{}) error`:
  - Continuously processes OOM records from the queue
  - Uses a ticker to periodically clean old OOM records so the ConfigMap storage limit is not exceeded
  - Retrieves OOM records using `GetOOMRecord`
  - Updates OOM records with `updateOOMRecord`
- GetOOMRecord Function: `GetOOMRecord() ([]OOMRecord, error)`:
  - Retrieves OOM records from the cache or from the ConfigMap
  - If the cache is nil, fetches the ConfigMap and unmarshals the OOM records
- cleanOOMRecords Function: `cleanOOMRecords(oomRecords []OOMRecord) []OOMRecord`:
  - Cleans up old OOM records to stay within the maximum number specified by `OOMRecordMaxNumber`
- updateOOMRecord Function: `updateOOMRecord(oomRecord OOMRecord, saved []OOMRecord) error`:
  - Updates the OOM records in the cache or storage with new records
- IsOOMKilled Function: `IsOOMKilled(pod *v1.Pod) bool`:
  - Helper function that determines whether a pod was terminated due to an OOM event
Detailed Flow
- Initialization: the `PodOOMRecorder` is set up with the manager via `SetupWithManager`, which registers it to watch pod events
- Event Handling: for each pod event, `Reconcile` is called to check for OOM events and record them
- Record Management: the `Run` function processes the queue of OOM records, periodically cleans old records, and updates the storage
- Data Retrieval: `GetOOMRecord` fetches the current OOM records, either from the cache or from the ConfigMap

This covers the `PodOOMRecorder`'s functionality and its role in tracking OOM events in a Kubernetes cluster; a simplified sketch of the reconcile step follows below.
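Below is a hypothetical, self-contained sketch of what the reconcile step described above could look like. `OOMRecord`, the in-memory queue, and the field names are stand-ins for the project's real types; only the overall shape (fetch the pod, skip pods that were not OOM-killed, record OOM-killed containers that declare a memory request) follows the walkthrough, and the pod-level check is inlined for brevity.

```go
// Hypothetical sketch of the PodOOMRecorder reconcile step; types and field
// names are stand-ins for the project's own definitions.
package oom

import (
	"context"
	"sync"
	"time"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// OOMRecord captures one observed OOM kill (assumed shape).
type OOMRecord struct {
	Namespace string
	Pod       string
	Container string
	OOMAt     time.Time
}

// PodOOMRecorder watches pods and queues OOM records for later persistence.
type PodOOMRecorder struct {
	client.Client

	mu    sync.Mutex
	queue []OOMRecord
}

func (r *PodOOMRecorder) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Retrieve the pod referenced by the request.
	pod := &v1.Pod{}
	if err := r.Get(ctx, req.NamespacedName, pod); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}

	// Record every container whose last termination was an OOM kill and
	// whose spec declares a memory request.
	for _, cs := range pod.Status.ContainerStatuses {
		term := cs.LastTerminationState.Terminated
		if cs.RestartCount == 0 || term == nil || term.Reason != "OOMKilled" {
			continue
		}
		for _, c := range pod.Spec.Containers {
			if c.Name != cs.Name {
				continue
			}
			if _, ok := c.Resources.Requests[v1.ResourceMemory]; !ok {
				continue
			}
			r.mu.Lock()
			r.queue = append(r.queue, OOMRecord{
				Namespace: pod.Namespace,
				Pod:       pod.Name,
				Container: cs.Name,
				OOMAt:     term.FinishedAt.Time,
			})
			r.mu.Unlock()
		}
	}
	return ctrl.Result{}, nil
}

// SetupWithManager registers the recorder so it is reconciled on pod events.
func (r *PodOOMRecorder) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&v1.Pod{}).Complete(r)
}
```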
GetOOMRecord Function
- Mutex Locking:
  - The function begins by acquiring a lock with `r.mu.Lock()` to ensure thread-safe access to the cache, and releases it with `defer r.mu.Unlock()`
- Cache Check:
  - It checks whether the cache is `nil`. If the cache is not initialized, it proceeds to fetch the OOM records from a Kubernetes ConfigMap
- ConfigMap Retrieval:
  - Uses the Kubernetes client (`r.Client`) to get the ConfigMap that stores OOM records
  - The ConfigMap is identified by its namespace (`known.CraneSystemNamespace`) and name (`ConfigMapOOMRecordName`)
- Error Handling:
  - If the ConfigMap is not found (`apierrors.IsNotFound(err)`), it returns `nil`, indicating no records are available
  - For other errors, it returns the error to the caller
- Data Unmarshalling:
  - If the ConfigMap is found, it retrieves the data field (`ConfigMapDataOOMRecord`) containing the OOM records
  - Uses `json.Unmarshal` to convert the JSON data into a slice of `OOMRecord` structs
- Return Records:
  - Returns the unmarshalled OOM records, or the cached records if the cache is already initialized
Kubernetes Components and Principles
- Kubernetes Client:
  - The function uses the `client.Client` interface from controller-runtime (built on client-go) to interact with the Kubernetes API server
  - This client provides methods to perform CRUD operations on Kubernetes resources, such as ConfigMaps
- ConfigMap:
  - A ConfigMap is a Kubernetes resource used to store non-confidential data in key-value pairs
  - In this context, it is used to persist OOM records across pod or application restarts
- Concurrency:
  - A mutex (`sync.Mutex`) ensures that access to shared resources (like the cache) is synchronized, preventing race conditions
- Error Handling:
  - The function handles errors gracefully, distinguishing not-found errors from other error types to provide appropriate responses

This explains how `GetOOMRecord` retrieves OOM records using Kubernetes primitives and basic concurrency control; a sketch follows below.
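As a rough illustration, here is a hypothetical, self-contained version of the flow just described: lock, return the cache if present, otherwise read the ConfigMap and unmarshal its data. The namespace, ConfigMap name, and data key constants are assumed values, not the project's actual ones.

```go
// Hypothetical sketch of GetOOMRecord; constants and types are assumptions.
package oom

import (
	"context"
	"encoding/json"
	"sync"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	craneSystemNamespace   = "crane-system" // assumed value of known.CraneSystemNamespace
	configMapOOMRecordName = "oom-record"   // assumed value of ConfigMapOOMRecordName
	configMapDataOOMRecord = "oom-records"  // assumed value of ConfigMapDataOOMRecord
)

// OOMRecord is a stand-in for the persisted record type.
type OOMRecord struct {
	Namespace string `json:"namespace"`
	Pod       string `json:"pod"`
	Container string `json:"container"`
}

type PodOOMRecorder struct {
	client.Client

	mu    sync.Mutex
	cache []OOMRecord
}

// GetOOMRecord returns cached records if present, otherwise loads them from
// the ConfigMap that persists OOM records across restarts.
func (r *PodOOMRecorder) GetOOMRecord() ([]OOMRecord, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.cache != nil {
		return r.cache, nil
	}

	cm := &v1.ConfigMap{}
	err := r.Get(context.TODO(), types.NamespacedName{
		Namespace: craneSystemNamespace,
		Name:      configMapOOMRecordName,
	}, cm)
	if err != nil {
		if apierrors.IsNotFound(err) {
			// No ConfigMap yet means nothing has been persisted.
			return nil, nil
		}
		return nil, err
	}

	var records []OOMRecord
	if err := json.Unmarshal([]byte(cm.Data[configMapDataOOMRecord]), &records); err != nil {
		return nil, err
	}
	return records, nil
}
```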
IsOOMKilled Function
- Function Purpose:
  - `IsOOMKilled` checks whether any container within a given pod was terminated due to an OOM event
- Iterating Over Container Statuses:
  - The function iterates over the `ContainerStatuses` slice in the pod's status, which contains the status of each container in the pod
- Checking Conditions:
  - For each `containerStatus`, it checks the following:
    - Restart Count: the `RestartCount` must be greater than 0, indicating that the container has been restarted
    - Termination State: `LastTerminationState.Terminated` must not be `nil`, meaning the container has a recorded termination state
    - Termination Reason: the termination `Reason` must be `"OOMKilled"`, indicating that the container was terminated due to an OOM event
- Return Value:
  - If all of the above conditions are met for any container, the function returns `true`, indicating that the pod was OOM-killed
  - If no container meets these conditions, the function returns `false`
Kubernetes Components and Principles
- Pod Status:
  - The function uses the `v1.Pod` type from the Kubernetes API, which includes the status of the pod and its containers
  - The `ContainerStatuses` field provides detailed information about each container's state, including termination details
- OOMKilled Reason:
  - `"OOMKilled"` is a standard Kubernetes termination reason indicating that a container was terminated for exceeding its memory limit

This explains how `IsOOMKilled` uses pod status information to determine whether a pod was terminated by an OOM event; a compact sketch follows below.
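A compact sketch of a helper with this behavior, built only from the three conditions listed above; it paraphrases the walkthrough rather than reproducing the project's source verbatim.

```go
// Minimal sketch of the IsOOMKilled check described above.
package oom

import v1 "k8s.io/api/core/v1"

// IsOOMKilled reports whether any container in the pod was last terminated
// because it exceeded its memory limit.
func IsOOMKilled(pod *v1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		// The container must have restarted at least once.
		if cs.RestartCount == 0 {
			continue
		}
		// There must be a recorded termination state.
		term := cs.LastTerminationState.Terminated
		if term == nil {
			continue
		}
		// The recorded reason must be the standard OOM-kill reason.
		if term.Reason == "OOMKilled" {
			return true
		}
	}
	return false
}
```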
MetricCollector
CraneMetricCollector Structure
- Purpose:
  - The `CraneMetricCollector` is responsible for collecting and exposing metrics related to autoscaling and predictions in a Kubernetes environment
- Fields:
  - Client: a Kubernetes client for interacting with the cluster
  - Scaler: a `scale.ScalesGetter` for retrieving scale subresources
  - RESTMapper: a `meta.RESTMapper` for mapping resources to their REST paths
  - Prometheus descriptors:
    - `metricAutoScalingCron`: describes external metrics for cron-based HPA
    - `metricAutoScalingPrediction`: describes external metrics for prediction-based HPA
    - `metricPredictionTsp`: describes model metrics for time series prediction (TSP)
    - `metricMetricRuleError`: describes metrics related to metric rule errors
- PredictionMetric Structure:
  - Represents a prediction metric, with fields for the description, target details, algorithm, metric value, and timestamp
Functions
- NewCraneMetricCollector:
  - Initializes a new `CraneMetricCollector` with the provided Kubernetes client, scale client, and REST mapper
- Describe:
  - Implements the `prometheus.Collector` interface
  - Sends the descriptor of each metric to the provided channel
- Collect:
  - Implements the `prometheus.Collector` interface
  - Collects metrics and sends them to the provided channel
  - Calls helper functions to gather specific metrics:
    - `getMetricsTsp`: collects metrics related to time series predictions
    - `getMetricsCron`: collects metrics related to cron-based autoscaling
    - `getMetricsRuleError`: collects metrics related to metric rule errors
- getMetricsTsp:
  - Retrieves metrics from `TimeSeriesPrediction` resources and converts them into `PredictionMetric` objects
- getMetricsCron:
  - Retrieves metrics from `EffectiveHorizontalPodAutoscaler` resources and converts them into Prometheus metrics
- getMetricsRuleError:
  - Collects metrics related to errors in metric rules
- computePredictionMetric:
  - Computes prediction metrics from `TimeSeriesPrediction` resources and maps them to their statuses
- Utility Functions:
  - `AggregateSignalKey`: aggregates signal keys from labels
  - `MetricContains`: checks whether a prediction metric is contained in a list
Prometheus Integration
- The `CraneMetricCollector` integrates with Prometheus by implementing the `prometheus.Collector` interface
- It defines and collects custom metrics, which are exposed to Prometheus for monitoring and alerting

This covers the role of the `CraneMetricCollector` in collecting and exposing metrics for autoscaling and predictions; a minimal collector sketch follows below.
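To show how a collector like this plugs into Prometheus, here is a minimal, hypothetical `prometheus.Collector` implementation. The metric name, labels, and the constant sample value are purely illustrative; the real collector derives its samples from `TimeSeriesPrediction` and `EffectiveHorizontalPodAutoscaler` resources.

```go
// Hypothetical prometheus.Collector sketch; metric name, labels, and the
// constant value are illustrative only.
package metrics

import "github.com/prometheus/client_golang/prometheus"

type craneMetricCollector struct {
	predictionDesc *prometheus.Desc
}

func newCraneMetricCollector() *craneMetricCollector {
	return &craneMetricCollector{
		predictionDesc: prometheus.NewDesc(
			"crane_prediction_value_example", // illustrative metric name
			"Predicted value for a target resource (sketch).",
			[]string{"target", "algorithm"}, nil,
		),
	}
}

// Describe sends every metric descriptor to the channel.
func (c *craneMetricCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.predictionDesc
}

// Collect gathers current values and emits them as gauge samples; the real
// collector builds these from TimeSeriesPrediction and EHPA resources.
func (c *craneMetricCollector) Collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(
		c.predictionDesc, prometheus.GaugeValue, 42.0, "deployment/web", "dsp",
	)
}
```

Registering such a collector, for example with `prometheus.MustRegister(newCraneMetricCollector())` or with the controller manager's metrics registry, is what makes the metrics appear on the `/metrics` endpoint.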
RecommenderManager
- Purpose:
  - The `RecommenderManager` interface is responsible for managing and providing access to different recommenders based on their names and rules
- Methods:
  - GetRecommender: returns a registered recommender by name
  - GetRecommenderWithRule: returns a registered recommender with its configuration merged with a given `RecommendationRule`
Manager Implementation
- NewRecommenderManager Function:
  - Initializes a new `RecommenderManager` with a given recommendation configuration file
  - Calls `loadConfigFile` to load the recommender configurations
  - Starts a goroutine that watches the configuration file for changes using `watchConfigFile`
- Manager Struct:
  - Fields:
    - `recommendationConfiguration`: path to the configuration file
    - `lock`: a mutex for synchronizing access to `recommenderConfigs`
    - `recommenderConfigs`: a map storing the configuration of each recommender
- GetRecommender Method:
  - Calls `GetRecommenderWithRule` with an empty `RecommendationRule` to retrieve a recommender by name
- GetRecommenderWithRule Method:
  - Locks the `recommenderConfigs` map to ensure thread-safe access
  - Checks whether the requested recommender exists in `recommenderConfigs`
  - If found, calls `recommender.GetRecommenderProvider` to get the recommender with the merged configuration
  - Returns an error if the recommender name is unknown
- watchConfigFile Method:
  - Uses `fsnotify.NewWatcher` to monitor the configuration file for changes
  - Reloads the configuration when changes are detected
- loadConfigFile Method:
  - Loads recommender configurations from the specified file
  - Parses the file and populates the `recommenderConfigs` map
Kubernetes Components and Principles
- Concurrency:
  - Uses a mutex (`sync.Mutex`) to ensure thread-safe access to shared resources such as `recommenderConfigs`
- File Watching:
  - Uses `fsnotify` to watch the configuration file for changes, allowing recommender configurations to be updated dynamically
- Configuration Management:
  - Manages recommender configurations through a centralized configuration file, which makes updates and management straightforward

This covers the role of the `RecommenderManager` in managing and providing access to recommenders based on configurations and rules; a sketch of the config loading and watching pattern follows below.
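The pattern of a mutex-guarded config map that is reloaded on file changes can be sketched as follows. The YAML-keyed-by-recommender-name file format, the field names, and the logging calls are assumptions for illustration; `fsnotify.NewWatcher`, `watcher.Add`, and the `Events`/`Errors` channels are the library's real API.

```go
// Hypothetical sketch of config loading and file watching for a
// RecommenderManager-like type; file format and field names are assumed.
package recommender

import (
	"os"
	"sync"

	"github.com/fsnotify/fsnotify"
	"k8s.io/klog/v2"
	"sigs.k8s.io/yaml"
)

type manager struct {
	configPath string

	lock               sync.Mutex
	recommenderConfigs map[string]map[string]string
}

// loadConfigFile reads the file and replaces the in-memory config map.
func (m *manager) loadConfigFile() error {
	data, err := os.ReadFile(m.configPath)
	if err != nil {
		return err
	}
	configs := map[string]map[string]string{}
	if err := yaml.Unmarshal(data, &configs); err != nil {
		return err
	}
	m.lock.Lock()
	defer m.lock.Unlock()
	m.recommenderConfigs = configs
	return nil
}

// watchConfigFile reloads the configuration whenever the file changes.
func (m *manager) watchConfigFile() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		klog.ErrorS(err, "failed to create fsnotify watcher")
		return
	}
	defer watcher.Close()
	if err := watcher.Add(m.configPath); err != nil {
		klog.ErrorS(err, "failed to watch config file", "path", m.configPath)
		return
	}
	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			if event.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				if err := m.loadConfigFile(); err != nil {
					klog.ErrorS(err, "failed to reload recommender config")
				}
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			klog.ErrorS(err, "fsnotify error")
		}
	}
}
```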
Recommendation Process Flow
- Data Collection:
  - Prometheus Integration:
    - The system collects metrics from Prometheus, which serves as the data source for historical and real-time metrics
    - The `providers.History` interface is used to fetch historical data from Prometheus
- Recommendation Rule Controller:
  - Reconcile Function:
    - The `RecommendationRuleController` watches `RecommendationRule` resources
    - It retrieves the `RecommendationRule` and checks whether it needs to be processed based on its `RunInterval` and `LastUpdateTime`
    - If the rule is due for execution, it calls `doReconcile`
- Resource Identification:
  - getIdentities Function:
    - Identifies target resources based on the `ResourceSelectors` in the `RecommendationRule`
    - Uses the dynamic client and listers to fetch resources matching the selectors
- Recommendation Execution:
  - doReconcile Function:
    - Prepares a list of `ObjectIdentity` entries, one per target resource
    - Determines whether a new recommendation round is needed based on the `RunInterval`
    - Executes recommendations concurrently for the identified resources using `executeIdentity`
- Recommender Manager:
  - GetRecommenderWithRule Function:
    - Retrieves the appropriate recommender based on the `RecommendationRule`
    - Merges the rule's configuration with the recommender's default configuration
- Recommendation Calculation:
  - executeIdentity Function:
    - Creates or updates a `Recommendation` object for each target resource
    - Calls the recommender's `Run` method, passing a `RecommendationContext`
    - The recommender uses the context to fetch data from Prometheus and perform its calculations
- Writing to the CRD:
  - Recommendation Object:
    - The calculated recommendation is stored in a `Recommendation` CRD
    - The `executeIdentity` function updates the `Recommendation` object with the computed values and status
    - Uses the Kubernetes client to create or update the `Recommendation` CRD in the cluster
- Status Update:
  - updateRecommendationRuleStatus Function:
    - Updates the status of the `RecommendationRule` with the latest execution results
    - Ensures that `LastUpdateTime` and the other status fields are set correctly
Kubernetes Components and Principles
- Dynamic Client:
  - Used to interact with Kubernetes resources dynamically, allowing the system to handle arbitrary resource types
- Concurrency:
  - Recommendations are executed concurrently for multiple resources to improve efficiency
- CRD Management:
  - Custom Resource Definitions (CRDs) are used to store recommendations, providing a structured way to manage and access recommendation data
- Prometheus Data:
  - Prometheus serves as the key data source, providing the metrics needed to make informed recommendations

This covers the end-to-end process of generating recommendations from Prometheus data and writing them into CRDs; a sketch of the concurrent execution step follows below.
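As a rough illustration of the fan-out in `doReconcile`, here is a hypothetical sketch of running one recommendation per target identity concurrently. `ObjectIdentity`, `Recommendation`, and `executeIdentity` are simplified stand-ins; the real `executeIdentity` builds a `RecommendationContext`, calls the recommender's `Run`, and creates or updates the `Recommendation` CRD through the Kubernetes client.

```go
// Hypothetical sketch of concurrent recommendation execution; all types and
// helpers are simplified placeholders, not the project's real definitions.
package recommendation

import (
	"context"
	"sync"

	"k8s.io/klog/v2"
)

// ObjectIdentity identifies one target workload selected by a
// RecommendationRule's ResourceSelectors (assumed shape).
type ObjectIdentity struct {
	Namespace   string
	Kind        string
	Name        string
	Recommender string
}

// Recommendation is a placeholder for the Recommendation CRD object.
type Recommendation struct {
	TargetRef ObjectIdentity
	Value     string
}

// executeIdentity runs one recommender for one target and returns the
// recommendation that would be written back to the cluster as a CRD.
func executeIdentity(ctx context.Context, id ObjectIdentity) (*Recommendation, error) {
	// In the real controller this builds a RecommendationContext, pulls
	// metrics from Prometheus, and calls the recommender's Run method.
	return &Recommendation{TargetRef: id, Value: "cpu: 500m"}, nil
}

// executeAll fans the identities out to goroutines, mirroring the
// concurrent execution described above.
func executeAll(ctx context.Context, ids []ObjectIdentity) []*Recommendation {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results []*Recommendation
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id ObjectIdentity) {
			defer wg.Done()
			rec, err := executeIdentity(ctx, id)
			if err != nil {
				klog.ErrorS(err, "recommendation failed", "target", id.Name)
				return
			}
			mu.Lock()
			results = append(results, rec)
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return results
}
```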
Prometheus Data Retrieval Process
- Provider Initialization:
  - NewProvider Function:
    - The process begins with the initialization of a Prometheus data provider via the `NewProvider` function
    - It takes a `PromConfig` as input, which contains configuration details such as the Prometheus server address and authentication information
    - It calls `NewPrometheusClient` to create a Prometheus client
- Prometheus Client Creation:
  - NewPrometheusClient Function:
    - Sets up a Prometheus client using the provided configuration
    - Configures HTTP transport settings, including TLS and proxy settings
    - If rate limiting is enabled, creates a `prometheusRateLimitClient`; otherwise, creates a `prometheusAuthClient`
- Query Execution:
  - QueryTimeSeries Function:
    - Responsible for querying time series data from Prometheus
    - Constructs a PromQL query using a `MetricNamer`, which provides a query builder for the specific metric source
    - The query is built via the query builder's `BuildQuery` method
- Context and Timeout Management:
  - A context with a timeout is created using `gocontext.WithTimeout` so the query does not run indefinitely
  - The context is used when executing the query, allowing cancellation once the timeout is reached
- Query Range Execution:
  - QueryRangeSync Method:
    - The `QueryRangeSync` method of the `context` struct is called to execute the query over the specified time range
    - It sends the query to the Prometheus server and retrieves the time series data
- Data Processing:
  - The retrieved data is processed into a slice of `common.TimeSeries`, which represents the time series data in a structured format (a sketch of the underlying range query follows below)
- Error Handling:
  - Throughout the process, errors are logged with `klog` to provide detailed information about issues encountered during query execution
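For context, this is roughly what a range query looks like when issued directly with the official Prometheus Go client (`github.com/prometheus/client_golang/api`); the crane provider wraps an equivalent call behind its `QueryTimeSeries`/`QueryRangeSync` methods and converts the result into `common.TimeSeries`. The server address, PromQL expression, and time range below are illustrative.

```go
// Hypothetical range-query sketch using the official Prometheus Go client;
// address, query, and range are illustrative only.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// Build the HTTP client that talks to the Prometheus server.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Bound the query with a timeout, as the provider does with
	// gocontext.WithTimeout.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// A PromQL query over the last hour at 1-minute resolution.
	query := `sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m]))`
	r := promv1.Range{
		Start: time.Now().Add(-1 * time.Hour),
		End:   time.Now(),
		Step:  time.Minute,
	}

	value, warnings, err := promAPI.QueryRange(ctx, query, r)
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}

	// Each matrix sample stream corresponds to one time series; crane
	// converts these into its common.TimeSeries representation.
	if matrix, ok := value.(model.Matrix); ok {
		for _, stream := range matrix {
			fmt.Println(stream.Metric, len(stream.Values), "samples")
		}
	}
}
```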
Key Components and Principles
- Prometheus Client:
  - Responsible for sending HTTP requests to the Prometheus server and handling responses
  - Supports authentication and rate limiting to manage query load
- PromQL:
  - PromQL is the query language used to specify the metrics and time range for data retrieval
  - The system constructs PromQL queries dynamically based on the metrics needed for analysis
- Concurrency and Rate Limiting:
  - Rate limiting controls the number of concurrent queries sent to Prometheus, preventing overload
  - The `prometheusRateLimitClient` tracks in-flight requests and blocks new requests once the limit is reached

This covers the process of retrieving data from Prometheus, highlighting the function calls and components involved in the workflow.