Excerpted from my notes. The project is very large; I hope this gives fellow enthusiasts some inspiration. The focus here is a source-code walkthrough of selected modules, mainly the recommendation-related ones.

Architecture

The project structure is organized into several directories and files, each serving a specific purpose. Here’s a brief overview of the main components

  • cmd/
    • metric-adapter/: Contains code related to the metric adapter component
    • craned/: Contains the main application code for the craned command
      • app/: Application wiring for craned (option handling, manager setup, and the initialization flow described in the Manager section below)
      • main.go: The entry point for the craned application
    • crane-agent/: Contains code related to the crane agent component
  • deploy/: Contains deployment scripts and configurations for deploying the application
  • docs/: Documentation files for the project
  • examples/: Example configurations or usage examples for the project
  • hack/: Scripts and utilities for development and maintenance tasks
  • overrides/: Possibly contains configuration overrides or custom settings
  • pkg/: Contains the core library code, which can be used by other applications or commands
    • site/: Sources for the project’s documentation website
  • tools/: Additional tools or scripts that support the project
  • Files:
    • go.mod and go.sum: Go module files that manage dependencies
    • README.md and README_zh.md: Documentation files providing an overview and instructions for the project
    • Makefile: Contains build instructions and tasks for the project
    • Dockerfile: Instructions for building a Docker image of the application
    • .gitignore: Specifies files and directories to be ignored by Git
    • .golangci.yaml: Configuration for golangci-lint, the Go linters aggregator used by the project
    • CONTRIBUTING.md and CODE_OF_CONDUCT.md: Guidelines for contributing to the project and expected behavior

Manager

  1. Command Initialization:
    • NewManagerCommand(ctx context.Context) *cobra.Command:
      • Creates a new Cobra command for craned
      • Initializes options using options.NewOptions()
      • Sets up command flags and feature gates
  2. Command Execution:
    • The Run function within the Cobra command is executed:
      • Calls opts.Complete() to complete option setup
      • Validates options with opts.Validate()
      • Executes Run(ctx, opts) to start the manager
  3. Run Function:
    • Run(ctx context.Context, opts *options.Options) error:
      • Loads the Kubernetes REST config via ctrl.GetConfigOrDie()
      • Sets up controller options, including leader election and metrics binding
      • Creates a new manager with ctrl.NewManager(config, ctrlOptions)
  4. Health Checks:
    • Adds health and readiness checks using mgr.AddHealthzCheck and mgr.AddReadyzCheck
  5. Data Sources and Predictor Initialization:
    • initDataSources(mgr, opts): Initializes real-time and historical data sources
    • initPredictorManager(opts, realtimeDataSources, historyDataSources): Sets up the predictor manager using the initialized data sources
  6. Scheme and Indexer Initialization:
    • initScheme(): Registers API types with the runtime scheme
    • initFieldIndexer(mgr): Sets up field indexers for efficient querying
  7. Webhooks Initialization:
    • initWebhooks(mgr, opts): Configures webhooks if necessary
  8. Pod OOM Recorder:
    • Initializes oom.PodOOMRecorder to track out-of-memory events
    • Sets up the recorder with the manager and runs it in a separate goroutine
  9. Controller Initialization:
    • initControllers(ctx, podOOMRecorder, mgr, opts, predictorMgr, historyDataSources[providers.PrometheusDataSource]): Initializes various controllers, including analytics and recommendation controllers
  10. Metric Collector Initialization:
    • initMetricCollector(mgr): Sets up custom metrics collection
  11. Running All Components:
    • runAll(ctx, mgr, predictorMgr, dataSourceProviders[providers.PrometheusDataSource], opts):
      • Starts all components, ensuring that the manager and predictor manager are running (a sketch of this wiring follows the list)
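
A minimal sketch of this wiring follows, with options.Options reduced to two fields and all craned-specific initialization collapsed into comments. Field names such as MetricsBindAddress follow older controller-runtime releases, and the exact signatures in cmd/craned/app may differ:

```go
// Sketch only: options and initialization are heavily simplified.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/spf13/cobra"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// Options stands in for the project's options.Options.
type Options struct {
	MetricsAddr string
	LeaderElect bool
}

func (o *Options) Complete() error   { return nil } // fill in defaults
func (o *Options) Validate() []error { return nil } // sanity-check flag values

// NewManagerCommand mirrors the command setup described above.
func NewManagerCommand(ctx context.Context) *cobra.Command {
	opts := &Options{}
	cmd := &cobra.Command{
		Use: "craned",
		Run: func(cmd *cobra.Command, args []string) {
			if err := opts.Complete(); err != nil {
				fmt.Fprintln(os.Stderr, err)
				os.Exit(1)
			}
			if errs := opts.Validate(); len(errs) > 0 {
				fmt.Fprintln(os.Stderr, errs)
				os.Exit(1)
			}
			if err := Run(ctx, opts); err != nil {
				fmt.Fprintln(os.Stderr, err)
				os.Exit(1)
			}
		},
	}
	cmd.Flags().StringVar(&opts.MetricsAddr, "metrics-bind-address", ":8080", "metrics endpoint")
	cmd.Flags().BoolVar(&opts.LeaderElect, "leader-elect", false, "enable leader election")
	return cmd
}

// Run builds the controller manager, adds health checks, and starts everything.
func Run(ctx context.Context, opts *Options) error {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:     opts.LeaderElect,
		MetricsBindAddress: opts.MetricsAddr, // field name per older controller-runtime versions
	})
	if err != nil {
		return err
	}
	if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
		return err
	}
	if err := mgr.AddReadyzCheck("ping", healthz.Ping); err != nil {
		return err
	}
	// craned additionally wires data sources, the predictor manager, webhooks,
	// the Pod OOM recorder, controllers, and the metric collector here.
	return mgr.Start(ctx) // runAll ultimately blocks on the manager (and predictor manager)
}

func main() {
	ctx := ctrl.SetupSignalHandler()
	if err := NewManagerCommand(ctx).Execute(); err != nil {
		os.Exit(1)
	}
}
```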

Controllers

PodOOMRecorder

  1. SetupWithManager:
    • SetupWithManager(mgr ctrl.Manager) error:
      • This function is called to register the PodOOMRecorder with the controller manager
      • It sets up the reconciler to watch for pod events and trigger the Reconcile function
  2. Reconcile Function:
    • Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error):
      • Triggered by the controller manager for each pod event
      • Retrieves the pod specified in the request
      • Checks if the pod was terminated due to an OOM event using IsOOMKilled(pod)
      • If OOMKilled, iterates over container statuses to find the terminated container
      • For each OOMKilled container, checks if memory requests are set
      • Increments the OOM count metric and adds an OOMRecord to the queue (see the sketch after this list)
  3. Run Function:
    • Run(stopCh <-chan struct{}) error:
      • Continuously processes the OOM records from the queue
      • Uses a ticker to periodically clean old OOM records to prevent exceeding ConfigMap storage limits
      • Retrieves OOM records using GetOOMRecord
      • Updates OOM records with updateOOMRecord
  4. GetOOMRecord Function:
    • GetOOMRecord() ([]OOMRecord, error):
      • Retrieves OOM records from the cache or ConfigMap
      • If the cache is nil, it fetches the ConfigMap and unmarshals the OOM records
  5. cleanOOMRecords Function:
    • cleanOOMRecords(oomRecords []OOMRecord) []OOMRecord:
      • Cleans up old OOM records to maintain the maximum number specified by OOMRecordMaxNumber
  6. updateOOMRecord Function:
    • updateOOMRecord(oomRecord OOMRecord, saved []OOMRecord) error:
      • Updates the OOM records in the cache or storage with new records
  7. IsOOMKilled Function:
    • IsOOMKilled(pod *v1.Pod) bool:
      • Helper function to determine if a pod was terminated due to an OOM event
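
The Reconcile path in step 2 can be rendered roughly as below. Everything is condensed: OOMRecord, the Queue field, and the package name are stand-ins, and metric counting is omitted.

```go
// Simplified sketch of the recorder's reconcile path.
package oom

import (
	"context"
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// OOMRecord is a simplified version of the record the recorder persists.
type OOMRecord struct {
	Namespace, Pod, Container string
	Memory                    resource.Quantity // memory request at the time of the OOM
	OOMAt                     metav1.Time
}

// PodOOMRecorder is reduced to the fields these sketches need.
type PodOOMRecorder struct {
	client.Client
	Queue workqueue.Interface // pending records (assumed work queue), drained by Run

	mu    sync.Mutex // guards cache (used by the GetOOMRecord sketch below)
	cache []OOMRecord
}

// Reconcile follows step 2: fetch the pod, look for OOM-killed containers that
// declare a memory request, and enqueue a record for each.
func (r *PodOOMRecorder) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	pod := &v1.Pod{}
	if err := r.Get(ctx, req.NamespacedName, pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err) // pod may already be gone
	}
	// The real code short-circuits via IsOOMKilled(pod); the same per-container
	// check is inlined here.
	for _, cs := range pod.Status.ContainerStatuses {
		t := cs.LastTerminationState.Terminated
		if cs.RestartCount == 0 || t == nil || t.Reason != "OOMKilled" {
			continue
		}
		for _, c := range pod.Spec.Containers {
			if c.Name != cs.Name {
				continue
			}
			// Only containers with a memory request are recorded, so later
			// recommendations have a baseline to adjust.
			if mem, ok := c.Resources.Requests[v1.ResourceMemory]; ok {
				r.Queue.Add(OOMRecord{
					Namespace: pod.Namespace,
					Pod:       pod.Name,
					Container: cs.Name,
					Memory:    mem,
					OOMAt:     t.FinishedAt,
				})
			}
		}
	}
	return ctrl.Result{}, nil
}
```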

Detailed Flow

  • Initialization: The PodOOMRecorder is set up with the manager using SetupWithManager, which registers it to watch for pod events
  • Event Handling: For each pod event, Reconcile is called to check for OOM events and record them
  • Record Management: The Run function processes the queue of OOM records, periodically cleans old records, and updates the storage
  • Data Retrieval: GetOOMRecord fetches the current OOM records, either from the cache or the ConfigMap

This detailed explanation provides a comprehensive view of the PodOOMRecorder’s functionality and its role in tracking OOM events in a Kubernetes cluster


GetOOMRecord Function

  1. Mutex Locking:
    • The function begins by acquiring a lock using r.mu.Lock() to ensure thread-safe access to the cache. It releases the lock with defer r.mu.Unlock()
  2. Cache Check:
    • It checks if the cache is nil. If the cache is not initialized, it proceeds to fetch the OOM records from a Kubernetes ConfigMap
  3. ConfigMap Retrieval:
    • Uses the Kubernetes client (r.Client) to get the ConfigMap that stores OOM records
    • The ConfigMap is identified by its namespace (known.CraneSystemNamespace) and name (ConfigMapOOMRecordName)
  4. Error Handling:
    • If the ConfigMap is not found (apierrors.IsNotFound(err)), it returns nil, indicating no records are available
    • For other errors, it returns the error to the caller
  5. Data Unmarshalling:
    • If the ConfigMap is found, it retrieves the data field (ConfigMapDataOOMRecord) containing the OOM records
    • Uses json.Unmarshal to convert the JSON data into a slice of OOMRecord structs
  6. Return Records:
    • Returns the unmarshalled OOM records or the cached records if the cache is already initialized
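
Continuing the simplified PodOOMRecorder from the Reconcile sketch above (which already declared the mu and cache fields), the cache-or-ConfigMap read looks roughly like this. The constant values are illustrative stand-ins for known.CraneSystemNamespace, ConfigMapOOMRecordName, and ConfigMapDataOOMRecord; the extra imports needed are "encoding/json" and apierrors "k8s.io/apimachinery/pkg/api/errors":

```go
// Stand-ins for the project's known.CraneSystemNamespace, ConfigMapOOMRecordName,
// and ConfigMapDataOOMRecord constants (values here are illustrative).
const (
	craneSystemNamespace   = "crane-system"
	configMapOOMRecordName = "oom-record"
	configMapDataOOMRecord = "oom-records"
)

// GetOOMRecord serves from the in-memory cache once populated and falls back
// to the ConfigMap otherwise, following the steps above.
func (r *PodOOMRecorder) GetOOMRecord() ([]OOMRecord, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.cache != nil {
		return r.cache, nil // cache hit: no API call needed
	}

	var cm v1.ConfigMap
	key := client.ObjectKey{Namespace: craneSystemNamespace, Name: configMapOOMRecordName}
	if err := r.Get(context.TODO(), key, &cm); err != nil {
		if apierrors.IsNotFound(err) {
			return nil, nil // nothing persisted yet
		}
		return nil, err
	}

	var records []OOMRecord
	if err := json.Unmarshal([]byte(cm.Data[configMapDataOOMRecord]), &records); err != nil {
		return nil, err
	}
	return records, nil
}
```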

Kubernetes Components and Principles

  • Kubernetes Client:
    • The function uses the client.Client interface from the controller-runtime library to interact with the Kubernetes API server
    • This client provides methods to perform CRUD operations on Kubernetes resources, such as ConfigMaps
  • ConfigMap:
    • A ConfigMap is a Kubernetes resource used to store non-confidential data in key-value pairs
    • In this context, it is used to persist OOM records across pod restarts or application restarts
  • Concurrency:
    • The use of a mutex (sync.Mutex) ensures that access to shared resources (like the cache) is synchronized, preventing race conditions
  • Error Handling:
    • The function handles errors gracefully, distinguishing between not-found errors and other types of errors to provide appropriate responses

This detailed breakdown explains how the GetOOMRecord function retrieves OOM records using Kubernetes components and concurrency principles


IsOOMKilled Function

  1. Function Purpose:
    • The IsOOMKilled function checks if any container within a given pod was terminated due to an OOM event
  2. Iterating Over Container Statuses:
    • The function iterates over the ContainerStatuses slice in the pod’s status. This slice contains the status of each container in the pod
  3. Checking Conditions:
    • For each containerStatus, it checks the following conditions:
      • Restart Count: The RestartCount must be greater than 0, indicating that the container has been restarted
      • Termination State: The LastTerminationState.Terminated must not be nil, meaning the container has a recorded termination state
      • Termination Reason: The Reason for termination must be "OOMKilled", indicating that the container was terminated due to an OOM event
  4. Return Value:
    • If all the above conditions are met for any container, the function returns true, indicating that the pod was OOM killed
    • If none of the containers meet these conditions, the function returns false
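
This is the helper the Reconcile sketch above alluded to; the whole check is a direct rendering of the conditions listed (v1 is k8s.io/api/core/v1):

```go
// IsOOMKilled reports whether any container in the pod has restarted and was
// last terminated with the "OOMKilled" reason.
func IsOOMKilled(pod *v1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		t := cs.LastTerminationState.Terminated
		if cs.RestartCount > 0 && t != nil && t.Reason == "OOMKilled" {
			return true
		}
	}
	return false
}
```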

Kubernetes Components and Principles

  • Pod Status:
    • The function uses the v1.Pod type from the Kubernetes API, which includes the status of the pod and its containers
    • The ContainerStatuses field provides detailed information about each container’s state, including termination details
  • OOMKilled Reason:
    • The "OOMKilled" reason is a standard Kubernetes termination reason indicating that a container was terminated due to exceeding its memory limit

This explanation provides a clear understanding of how the IsOOMKilled function operates, using Kubernetes pod status information to determine if a pod was terminated due to an OOM event


MetricCollector

CraneMetricCollector Structure

  1. Purpose:
    • The CraneMetricCollector is responsible for collecting and exposing metrics related to autoscaling and predictions in a Kubernetes environment
  2. Fields:
    • Client: A Kubernetes client for interacting with the cluster
    • Scaler: A scale.ScalesGetter for retrieving scale subresources
    • RESTMapper: A meta.RESTMapper for mapping resources to their REST paths
    • Prometheus Descriptors:
      • metricAutoScalingCron: Describes external metrics for cron-based HPA
      • metricAutoScalingPrediction: Describes external metrics for prediction-based HPA
      • metricPredictionTsp: Describes model metrics for time series prediction (TSP)
      • metricMetricRuleError: Describes metrics related to rule errors
  3. PredictionMetric Structure:
    • Represents a prediction metric with fields for description, target details, algorithm, metric value, and timestamp
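
Put together, the collector’s shape looks roughly like this. Field names are taken from the descriptor list above; the exact types and additional fields in the real struct may differ:

```go
// Simplified view of the collector and the flattened prediction sample.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/scale"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// CraneMetricCollector, reduced to the fields named above.
type CraneMetricCollector struct {
	client.Client
	scaleClient scale.ScalesGetter
	restMapper  meta.RESTMapper

	metricAutoScalingCron       *prometheus.Desc // external metric for cron-based HPA
	metricAutoScalingPrediction *prometheus.Desc // external metric for prediction-based HPA
	metricPredictionTsp         *prometheus.Desc // model metric for TimeSeriesPrediction
	metricMetricRuleError       *prometheus.Desc // metric-rule error series
}

// PredictionMetric is a flattened sample produced from a TimeSeriesPrediction
// status; the field names here are illustrative.
type PredictionMetric struct {
	Desc            *prometheus.Desc
	TargetKind      string
	TargetNamespace string
	TargetName      string
	Algorithm       string
	MetricValue     float64
	Timestamp       time.Time
}
```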

Functions

  1. NewCraneMetricCollector:
    • Initializes a new CraneMetricCollector with the provided Kubernetes client, scale client, and REST mapper
  2. Describe:
    • Implements the prometheus.Collector interface
    • Sends the descriptors of each metric to the provided channel
  3. Collect:
    • Implements the prometheus.Collector interface
    • Collects metrics and sends them to the provided channel
    • Calls helper functions to gather specific metrics:
      • getMetricsTsp: Collects metrics related to time series predictions
      • getMetricsCron: Collects metrics related to cron-based autoscaling
      • getMetricsRuleError: Collects metrics related to rule errors
  4. getMetricsTsp:
    • Retrieves metrics from TimeSeriesPrediction resources
    • Converts them into PredictionMetric objects
  5. getMetricsCron:
    • Retrieves metrics from EffectiveHorizontalPodAutoscaler resources
    • Converts them into Prometheus metrics
  6. getMetricsRuleError:
    • Collects metrics related to errors in metric rules
  7. computePredictionMetric:
    • Computes prediction metrics from TimeSeriesPrediction resources
    • Maps prediction metrics to their statuses
  8. Utility Functions:
    • AggregateSignalKey: Aggregates signal keys from labels
    • MetricContains: Checks if a prediction metric is contained within a list

Prometheus Integration

  • The CraneMetricCollector integrates with Prometheus by implementing the prometheus.Collector interface
  • It defines and collects custom metrics, which are exposed to Prometheus for monitoring and alerting
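
Continuing the simplified struct from the previous sketch, the Collector contract amounts to two methods. getMetricsTsp is stubbed with a hypothetical simplified signature, and the real Collect also folds in the cron and rule-error helpers:

```go
// getMetricsTsp stands in for the real helper, which walks TimeSeriesPrediction
// statuses; the signature here is a simplified assumption.
func (c *CraneMetricCollector) getMetricsTsp() []PredictionMetric { return nil }

// Describe streams the static descriptors so Prometheus knows the series' shape.
func (c *CraneMetricCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.metricAutoScalingCron
	ch <- c.metricAutoScalingPrediction
	ch <- c.metricPredictionTsp
	ch <- c.metricMetricRuleError
}

// Collect is called on every scrape; it converts fresh CRD statuses into samples.
func (c *CraneMetricCollector) Collect(ch chan<- prometheus.Metric) {
	for _, pm := range c.getMetricsTsp() {
		ch <- prometheus.MustNewConstMetric(
			pm.Desc, prometheus.GaugeValue, pm.MetricValue,
			// label values must match the variable labels declared on pm.Desc
			pm.TargetNamespace, pm.TargetName, pm.Algorithm,
		)
	}
	// getMetricsCron and getMetricsRuleError feed the channel the same way.
}
```

Registering the collector, for example with prometheus.MustRegister or controller-runtime’s metrics registry, is what ultimately exposes these series on the metrics endpoint.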

This detailed breakdown explains the role and functionality of the CraneMetricCollector in collecting and exposing metrics for autoscaling and predictions


RecommenderManager

  1. Purpose:
    • The RecommenderManager interface is responsible for managing and providing access to different recommenders based on their names and rules
  2. Methods:
    • GetRecommender: Returns a registered recommender by its name
    • GetRecommenderWithRule: Returns a registered recommender with its configuration merged with a given RecommendationRule
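
In code the interface is small. The Recommender and RecommendationRule types come from the project’s recommender and analysis API packages, and the signatures below are approximate:

```go
// Approximate shape of the interface described above.
type RecommenderManager interface {
	// GetRecommender returns a registered recommender by name.
	GetRecommender(recommenderName string) (recommender.Recommender, error)
	// GetRecommenderWithRule returns that recommender with its default
	// configuration merged with the given RecommendationRule.
	GetRecommenderWithRule(recommenderName string, rule analysisv1alpha1.RecommendationRule) (recommender.Recommender, error)
}
```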

Manager Implementation

  1. NewRecommenderManager Function:
    • Initializes a new RecommenderManager with a given recommendation configuration file
    • Calls loadConfigFile to load recommender configurations
    • Starts a goroutine to watch the configuration file for changes using watchConfigFile
  2. Manager Struct:
    • Fields:
      • recommendationConfiguration: Path to the configuration file
      • lock: A mutex for synchronizing access to recommenderConfigs
      • recommenderConfigs: A map storing configurations for each recommender
  3. GetRecommender Method:
    • Calls GetRecommenderWithRule with an empty RecommendationRule to retrieve a recommender by name
  4. GetRecommenderWithRule Method:
    • Locks the recommenderConfigs map to ensure thread-safe access
    • Checks if the requested recommender exists in the recommenderConfigs
    • If found, calls recommender.GetRecommenderProvider to get the recommender with the merged configuration
    • Returns an error if the recommender name is unknown
  5. watchConfigFile Method:
    • Uses fsnotify.NewWatcher to monitor the configuration file for changes
    • Reloads the configuration when changes are detected
  6. loadConfigFile Method:
    • Loads recommender configurations from the specified file
    • Parses the file and populates the recommenderConfigs map
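
A sketch of the hot-reload half of this (steps 5 and 6), with the config parsing stubbed out by a hypothetical parseConfigFile helper and error handling trimmed:

```go
// Sketch only: the real manager also merges per-rule overrides in
// GetRecommenderWithRule; this focuses on watching and reloading the file.
package recommendation

import (
	"os"
	"sync"

	"github.com/fsnotify/fsnotify"
	"k8s.io/klog/v2"
)

// Config is a stand-in for the per-recommender configuration entries.
type Config struct{ /* recommender name, plugin settings, ... */ }

type manager struct {
	recommendationConfiguration string // path to the configuration file

	lock               sync.Mutex
	recommenderConfigs map[string]Config
}

// watchConfigFile reloads the configuration whenever the file changes.
func (m *manager) watchConfigFile() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		klog.ErrorS(err, "Failed to create fsnotify watcher")
		return
	}
	defer watcher.Close()
	if err := watcher.Add(m.recommendationConfiguration); err != nil {
		klog.ErrorS(err, "Failed to watch recommendation configuration")
		return
	}
	for {
		select {
		case event := <-watcher.Events:
			// Editors and ConfigMap mounts often replace the file, so react to
			// both write and create events.
			if event.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				if err := m.loadConfigFile(); err != nil {
					klog.ErrorS(err, "Failed to reload recommender configuration")
				}
			}
		case err := <-watcher.Errors:
			klog.ErrorS(err, "Watcher error")
		}
	}
}

// loadConfigFile parses the file and swaps the config map under the lock.
func (m *manager) loadConfigFile() error {
	configs, err := parseConfigFile(m.recommendationConfiguration)
	if err != nil {
		return err
	}
	m.lock.Lock()
	defer m.lock.Unlock()
	m.recommenderConfigs = configs
	return nil
}

// parseConfigFile is a hypothetical parser; the real code unmarshals a YAML
// recommender configuration into per-recommender entries.
func parseConfigFile(path string) (map[string]Config, error) {
	if _, err := os.ReadFile(path); err != nil {
		return nil, err
	}
	return map[string]Config{}, nil
}
```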

Kubernetes Components and Principles

  • Concurrency:
    • Uses a mutex (sync.Mutex) to ensure thread-safe access to shared resources like recommenderConfigs
  • File Watching:
    • Utilizes fsnotify to watch the configuration file for changes, allowing dynamic updates to recommender configurations
  • Configuration Management:
    • Manages recommender configurations through a centralized configuration file, enabling easy updates and management

This detailed breakdown explains the role and functionality of the RecommenderManager in managing and providing access to recommenders based on configurations and rules


Recommendation Process Flow

  1. Data Collection:
    • Prometheus Integration:
      • The system collects metrics from Prometheus, which serves as a data source for historical and real-time metrics
      • The providers.History interface is used to fetch historical data from Prometheus
  2. Recommendation Rule Controller:
    • Reconcile Function:
      • The RecommendationRuleController watches for RecommendationRule resources
      • It retrieves the RecommendationRule and checks if it needs to be processed based on its RunInterval and LastUpdateTime
      • If the rule is due for execution, it calls doReconcile
  3. Resource Identification:
    • getIdentities Function:
      • Identifies target resources based on the ResourceSelectors in the RecommendationRule
      • Uses dynamic client and listers to fetch resources matching the selectors
  4. Recommendation Execution:
    • doReconcile Function:
      • Prepares a list of ObjectIdentity for each target resource
      • Determines if a new recommendation round is needed based on the RunInterval
      • Executes recommendations concurrently for identified resources using executeIdentity
  5. Recommender Manager:
    • GetRecommenderWithRule Function:
      • Retrieves the appropriate recommender based on the RecommendationRule
      • Merges the rule’s configuration with the recommender’s default configuration
  6. Recommendation Calculation:
    • executeIdentity Function:
      • Creates or updates a Recommendation object for each target resource
      • Calls the Run method of the recommender, passing a RecommendationContext
      • The recommender uses the context to fetch data from Prometheus and perform calculations
  7. Writing to CRD:
    • Recommendation Object:
      • The calculated recommendation is stored in a Recommendation CRD
      • The executeIdentity function updates the Recommendation object with the computed values and status
      • Uses the Kubernetes client to create or update the Recommendation CRD in the cluster
  8. Status Update:
    • updateRecommendationRuleStatus Function:
      • Updates the status of the RecommendationRule with the latest execution results
      • Ensures that the LastUpdateTime and other status fields are correctly set
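
Condensed into one function, the flow reads roughly as follows; all types and helpers are simplified stand-ins for the controller’s real ones:

```go
// Condensed sketch of a recommendation round, from interval check to status update.
package recommendation

import (
	"context"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/klog/v2"
)

// Stand-in types: the real controller works with the RecommendationRule CRD
// and one ObjectIdentity per selected workload.
type Rule struct {
	Name           string
	RunInterval    time.Duration
	LastUpdateTime *metav1.Time
}
type ObjectIdentity struct{ Namespace, Kind, Name string }

type RecommendationRuleController struct{}

func (c *RecommendationRuleController) getIdentities(ctx context.Context, rule *Rule) ([]ObjectIdentity, error) {
	return nil, nil // real code resolves ResourceSelectors via the dynamic client and listers
}
func (c *RecommendationRuleController) executeIdentity(ctx context.Context, rule *Rule, id ObjectIdentity) {
	// real code fetches metrics (e.g. from Prometheus), runs the recommender,
	// and creates or updates the Recommendation CRD
}
func (c *RecommendationRuleController) updateRecommendationRuleStatus(ctx context.Context, rule *Rule) {}

// doReconcile condenses the flow described above.
func (c *RecommendationRuleController) doReconcile(ctx context.Context, rule *Rule) {
	// 1. Skip if the last round is still within RunInterval.
	if rule.LastUpdateTime != nil && time.Since(rule.LastUpdateTime.Time) < rule.RunInterval {
		return
	}
	// 2. Resolve target workloads from the rule's ResourceSelectors.
	identities, err := c.getIdentities(ctx, rule)
	if err != nil {
		klog.ErrorS(err, "Failed to resolve identities", "rule", rule.Name)
		return
	}
	// 3. Execute recommendations concurrently, one goroutine per target.
	var wg sync.WaitGroup
	for _, id := range identities {
		wg.Add(1)
		go func(id ObjectIdentity) {
			defer wg.Done()
			c.executeIdentity(ctx, rule, id)
		}(id)
	}
	wg.Wait()
	// 4. Record LastUpdateTime and per-target results in the rule status.
	c.updateRecommendationRuleStatus(ctx, rule)
}
```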

Kubernetes Components and Principles

  • Dynamic Client:
    • Used to interact with Kubernetes resources dynamically, allowing the system to handle various resource types
  • Concurrency:
    • Recommendations are executed concurrently for multiple resources to improve efficiency
  • CRD Management:
    • Custom Resource Definitions (CRDs) are used to store recommendations, providing a structured way to manage and access recommendation data
  • Prometheus Data:
    • Prometheus serves as a key data source, providing the metrics needed for making informed recommendations

This detailed breakdown explains the end-to-end process of generating recommendations from Prometheus data and writing them into CRDs


Prometheus Data Retrieval Process

  1. Provider Initialization:
    • NewProvider Function:
      • The process begins with the initialization of a Prometheus data provider using the NewProvider function
      • This function takes a PromConfig as input, which contains configuration details like the Prometheus server address and authentication information
      • It calls NewPrometheusClient to create a Prometheus client
  2. Prometheus Client Creation:
    • NewPrometheusClient Function:
      • This function sets up a Prometheus client using the provided configuration
      • It configures HTTP transport settings, including TLS and proxy settings
      • If rate limiting is enabled, it creates a prometheusRateLimitClient; otherwise, it creates a prometheusAuthClient
  3. Query Execution:
    • QueryTimeSeries Function:
      • This function is responsible for querying time series data from Prometheus
      • It constructs a PromQL query using a MetricNamer, which provides a query builder for the specific metric source
      • The query is built using the BuildQuery method of the query builder
  4. Context and Timeout Management:
    • A context with a timeout is created using gocontext.WithTimeout to ensure the query does not run indefinitely
    • The context is used to execute the query, allowing for cancellation if the timeout is reached
  5. Query Range Execution:
    • QueryRangeSync Method:
      • The QueryRangeSync method of the context struct is called to execute the query over a specified time range
      • It sends the query to the Prometheus server and retrieves the time series data
  6. Data Processing:
    • The retrieved data is processed into a slice of common.TimeSeries, which represents the time series data in a structured format
  7. Error Handling:
    • Throughout the process, errors are logged using klog to provide detailed information about any issues encountered during query execution
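
The same query-with-timeout pattern, written against the plain prometheus/client_golang API instead of the project’s wrapped client (which adds auth and rate limiting on top); the PromQL string and server address are illustrative:

```go
// Range query against Prometheus with a bounded context, mirroring the flow above.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Plain Prometheus client; the project layers auth and rate limiting on top.
	client, err := api.NewClient(api.Config{Address: "http://prometheus-server:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Bound the query with a timeout, as gocontext.WithTimeout does in the text.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// A PromQL query of the kind a MetricNamer's query builder would produce.
	query := `sum(rate(container_cpu_usage_seconds_total{namespace="default"}[3m]))`
	end := time.Now()
	value, warnings, err := promAPI.QueryRange(ctx, query, promv1.Range{
		Start: end.Add(-1 * time.Hour),
		End:   end,
		Step:  time.Minute,
	})
	if err != nil {
		panic(err) // the project logs such failures with klog instead
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// The returned matrix is what gets reshaped into []common.TimeSeries.
	fmt.Println(value)
}
```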

Key Components and Principles

  • Prometheus Client:
    • The Prometheus client is responsible for sending HTTP requests to the Prometheus server and handling responses
    • It supports authentication and rate limiting to manage query load
  • PromQL:
    • PromQL is the query language used to specify the metrics and time range for data retrieval
    • The system constructs PromQL queries dynamically based on the metrics needed for analysis
  • Concurrency and Rate Limiting:
    • The system uses rate limiting to control the number of concurrent queries sent to Prometheus, preventing overload
    • The prometheusRateLimitClient manages in-flight requests and blocks new requests if the limit is reached

This detailed breakdown explains the process of retrieving data from Prometheus, highlighting the function calls and components involved in the workflow