Abstract
This thesis focuses on optimizing access to the massive datasets generated or used in the management of intelligent electrical distribution networks, or smart grids.
These data (raw measurements, refined data, historical data, etc.) are in practice represented in varied data models (relational, key-value, document, graph, etc.) and stored in highly heterogeneous Big Data systems. These systems differ in their functionality (for example, some cannot perform a join), their data structures (used for storage and indexing), their algorithms, and their performance.
The objective of this thesis is to propose an optimization approach for processing workflows on these datasets. The optimization we propose is a recommendation for placing the data on the most appropriate systems in order to minimize the total execution time of the workflow. It is based on metadata describing the datasets, the workflows, and the storage and processing systems. Total execution time is composed of the data transformation and movement time and the execution time of the queries rewritten according to those transformations. In this work, we also explore the possibility of moving data from one system to another when the target system offers characteristics that favor the execution of the workflow queries.
In existing work, execution time in data processing systems is commonly estimated using cost models. However, our study of the techniques used in Big Data processing systems and in Big Data integration and mediation systems convinced us that this approach cannot be used to estimate the execution time. A promising alternative is to use machine learning techniques.
We therefore propose an approach called DWS (for Data, Workloads and Systems), which explores the different combinations of systems that can execute a workflow.
This approach eliminates solutions in which the systems cannot execute all the operators of a query (feasibility condition) or which do not respect the business rules for storing the initial, intermediate or final datasets (conformity condition). The final step consists of selecting the combination of systems that minimizes the execution time of the workload. The execution time of the various queries (data transformation queries or queries extracted from the workflow) is estimated by injecting statistics into the systems in order to simulate the execution and thus retrieve the optimal plans and, where possible, the query costs estimated by the system. The costs obtained in this step are combined with other metadata about the datasets, the workloads and the systems in a prediction model capable of estimating the execution time.
To do this, we present in this document (i) a unified metadata model covering data (sizes, distribution of values, availability, location, schemas, statistics, etc.), workloads (queries, operators, applications, statistics, etc.) and systems (APIs, data models, distribution models, storage models, etc.); (ii) an architecture and algorithms to support data placement recommendation for workloads; and (iii) the results of the experiments carried out using our prototype.
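The selection procedure described above (enumerate combinations of systems, discard those that fail the feasibility or conformity condition, keep the one with the lowest estimated time) can be illustrated with a minimal Python sketch. It is not the thesis prototype: the system names, datasets, cost figures and function names are all hypothetical, and a fixed per-operator cost table stands in for the learned prediction model.

```python
from itertools import product

# Hypothetical metadata: operators each system supports, with an assumed
# per-operator time. In DWS this estimate would come from a prediction model.
SYSTEMS = {
    "relational": {"scan": 4.0, "filter": 2.0, "join": 5.0, "aggregate": 3.0},
    "key_value": {"scan": 1.0, "filter": 1.0},
    "document": {"scan": 2.0, "filter": 1.5, "aggregate": 2.0},
}

# Hypothetical business rules: systems allowed to store each dataset.
ALLOWED = {
    "raw_measurements": {"relational", "key_value", "document"},
    "customer_history": {"relational", "document"},
}

# Hypothetical initial location of each dataset and a flat cost for moving it.
INITIAL = {"raw_measurements": "key_value", "customer_history": "relational"}
MOVE_COST = 3.0

# Hypothetical workflow: each query reads one dataset and needs some operators.
WORKFLOW = [
    {"dataset": "raw_measurements", "operators": {"scan", "filter", "aggregate"}},
    {"dataset": "customer_history", "operators": {"scan", "join"}},
]


def is_feasible(placement):
    """Feasibility: the hosting system supports every operator of every query."""
    return all(
        q["operators"] <= set(SYSTEMS[placement[q["dataset"]]]) for q in WORKFLOW
    )


def is_conformant(placement):
    """Conformity: every dataset is stored on a system the rules allow."""
    return all(system in ALLOWED[dataset] for dataset, system in placement.items())


def estimate_time(placement):
    """Query time plus movement time; only valid for feasible placements."""
    query_time = sum(
        SYSTEMS[placement[q["dataset"]]][op]
        for q in WORKFLOW
        for op in q["operators"]
    )
    move_time = sum(
        MOVE_COST for dataset, system in placement.items()
        if system != INITIAL[dataset]
    )
    return query_time + move_time


def recommend_placement():
    """Explore all dataset-to-system combinations and keep the cheapest valid one."""
    datasets = sorted(ALLOWED)
    candidates = (
        dict(zip(datasets, combo))
        for combo in product(SYSTEMS, repeat=len(datasets))
    )
    valid = [p for p in candidates if is_feasible(p) and is_conformant(p)]
    return min(valid, key=estimate_time) if valid else None


best = recommend_placement()
print(best, estimate_time(best))
```

With these made-up figures, two placements survive both conditions, and the cheaper one moves the raw measurements to the document store while leaving the customer history in the relational system. Exhaustive enumeration is used here only for clarity; it grows exponentially with the number of datasets.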
Source: http://www.theses.fr/2021GRALM076