Add a schema registry#264
Conversation
…indings packages.
@JeanLucPons this is a draft, so we have to give Teresia a chance to find solutions to the problems you flagged here. I agree with you that we should not run in circles reimplementing things that already work. However, in the case of configuration we were explicitly given feedback that it needs improvement and some rework is possible (already lots of new features like catalogs, regular expressions in the config file, etc.). It is the only part of the project that was flagged as a bit too complex after the workshop. So, we have to suffer a bit. This PR has some potentially nice features and some annoying things too. However, when it's a draft it's too hard to review because nothing works and there are too many changes. I remind you that with this PR @TeresiaOlsson is somewhat learning how everything is structured right now, which can be somewhat challenging. But afterwards it should be easier to contribute to different parts of the project. @TeresiaOlsson In general I like the idea of Jean Luc that some big PR should start small by changing for example only a part of the config (single element). This does make the review a bit more straighforward. In the case of configuration it's very hard to do (and too late) but maybe for some other proposals it would make sense. @JeanLucPons @TeresiaOlsson @gupichon @gupichon-soleil Maybe I'll ask the steering committee to put a (reasonable) deadline for freezing configuration (no large reworks, only improvements). That way it will be clear for everyone that we have to converge on a working solution in a reasonable time. What do you think? I can also give you the deadline myself :) Let's say the week of July 13th we freeze. I would say that we should also ask the steering committee that the next community meeting (September?) should be an online-hackathon where people should try the code and give their feedback. |
|
I quickly tested the creation of a dynamic model with arbitrary type and it worked. from pydantic import create_model, ConfigDict
import inspect
import json
class MyClass1:
def __init__(self,param1: str,param2: int):
...
class MyClass2:
def __init__(self,param1:str,param2:MyClass1):
...
def register_schema(cls):
# Inspect constructor signature and create dyammic schemas
# for both validation and json schema export
sig = inspect.signature(cls.__init__)
fields = {}
sch_fields = {}
for param_name, param in sig.parameters.items():
# Jump self
if param_name == 'self':
continue
# Default value,
if param.default != inspect.Parameter.empty:
default = param.default
else:
default = ... # No default value specified -> Required field
# Type
annotation = param.annotation if param.annotation != inspect.Parameter.empty else str
sch_fields[param_name] = (annotation, default)
fields[param_name] = (annotation, default)
if annotation is MyClass1:
# Arbitrary type example
# Replace by its ConfigModel (for json schema export)
annotation = MyClass1._model
sch_fields[param_name] = (annotation, default)
# Create dynamic model for the class
model_name = cls.__module__ + "." + cls.__name__
cls._model = create_model(model_name, __config__=ConfigDict(arbitrary_types_allowed=True), **fields)
cls._schemamodel = create_model(model_name+"Schema", __config__=ConfigDict(arbitrary_types_allowed=True), **sch_fields)
# Override __init__ to handle validation
original_init = cls.__init__
@classmethod
def custom_init(cls, *args, **kwargs):
# Validate using dynamic model
sig = inspect.signature(original_init)
cls._model.model_validate(kwargs)
original_init(cls, *args, **kwargs)
cls.__init__ = custom_init
return cls
register_schema(MyClass1)
register_schema(MyClass2)
mc1 = MyClass1(param1="test1",param2=10)
mc2 = MyClass2(param1="test2",param2=mc1)
top_level_json = MyClass2._schemamodel.model_json_schema()
with open('sch.json', 'w') as fp:
json.dump(top_level_json, fp) |
|
Personally I don't see how we can freeze for refactoring. Part of the culture of open source is that everyone is allowed to make PRs about everything at any time if they have a suggestion. It doesn't mean the PR has to be accepted and merged in. But if you want the vague "before the summer vacations" deadline the steering committee has given the maintainers for simplification of configuration to mean the week of the 13th July that sounds like a good idea to make the deadline concrete. For the community meetings the steering committee currently isn't so involved in it. They only approve a date that I set and then I come up with the agenda myself. If you think it would be a good idea to make it into a hackathon we can do that. The only thing I have lined up for a future meeting is a presentation about osprey but that doesn't have to be on the September meeting. I agree that making smaller changes to merge things in in steps (like you are doing for the catalog) would be preferable. I will try to do that. Only issue is that we need to rewrite the tests since many of them currently relies on loading a configuration file. I think we should try to make unit tests which rely on as few other parts of the code as possible so parts can be tested individually. And then we can have integration tests which tests things together. I don't see the need to test loading the configuration in many tests. |
There are no issues with PRs and changes for me. It's only "freezing" to keep us on track so we can meet all the milestones. I think the PR is timed well because it is now that we need to improve the configuration. |
|
I updated my code on the above post. I did 2 schema to handle validation and json schema. |
So the approach is to not have the I like this approach better than what I tested in #273 because that suddenly mixed validation and object creation and allowed pydantic to resolve nested models by itself and start to act like a factory. This approach doesn't seem to do that since you need to first create an object of |
|
I think the difference compared to adding |
|
I'm not sure that overriding the constructor is a good idea. Model validation should not be done inside the class constructor, as you may want to validate data before the actual instantiation. This validation could be shared between the configuration validation layer and the constructor (as it might be used directly). Maybe in a class method with a decorator? |
|
I agree. But I think overriding the constructor is not needed. We only need that part if we want the class to validate automatically when an object is created. If not, we can keep the validation in a separate step using the SchemaValidator. That only requires the schema to be validated against to be registered in the SchemaRegistry. |
Yes, that's why I thinking about two decorators: one for registration and another for validation. If the second one is absent, it means no validation, which is a good first-step approach. |
|
Personally I would prefer to have the validation enabled by default for both "on the fly" construction and when loading. |
|
For configuration param getter/setter I imagine something like below. class ConfigReadWriteStringScalar:
def __init__(self,parent: object,parameter:str):
self.parent = parent
self.parameter = parameter
def set(self,value:str):
setattr(self.parent,self.parameter,value)
def get(self):
return getattr(self.parent,self.parameter)
def assign_param(object,frame):
ai:inspect.ArgInfo = inspect.getargvalues(frame)
for argname,value in ai.locals.items():
if argname != "self":
setattr(object,"_pyaml_"+argname,value) # We can define a syntax usable for repr as well
class MyClass1:
def __init__(self,param1: str,param2: int):
# Assign parameter to local fields
# To use this helper function, the constructor cannot be overloaded as above (lost of signature)
assign_param(self,inspect.currentframe())
# Follow the pyaml coding style and code completion is OK
@property
def param1(self) -> ConfigReadWriteStringScalar:
return ConfigReadWriteStringScalar(self,"_pyaml_param1") |
What do you think about separating it into two different decorators like @gupichon suggested? One which registers into the schema registry and one which adds the validation to the constructor? That feels easier to understand to me than one which is doing both things. |
I'm not sure I understand the purpose of this. What is |
I'm sure to see how you want to implement it but overriding the constructor creates the issue I mentioned above + the potential lost of completion in IPython. The validation can be handled in the
It allows to write: mc1 = MyClass1(param1="test1",param2=10)
mc2 = MyClass2(param1="test2",param2=mc1)
print(mc1.param1.get())
mc1.param1.set("new value")
print(mc1.param1.get()) |
|
Hi @TeresiaOlsson and @JeanLucPons, from pydantic import create_model, ConfigDict
from functools import wraps
import inspect
import json
import yaml
def register_schema(
cls=None,
*,
override_constructor=True,
custom_validation_method=None,
):
def decorate(target_cls):
# Inspect constructor signature and create dynamic schemas
# for both validation and json schema export
sig = inspect.signature(target_cls.__init__)
fields = {}
sch_fields = {}
for param_name, param in sig.parameters.items():
# Jump self
if param_name == "self":
continue
# Default value,
if param.default != inspect.Parameter.empty:
default = param.default
else:
default = ... # No default value specified -> Required field
# Type
annotation = (
param.annotation if param.annotation != inspect.Parameter.empty else str
)
fields[param_name] = (annotation, default)
# If another class has already been registered, use its schema model
# for config validation and JSON Schema export.
schema_annotation = getattr(annotation, "_schemamodel", annotation)
sch_fields[param_name] = (schema_annotation, default)
# Create dynamic model for the class
model_name = target_cls.__module__ + "." + target_cls.__name__
target_cls._model = create_model(
model_name, __config__=ConfigDict(arbitrary_types_allowed=True), **fields
)
target_cls._schemamodel = create_model(
model_name + "Schema",
__config__=ConfigDict(arbitrary_types_allowed=True),
**sch_fields,
)
def run_custom_validation(validated_config):
if custom_validation_method is None:
return
custom_validation = getattr(target_cls, custom_validation_method)
custom_validation(validated_config)
def _validate_constructor_config(class_, config):
validated_config = class_._model.model_validate(config)
run_custom_validation(validated_config)
return validated_config
def validate_config(class_, config, *, input_format=None):
if input_format not in (None, "json", "yaml"):
raise ValueError("input_format must be None, 'json', or 'yaml'")
if isinstance(config, str):
if input_format == "json":
validated_config = class_._schemamodel.model_validate_json(config)
elif input_format == "yaml":
validated_config = class_._schemamodel.model_validate(
yaml.safe_load(config)
)
else:
try:
parsed_config = json.loads(config)
except json.JSONDecodeError:
parsed_config = yaml.safe_load(config)
validated_config = class_._schemamodel.model_validate(parsed_config)
else:
validated_config = class_._schemamodel.model_validate(config)
run_custom_validation(validated_config)
return validated_config
target_cls._validate_constructor_config = classmethod(_validate_constructor_config)
target_cls.validate_config = classmethod(validate_config)
if override_constructor:
# Override __init__ to handle validation
original_init = target_cls.__init__
original_sig = inspect.signature(original_init)
@wraps(original_init)
def custom_init(self, *args, **kwargs):
# Validate using dynamic model
bound_args = original_sig.bind_partial(self, *args, **kwargs)
validation_data = {
name: value
for name, value in bound_args.arguments.items()
if name != "self"
}
target_cls._validate_constructor_config(validation_data)
original_init(self, *args, **kwargs)
target_cls.__init__ = custom_init
return target_cls
if cls is None:
return decorate
return decorate(cls)
@register_schema
class MyClass1:
def __init__(self, param1: str, param2: int): ...
@register_schema(override_constructor=False, custom_validation_method="custom_validate")
class MyClass2:
def __init__(self, param1: str, param2: MyClass1): ...
@classmethod
def custom_validate(cls, config):
if config.param1 == config.param2.param1:
raise ValueError("MyClass2.param1 must differ from MyClass1.param1")
@register_schema(override_constructor=False)
class MyClass3:
def __init__(self, param1: str, param2: MyClass2): ...
# register_schema(MyClass1)
# register_schema(MyClass2)
mc1 = MyClass1(param1="test1", param2=10)
mc2 = MyClass2(param1="test2", param2=mc1)
mc3 = MyClass3(param1="test3", param2=mc2)
top_level_json = MyClass2._schemamodel.model_json_schema()
with open("sch.json", "w") as fp:
json.dump(top_level_json, fp)
my_class1_json = """
{
"param1": "from-json",
"param2": 20
}
"""
validated_mc1 = MyClass1.validate_config(my_class1_json, input_format="json")
mc1_from_json = MyClass1(**validated_mc1.model_dump())
my_class2_yaml = """
param1: from-yaml
param2:
param1: nested-from-yaml
param2: 30
"""
validated_mc2 = MyClass2.validate_config(yaml.safe_load(my_class2_yaml))
validated_mc2_data = validated_mc2.model_dump()
mc2_param2 = MyClass1(**validated_mc2_data["param2"])
mc2_from_yaml = MyClass2(
param1=validated_mc2_data["param1"],
param2=mc2_param2,
)
my_class3_yaml = """
param1: top-from-yaml
param2:
param1: nested-myclass2-from-yaml
param2:
param1: nested-myclass1-from-yaml
param2: 40
"""
validated_mc3 = MyClass3.validate_config(my_class3_yaml, input_format="yaml")
validated_mc3_data = validated_mc3.model_dump()
mc3_param2_param2 = MyClass1(**validated_mc3_data["param2"]["param2"])
mc3_param2 = MyClass2(
param1=validated_mc3_data["param2"]["param1"],
param2=mc3_param2_param2,
)
mc3_from_yaml = MyClass3(
param1=validated_mc3_data["param1"],
param2=mc3_param2,
) |
|
I've done a few tests and it's a bit trickier with inheritance... |
My personal opinion is that it will be too difficult for the users to understand. They will forget to use The only way I see it working is by doing something similar to how it is done in class MyClass1:
def __init__(self,param1: str, param2: int):
self._param1 = pyAMLAttribute(param1)
self._param2 = pyAMLAttribute(param2)
@property
def param1(self) -> pyAMLAttribute:
return self._param1and then the But that is a different way to write classes which would affect the whole hardware abstraction layer. I'm not sure it's worth it compared to just doing the validation in a separate step before you initialize an object. |
What you propose is OK for me. |
I think when using |
In that case I would still leave it out of this PR. It already includes too many things and probably needs to be separated into steps somehow to be able to review and merge. One way might be to first implement the parts needed for the schema registry with as few changes as possible to the existing ConfigModels and classes (maybe even do it module by module if possible like you do for the catalog) and then in a second step remove the ConfigModels + add dynamic generation of them. Then in the third step implement the things needed for having the automatic validation when creating objects. What do you think? I think there might be a way to make this PR smaller by using the same approach as for the dynamic generation to turn the ConfigModels into schema classes. Then they can be registered and the SchemaRegistry, SchemaValidator and SchemaGenerator can be reviewed and merged in without having to modify a lot of classes in the same PR. I'm working on that part now. |
Yes. in my first example I create 2 models for that. The idea is to rebuild dynamically exactly what you did manually in your PR to export json format but it cannot handle validation. You can update your PR, get rid off configmodel and build it dynamically but it will be a bit difficult to handle correctly the following part will all registered classes (including compound type). if annotation is MyClass1:
# Arbitrary type example
# Replace by its ConfigModel (for json schema export)
annotation = MyClass1._schemamodel # sorry for the mistake (due to testing) |
I would do it at same time and cut the PR into categories rather than features as we do for hardware abstraction. Otherwise you will face repr issue (and likely others). |
|
I have noticed a major issue here. If you want to inherit from an existing class without adding any attributes, but simply want to change its behavior, it won't work. You would have to change the signature, which is quite annoying for high-level PyAML objects. |
|
Yes I think @TeresiaOlsson already mentioned this issue during the last meeting. |
|
I just removed them because a JSON schema validator can't handle two schemas where everything is the same. In the eyes of the validator it is the same schema. So I thought that were no reason to maintain several classes. And my goal was also eventually to manage to have oneOf instead of anyOf in the schema but so far that has turned out to be difficult. |
|
There is one way to do this by storing the registered schemas, but it's ugly and quite complex... Also, if the schema is the same (inheritance without new attributes), it would be possible to have an optionnal attribute like param1: top-MyClass3
param2:
__class__: MyClass2Special
param1: nested-special
param2:
param1: nested-myclass1
param2: 45Actually, it's a bit like the "type" field we are using today. |
Is oneOf a part of the json pydantic schema ? |
it is, at least, a part of JSON schema standard: |
I think pydantic can give it but only if you use the discriminated unions with a discriminator field but I haven't tried. Mostly pydantic gives anyOf. Since we anyway need a custom JSON schema generator to replace the abstract classes in the schema with their implementation I just added to it to replace anyOf to oneOf. That was easy. The problem occurred when I tried to use it in the Metaconfigurator because it crashed and became weird when using oneOf instead of anyOf for some of the subschemas. However, I'm still trying to debug it because I'm not sure that the problem is with the schema. It could equally well be a bug on their side with nested models. In the generator I also added a Literal with the string of each class which I think should make the schemas unique and fulfill oneOf. But that had the same bugs. I wanted oneOf to work because it renders much nicer in the GUI. Instead of being allowed to click several options, the user is only allowed to click one. For example, it is confusing for the user why they can choose something to be both a Quadrupole and a Sextupole at the same time. That only makes sense to someone who has looked into the details of JSON schema validation and know the difference between anyOf and oneOf. |
This PR adds a schema registry that can be used for validation and generating json schemas for dynamic nested pydantic models.
Motivation:
Better separation of concerns. Validation of the configuration is separated into different classes than the ones responsible for storing and using the configuration, resulting in a separate validation layer.
It is possible to make validation option. The entire validation layer can be skipped if the user wished. For example, if they know that the input data has already been validated once and not changed.
More lose dependency of pydantic. Pydantic is only used at the edge of the core and not everywhere in line with what has been discussed for how to manage the dependency of pint.
Solving two problems with pydantic:
Features:
A schema registry which maps a class path to the schema that should be used for validation of the input data to the class.
A decorator to automatically register schemas in the schema registry.
A schema validator which validates nested dynamic models with the use of the registry.
A custom json schema generator which generates json schemas including all available subclasses with the use of the registry.
Major changes:
All ConfigModels have been separate from the domain classes and turned into schema classes which only has the purpose to define the schema used for validation. This is a major refactoring since the ConfigModels are spread out in the codebase. A temporary legacy handler had to be implemented to handle the packages for the bindings. All domain classes also had to be changed since they now need to explicitly declare their attributes.
A baseclass
ConfigurationSchemahas been added that all schemas must inherit from. This is to ensure that all schemas registered in the registry has the minimum required fields and same behaviour.The yaml file format has been changed to instead of defining the type by the module it is defined by the class. This is to be compatible with external packages where a certain module structure can't be enforced and to allow several similar classes to be grouped in the same module. A temporary function to translate between yaml file formats have been added.
Example usage:
Examples of the usage is available in https://github.com/python-accelerator-middle-layer/pyaml/tree/schema-registry/examples/validation.
The json schema can be visualized and tested with MetaConfigurator to easier understand what it includes. You can import the schema directly from here: https://github.com/python-accelerator-middle-layer/pyaml/blob/schema-registry/pyaml/validation/schema.json