Schema Inference in Python

Tags: Data Quality, Python, Schema Analysis, Data Discovery, Pydantic, Data Validation, datamodel-code-generator, JSON
Published: September 22, 2025

This post is about schema inference in Python. I’m focusing on dict and JSON data, both of which are ubiquitous.

Schema inference is useful for understanding the structure and content of new data sources. The inferred schema can then serve as a first draft for a validation model later on.
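To make the idea concrete, here is a minimal standard-library sketch of what "inferring a schema" means for nested dict data. The function name `infer_schema` and the sample `record` are made up for illustration, and this is not how datamodel-code-generator works internally; it only maps each value to the name of its Python type.

```python
def infer_schema(value):
    """Return a nested description of the Python types found in value."""
    if isinstance(value, dict):
        # Recurse into nested dictionaries, keeping the keys
        return {key: infer_schema(item) for key, item in value.items()}
    if isinstance(value, list):
        # Keep each distinct element shape once, in order of first appearance
        shapes = []
        for item in value:
            shape = infer_schema(item)
            if shape not in shapes:
                shapes.append(shape)
        return shapes
    # Leaf value: record the type name only
    return type(value).__name__


# Hypothetical sample record, not taken from the post's data
record = {
    "name": "Ada",
    "age": 36,
    "scores": [1.5, 2.0, 3],
    "address": {"city": "London", "zip": None},
}

schema = infer_schema(record)
print(schema)
```

A real inference tool goes further than this, for example by merging several samples and emitting model classes, but the recursive walk over keys and values is the core idea.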

I’ll use Faker (see my last post) to create a mock dictionary with Faker.pydict:

sample_dict = {'wind': 6.50703274499078,
 'interesting': 34.6591059348561,
 'to': 690698.415313144,
 'space': {'ten': 'NoVHnFEHRdQDnxsnwHRL',
  'morning': 9547,
  'him': 'https://vargas.net/blog/blog/postsabout.php'},
 'against': 419}

I’ll then use datamodel-code-generator, which you can install with pip.

It’s often used as a CLI but can also be called from Python code. This example writes a Python file containing a Pydantic v1 model (the default output) of the sample dictionary.

from datamodel_code_generator import InputFileType, generate
import pathlib

generate(
    sample_dict,
    input_file_type=InputFileType.Dict,
    output=pathlib.Path('/tmp/sample_dict_model.py')
)

generate doesn’t return a string because it could potentially produce multiple .py files, so the output is written to disk and read back. The datamodel-code-generator documentation has more detail on this.

with open('/tmp/sample_dict_model.py', 'r') as fh:
    dict_model_str = fh.read()

print(dict_model_str)
# generated by datamodel-codegen:
#   filename:  <dict>
#   timestamp: 2025-09-22T08:11:25+00:00

from __future__ import annotations

from pydantic import BaseModel


class Space(BaseModel):
    ten: str
    morning: int
    him: str


class Model(BaseModel):
    wind: float
    interesting: float
    to: float
    space: Space
    against: int

Besides Python dictionaries and JSON, datamodel-code-generator accepts several other input types. See the project documentation for the full list.

Dictionaries are converted to JSON before the model is generated. If the data is already serialized as JSON, the generator can read it directly instead.

In this example I’ll convert the dict to a JSON string.

import json

sample_json = json.dumps(sample_dict)
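Because a JSON step is involved, it helps to know what a JSON round trip does to Python values. The sketch below uses only the standard json module with invented sample values; whether datamodel-code-generator serializes dictionaries in exactly this way is an assumption, so treat it as an illustration of JSON's limits rather than of the library's internals.

```python
import json
from datetime import date

# Invented values: a tuple, an integer key, and a float
original = {"point": (1, 2), 1: "integer key", "ratio": 0.5}

# Serialize and parse again to see what survives the trip
round_tripped = json.loads(json.dumps(original))
print(round_tripped)  # the tuple comes back as a list, the int key as a str

error_message = ""
try:
    # Types without a JSON equivalent are rejected by the default encoder
    json.dumps({"created": date(2025, 9, 22)})
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

A schema inferred from JSON can therefore only contain JSON-compatible types such as strings, numbers, booleans, lists, and nested objects.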

On my system, Pydantic v2 is installed:

import pydantic

pydantic.version.VERSION
'2.11.9'

In the JSON example, I’m passing output_model_type=DataModelType.PydanticV2BaseModel to request a Pydantic v2 output model:

from datamodel_code_generator import DataModelType

generate(
    sample_json,
    input_file_type=InputFileType.Json,
    output=pathlib.Path('/tmp/sample_json_model.py'),
    output_model_type=DataModelType.PydanticV2BaseModel
)

with open('/tmp/sample_json_model.py', 'r') as fh:
    json_model_str = fh.read()

print(json_model_str)
# generated by datamodel-codegen:
#   filename:  <stdin>
#   timestamp: 2025-09-22T08:11:25+00:00

from __future__ import annotations

from pydantic import BaseModel


class Space(BaseModel):
    ten: str
    morning: int
    him: str


class Model(BaseModel):
    wind: float
    interesting: float
    to: float
    space: Space
    against: int

The .py file can be copied and modified to serve as the basis for a validation model.

It can also be used on the fly by importing directly from the file:

import importlib.util  # the util submodule must be imported explicitly
import sys

spec = importlib.util.spec_from_file_location('pydantic_model', '/tmp/sample_dict_model.py')
pydantic_model = importlib.util.module_from_spec(spec)
sys.modules['pydantic_model'] = pydantic_model
spec.loader.exec_module(pydantic_model)
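The same steps can be wrapped in a reusable function when several generated files need loading. A minimal sketch follows: `load_module_from_path` is a hypothetical helper name, and the demo module written to a temporary directory is a stand-in for a generated model file. It also checks for a missing spec or loader, which the standard-library calls can return as None.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_module_from_path(module_name: str, file_path: Path):
    """Hypothetical helper: import a .py file by path and return the module."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from {file_path}")
    module = importlib.util.module_from_spec(spec)
    # Register the module before executing it, as in the code above
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


# Stand-in for a generated file: a tiny module with one constant
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_module.py"
    demo_path.write_text("ANSWER = 42\n")
    demo_module = load_module_from_path("demo_module", demo_path)

print(demo_module.ANSWER)
```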

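Having imported the module, I can then use the model to validate the original sample. The file was generated as a Pydantic v1 model, but for a model this simple the generated code is the same as the v2 output above, so model_validate works with Pydantic v2 installed: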

pydantic_model.Model.model_validate(sample_dict)
Model(wind=6.50703274499078, interesting=34.6591059348561, to=690698.415313144, space=Space(ten='NoVHnFEHRdQDnxsnwHRL', morning=9547, him='https://vargas.net/blog/blog/postsabout.php'), against=419)

And just to see what happens with invalid data:

try:
    pydantic_model.Model.model_validate({'wrong': 'dict'})
except Exception as e:
    print(type(e).__name__)
ValidationError
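The exception carries more than its name. As a sketch, assuming Pydantic v2 (where missing fields are reported with the error type "missing"), the model classes from the generated output can be defined inline and the error details inspected to see which fields failed:

```python
from pydantic import BaseModel, ValidationError


# Same classes as in the generated output, repeated to keep this self-contained
class Space(BaseModel):
    ten: str
    morning: int
    him: str


class Model(BaseModel):
    wind: float
    interesting: float
    to: float
    space: Space
    against: int


missing_fields = []
try:
    Model.model_validate({"wrong": "dict"})
except ValidationError as exc:
    # Each error is a dict; "loc" is a tuple giving the path to the failing field
    missing_fields = [
        err["loc"][0] for err in exc.errors() if err["type"] == "missing"
    ]

print(missing_fields)
```

Catching ValidationError specifically, rather than Exception, keeps unrelated failures from being swallowed. Note that the unexpected key 'wrong' is not reported, because Pydantic ignores extra fields by default.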

datamodel-code-generator and Pydantic take only a few lines of code to get started with, yet both are complex and powerful tools. Together they are a massive help with discovery and validation.


Banner image by Freepik